{"title": "Learning long-range spatial dependencies with horizontal gated recurrent units", "book": "Advances in Neural Information Processing Systems", "page_first": 152, "page_last": 164, "abstract": "Progress in deep learning has spawned great successes in many engineering applications. As a prime example, convolutional neural networks, a type of feedforward neural networks, are now approaching -- and sometimes even surpassing -- human accuracy on a variety of visual recognition tasks. Here, however, we show that these neural networks and their recent extensions struggle in recognition tasks where co-dependent visual features must be detected over long spatial ranges. We introduce a visual challenge, Pathfinder, and describe a novel recurrent neural network architecture called the horizontal gated recurrent unit (hGRU) to learn intrinsic horizontal connections -- both within and across feature columns. We demonstrate that a single hGRU layer matches or outperforms all tested feedforward hierarchical baselines including state-of-the-art architectures with orders of magnitude more parameters.", "full_text": "Learning long-range spatial dependencies\n\nwith horizontal gated recurrent units\n\nDrew Linsley, Junkyung Kim, Vijay Veerabadran, Charlie Windolf, Thomas Serre\n\nCarney Institute for Brain Science\n\nDepartment of Cognitive Linguistic & Psychological Sciences\n\n{drew_linsley,junkyung_kim,vijay_veerabadran,thomas_serre}@brown.edu\n\nBrown University\n\nProvidence, RI 02912\n\nAbstract\n\nProgress in deep learning has spawned great successes in many engineering appli-\ncations. As a prime example, convolutional neural networks, a type of feedforward\nneural networks, are now approaching \u2013 and sometimes even surpassing \u2013 human\naccuracy on a variety of visual recognition tasks. 
Here, however, we show that\nthese neural networks and their recent extensions struggle in recognition tasks\nwhere co-dependent visual features must be detected over long spatial ranges. We\nintroduce a visual challenge, Path\ufb01nder, and describe a novel recurrent neural\nnetwork architecture called the horizontal gated recurrent unit (hGRU) to learn\nintrinsic horizontal connections \u2013 both within and across feature columns. We\ndemonstrate that a single hGRU layer matches or outperforms all tested feedfor-\nward hierarchical baselines including state-of-the-art architectures with orders of\nmagnitude more parameters.\n\n1 Introduction\n\nConsider Fig. 1a, which shows a sample image from a representative segmentation dataset [1]\n(left) and the corresponding contour map produced by a state-of-the-art deep convolutional neural\nnetwork (CNN) [2] (right). Although this task has long been considered challenging because of the\nneed to integrate global contextual information with inherently ambiguous local edge information,\nmodern CNNs are capable of detecting contours in natural scenes at a level that rivals that of human\nobservers [2\u20136]. Now, consider Fig. 1b, which depicts a variant of a visual psychology task referred\nto as \u201cPath\ufb01nder\u201d [7]. Reminiscent of the everyday task of reading a subway map to plan a commute\n(Fig. 1c), the goal in Path\ufb01nder is to determine if two white circles in an image are connected by a\npath. These images are visually simple compared to natural images like the one shown in Fig. 1a,\nand the task is indeed easy for human observers [7]. Nonetheless, we will demonstrate that modern\nCNNs struggle to solve this task.\nWhy is it that a CNN can accurately detect contours in a natural scene like Fig. 1a but struggle\nto integrate paths in the stimuli shown in Fig. 1b? 
In principle, the ability of CNNs to learn such\nlong-range spatial dependencies is limited by their localized receptive \ufb01elds (RFs) \u2013 hence the need\nto consider deeper networks because they allow the buildup of larger and more complex RFs. Here,\nwe use a large-scale analysis of CNN performance on the Path\ufb01nder challenge to demonstrate that\nsimply increasing depth in feedforward networks constitutes an inef\ufb01cient solution to learning the\nlong-range spatial dependencies needed to solve the Path\ufb01nder challenge.\nAn alternative solution to problems that stress long-range spatial dependencies is provided by biology.\nThe visual cortex contains abundant horizontal connections which mediate non-linear interactions\nbetween neurons across distal regions of the visual \ufb01eld [8, 9].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: State-of-the-art CNNs excel at detecting contours in natural scenes, but they are strained\nby a task that requires the detection of long-range spatial dependencies. (a) Representative contour\ndetection performance of a leading neural network architecture [23]. (b) Exemplars from the\nPath\ufb01nder challenge: a task consisting of synthetic images which are parametrically controlled for\nlong-range dependencies. (c) Long-range dependencies similar to those in the Path\ufb01nder challenge\nare critical for everyday behaviors, such as reading a subway map to navigate a city.\n\nThese intrinsic connections, popularly called \u201cassociation \ufb01elds\u201d, are thought to form the main substrate for mechanisms of contour grouping\naccording to Gestalt principles, by mutually exciting colinear elements while also suppressing\nclutter elements that do not form extended contours [10\u201315]. 
Such \u201cextra-classical receptive \ufb01eld\u201d\nmechanisms, mediated by horizontal connections, allow receptive \ufb01elds to adaptively \u201cgrow\u201d without\nadditional processing depth. Building on previous computational neuroscience work [e.g., 10, 16\u201319],\nour group has recently developed a recurrent network model of classical and extra-classical receptive\n\ufb01elds that is constrained by the anatomy and physiology of the visual cortex [20]. The model was\nshown to account for diverse visual illusions, providing computational evidence for a novel canonical\ncircuit that is shared across visual modalities.\nHere, we show how this computational neuroscience model can be turned into a modern end-to-\nend trainable neural network module. We describe an extension of the popular gated recurrent unit\n(GRU) [21], which we call the horizontal GRU (hGRU). Unlike CNNs, which exhibit a sharp decrease\nin accuracy for increasingly long paths, we show that the hGRU is highly effective at solving the\nPath\ufb01nder challenge with just one layer and a fraction of the number of parameters and training\nsamples needed by CNNs. We further \ufb01nd that, when trained on natural scenes, the hGRU learns\nconnection patterns that coarsely resemble anatomical patterns of horizontal connectivity found in\nthe visual cortex, and exhibits a detection pro\ufb01le that strongly correlates with human behavior on a\nclassic contour detection task [22].\n\nRelated work Much previous work on recurrent neural networks (RNNs) has focused on modeling\nsequences with learnable gates in the form of long short-term memory (LSTM) units [24] or gated\nrecurrent units (GRUs) [21]. RNNs have also been extended to learning spatial dependencies\nin static images with broad applications [25\u201329]. In this approach, images are transformed into\none-dimensional sequences that are used to train an RNN. 
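To make this image-to-sequence strategy concrete, here is a minimal NumPy sketch; the raster-scan ordering and toy shapes are illustrative choices of ours, not details taken from the cited works:

```python
import numpy as np

# One common way to apply a sequence RNN to a static image: raster-scan it
# so that each pixel row becomes one timestep of the input sequence, and
# long-range context is carried by the recurrence rather than kernel size.
image = np.arange(20.0).reshape(4, 5)        # toy 4x5 grayscale "image"
sequence = [image[row] for row in range(image.shape[0])]  # 4 steps of 5 features

# A pixel-level alternative: 20 timesteps of 1 feature each.
pixel_sequence = image.reshape(-1, 1)
```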
In recent years, several approaches\nhave introduced convolutions into RNNs, using the recursive application of convolutional \ufb01lters\nas a method for increasing the depth of processing through time on tasks like object recognition\nand super-resolution without additional parameters [30\u201332]. Other groups have constrained these\nconvolutional-RNNs with insights from neuroscience and cognitive science, engineering speci\ufb01c\npatterns of connectivity between processing layers [33\u201336]. The proposed hGRU builds on this line of\nbiologically-inspired implementations of RNNs, adding connectivity patterns and circuit mechanisms\nthat are typically found in computational neuroscience models of neural circuits [e.g., 10, 16\u201320].\nAnother class of models related to our proposed approach is Conditional Random Fields (CRFs),\nprobabilistic models aimed at explicitly capturing associations between nearby features. The con-\nnectivity implemented in CRFs is similar to the horizontal connections used in the hGRU, and has\nbeen successfully applied as a post-processing stage in visual tasks such as segmentation [37, 38]\nto smooth out and increase the spatial resolution of prediction maps. Recently, such probabilistic\nmethods have been successfully incorporated in a generative vision model shown to break text-based\nCAPTCHAs [39]. Originally formulated as probabilistic models, CRFs can also be cast as RNNs [40].\n\n2 Horizontal gated recurrent units (hGRUs)\n\nOriginal contextual neural circuit model We begin by referencing the recurrent neural model of\ncontextual interactions developed by M\u00e9ly et al. [20]. Below, we adapt the model notation for a\ncomputer vision audience. Model units are indexed by their 2D positions (x, y) and feature channel\nk. Neural activity is governed by the following differential equations (see Supp. 
Material for the full\ntreatment):\n\n\u03b7 \u02d9H(1)xyk + \u03b52H(1)xyk = [\u03beXxyk \u2212 (\u03b1H(1)xyk + \u00b5)C(1)xyk]+\n\u03c4 \u02d9H(2)xyk + \u03c32H(2)xyk = [\u03b3C(2)xyk]+\n\n(1)\n\nwhere\n\nC(1)xyk = (WI \u2217 H(2))xyk\nC(2)xyk = (WE \u2217 H(1))xyk.\n\nHere, X \u2208 RW\u00d7H\u00d7K is the feedforward drive (i.e., neural responses to a stimulus), H(1) \u2208 RW\u00d7H\u00d7K\nis the recurrent circuit input, and H(2) \u2208 RW\u00d7H\u00d7K the recurrent circuit output. Modeling input and\noutput states separately allows for the implementation of a particular form of inhibition known as\n\u201cshunting\u201d (or divisive) inhibition. Unlike the excitation in the model, which acts linearly on a unit\u2019s\ninput, inhibition acts on a unit\u2019s output and hence regulates the unit response non-linearly (i.e., given\na \ufb01xed amount of inhibition and excitation, inhibition will increase with the unit\u2019s activity, unlike\nexcitation, which will remain constant).\nThe convolutional kernels WI , WE \u2208 RS\u00d7S\u00d7K\u00d7K describe inhibitory vs. excitatory hypercolumn\nconnectivity (constrained by anatomical data1). The scalar parameters \u00b5 and \u03b1 control linear and\nquadratic (i.e., shunting) inhibition by C(1) \u2208 RW\u00d7H\u00d7K, \u03b3 scales excitation by C(2) \u2208 RW\u00d7H\u00d7K, and\n\u03be scales the feedforward drive. Activity at each stage is linearly recti\ufb01ed (ReLU): [\u00b7]+ = max(\u00b7, 0).\nFinally, \u03b7, \u03b5, \u03c4 and \u03c3 are time constants. To make this model amenable to modern computer vision\napplications, we set out to develop a version where all parameters could be trained from data. If we\nlet \u03b7 = \u03c4 and \u03c3 = \u03b5 for symmetry and apply Euler\u2019s method to Eq. 
1 with a time step of \u2206t = \u03b7/\u03b52,\nthen we obtain the discrete-time equations:\n\nH(1)xyk[t] = \u03b5\u22122[\u03beXxyk \u2212 (\u03b1H(1)xyk[t \u2212 1] + \u00b5)C(1)xyk[t]]+\nH(2)xyk[t] = \u03b5\u22122[\u03b3C(2)xyk[t]]+\n\n(2)\n\nHere, \u00b7[t] denotes the approximation at the t-th discrete timestep. This results in a trainable convo-\nlutional recurrent neural network (RNN) which performs Euler integration of a dynamical system\nsimilar to the neural model of [20].\nhGRU formulation We build on Eq. 2 to introduce the hGRU \u2013 a model with the ability to learn\ncomplex interactions between units via horizontal connections within a single processing layer\n(Fig. 2). The hGRU extends the derivation from Eq. 2 with three modi\ufb01cations that improve the\ntraining of the model with gradient descent and its expressiveness2. (i) We introduce learnable gates,\nborrowed from the gated recurrent unit (GRU) framework (see Supp. Material for the full derivation\nfrom Eq. 2). (ii) The hGRU makes the operations for computing H(2) (excitation) symmetric with\nthose of H(1) (inhibition), providing the circuit the ability to learn how to implement linear and\nquadratic interactions at each of these processing stages. (iii) To control unstable gradients, the hGRU\nuses a squashing pointwise non-linearity and a learned parameter to globally scale activity at every\nprocessing timestep (akin to a constrained version of the recurrent batchnorm [41]).\n\n1There are four separate connectivity patterns in [20] to describe inhibition vs. excitation and near vs. far\ninteractions between units. We combine these into separate inhibitory and excitatory kernels to simplify\nnotation.\n\n2These modi\ufb01cations involved relaxing several constraints from the original neuroscience model that are less\nuseful for solving the tasks investigated here (see Supp. 
Material for performance of an hGRU with constrained\ninhibition and excitation.)\n\nFigure 2: The hGRU circuit. The hGRU can learn highly non-linear interactions between spatially\nneighboring units in the feedforward drive X, which are encoded in its hidden state H(2). This\ncomputation involves two stages, which are inspired by a recurrent neural circuit of horizontal\nconnections [20]. First, the horizontal inhibition (blue) is calculated by applying a gain to H(2)[t \u2212 1],\nand convolving the resulting activity with the kernel W, which characterizes these interactions. Linear\n(+ symbol) and quadratic (\u00d7 symbol) operations control the convergence of this inhibition onto\nX. Second, the horizontal excitation (red) is computed by convolving H(1)[t] with W. Another\nset of linear and quadratic operations modulates this activity, before it is mixed with the persistent\nhidden state H(2)[t \u2212 1]. Note that the excitation computation involves an additional \u201cpeephole\u201d\nconnection, not depicted here. Small solid-line squares within the hypothetical activities that the\ncircuit operates on denote the unit indexed by 2D position (x, y) and feature channel k, whereas\ndotted-line squares depict the unit\u2019s receptive \ufb01eld (a union of both classical and extra-classical\nde\ufb01nitions) in the previous activity.\n\nIn our hGRU implementation, the feedforward drive X corresponds to activity from a preceding\nconvolutional layer. The hGRU encodes spatial dependencies between feedforward units via its\n(time-varying) hidden states H(1) and H(2). Updates to the hidden states are managed using two\nactivities, referred to as the reset and update \u201cgates\u201d: G(1) and G(2). These activities are derived from\nconvolutions, denoted by \u2217, between the kernels U(1), U(2) \u2208 R1\u00d71\u00d7K\u00d7K and hidden states H(1) and\nH(2), shifted by biases b(1), b(2) \u2208 R1\u00d71\u00d7K, respectively. 
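Since U(1) and U(2) are 1\u00d71\u00d7K\u00d7K kernels, each gate reduces to a per-pixel linear map across channels followed by a pointwise non-linearity. A minimal NumPy sketch of such a gate computation (toy shapes and names are ours, not the authors' implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(h, u, b):
    """Gate activity from a hidden state.

    h : (H, W, K) activity map
    u : (K, K) matrix standing in for a 1x1xKxK convolution kernel
    b : (K,) channel-wise bias
    A 1x1 convolution is a per-pixel linear map across channels, so it
    reduces to a matrix product along the channel axis.
    """
    return sigmoid(h @ u + b)

rng = np.random.default_rng(0)
h = rng.random((4, 4, 3))          # toy 4x4 map with K = 3 channels
u = rng.standard_normal((3, 3))
b = np.zeros(3)
g = gate(h, u, b)                  # one gate activity, squashed into (0, 1)
```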
The pointwise non-linearity \u03c3 is applied\nto each activity, normalizing it to the range [0, 1]. Because these activities are real-valued, we\nhereafter refer to the reset gate as the \u201cgain\u201d, and the update gate as the \u201cmix\u201d.\nHorizontal interactions between units are calculated by the kernel W \u2208 RS\u00d7S\u00d7K\u00d7K, where S describes\nthe spatial extent of these connections in a single timestep (Fig. 2; but see Supp. Material for a version\nwith separate kernels for excitation vs. inhibition, as in Eq. 2). Consistent with computational models\nof neural circuits (e.g., [10, 16\u201320]), W is constrained to have symmetric weights between channels,\nsuch that the weight Wx0+\u2206x,y0+\u2206y,k1,k2 is equal to the weight Wx0+\u2206x,y0+\u2206y,k2,k1, where x0 and\ny0 denote the center of the kernel. This constraint reduces the number of learnable parameters by\nnearly half vs. a normal convolutional kernel. Hidden states H(1) and H(2) are recomputed via\nhorizontal interactions at every timestep t \u2208 [0, T ]. We begin by describing the computation of H(1)[t]:\n\nG(1)[t] = \u03c3(U(1) \u2217 H(2)[t \u2212 1] + b(1))\n\n(3)\n\nC(1)xyk[t] = (W \u2217 (G(1)[t] \u2299 H(2)[t \u2212 1]))xyk\n\n(4)\n\nH(1)xyk[t] = \u03b6(Xxyk \u2212 C(1)xyk[t](\u03b1kH(2)xyk[t \u2212 1] + \u00b5k))\n\n(5)\n\nChannels in H(2)[t \u2212 1] are \ufb01rst modulated by the gain3 G(1)[t]. The resulting activity is convolved\nwith W to compute C(1)[t], which is the horizontal inhibition of the hGRU at this timestep. This inhi-\nbition is applied to X via the parameters \u00b5 and \u03b1, which are K-dimensional vectors that respectively\nscale linear and quadratic (akin to the shunting inhibition described in Eq. 1) terms of the horizontal\ninteraction with X. 
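A minimal NumPy sketch of this inhibition stage (cf. Eqs. 3-5), under the simplifying assumptions that the pointwise \u03b6 is a hyperbolic tangent and that the 1\u00d71 gain convolution is expressed as a channel-wise matrix product; all shapes and names are illustrative, not the authors' code:

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 'same'-padding convolution of an (H, W, K) map with an
    (S, S, K, K) kernel (cross-correlation, as in deep-learning libraries)."""
    S = w.shape[0]
    p = S // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.einsum('abk,abkl->l', xp[i:i + S, j:j + S], w)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def inhibition_stage(x, h2_prev, w, u1, b1, alpha, mu):
    """One pass of the inhibition stage, as we read Eqs. 3-5 (toy sketch)."""
    g1 = sigmoid(h2_prev @ u1 + b1)                   # gain (Eq. 3)
    c1 = conv2d_same(g1 * h2_prev, w)                 # horizontal inhibition (Eq. 4)
    return np.tanh(x - c1 * (alpha * h2_prev + mu))   # inhibited state (Eq. 5)

rng = np.random.default_rng(0)
H, W_, K, S = 6, 6, 2, 3
w = rng.standard_normal((S, S, K, K))
w = 0.5 * (w + w.transpose(0, 1, 3, 2))   # channel-symmetric constraint on W
x = rng.random((H, W_, K))                # toy feedforward drive
h2_prev = np.zeros((H, W_, K))            # H(2)[0] initialized at zero
h1 = inhibition_stage(x, h2_prev, w, rng.standard_normal((K, K)),
                      np.zeros(K), np.ones(K), np.ones(K))
```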
The pointwise \u03b6 is a hyperbolic tangent that squashes activity into the range\n[\u22121, 1] (but see Supp. Material for an hGRU with a recti\ufb01ed linearity). Importantly, in contrast to the\noriginal circuit, in this formulation the update to H(1)[t] (Eq. 5) is calculated by combining horizontal\nconnection contributions of C(1)[t] with H(2)[t \u2212 1] rather than H(1)[t \u2212 1], which we found improved\nlearning on the visual tasks explored here.\nThe updated H(1)[t] is next used to calculate H(2)[t]:\n\nG(2)xyk[t] = \u03c3((U(2) \u2217 H(1)[t])xyk + b(2)k)\n\n(6)\n\nC(2)xyk[t] = (W \u2217 H(1)[t])xyk\n\n(7)\n\n\u02dcH(2)xyk[t] = \u03b6(\u03bakH(1)xyk[t] + \u03b2kC(2)xyk[t] + \u03c9kH(1)xyk[t]C(2)xyk[t])\n\n(8)\n\nH(2)xyk[t] = \u03b7t(H(2)xyk[t \u2212 1](1 \u2212 G(2)xyk[t]) + \u02dcH(2)xyk[t]G(2)xyk[t])\n\n(9)\n\nThe mix G(2)[t] is calculated by convolving H(1)[t] with U(2), followed by the addition of b(2).\nThe activity C(2)[t] represents the excitation of horizontal connections onto the newly-computed\nH(1)[t]. Linear and quadratic contributions of horizontal interactions at this stage are controlled by\nthe K-dimensional parameters \u03ba, \u03c9, and \u03b2. The parameters \u03ba and \u03c9 control the linear and quadratic\ncontributions of horizontal connections to \u02dcH(2)[t]. The parameter \u03b2 is a gain applied to C(2)[t], giving\nW an additional degree of freedom in expressing this excitation. With this full suite of interactions,\nthe hGRU can in principle implement both a linear and a quadratic form of excitation (i.e., to assess\nself-similarity), each of which plays a speci\ufb01c computational role in perception [42]. Note that the\ninclusion of H(1)[t] in Eq. 8 functions as a \u201cpeephole\u201d connection between it and \u02dcH(2)[t]. Finally, the\nmix G(2) integrates the candidate \u02dcH(2)[t] with H(2)[t \u2212 1]. 
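The excitation and mix stage (cf. Eqs. 6-9) can be sketched in the same spirit; again a toy NumPy illustration under our own simplifying assumptions (\u03b6 as tanh, 1\u00d71 convolution as a channel-wise matrix product), not the authors' code:

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 'same'-padding convolution of an (H, W, K) map with an
    (S, S, K, K) kernel."""
    S = w.shape[0]
    p = S // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.einsum('abk,abkl->l', xp[i:i + S, j:j + S], w)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def excitation_stage(h1, h2_prev, w, u2, b2, kappa, omega, beta, eta_t):
    """One pass of the excitation/mix stage, as we read Eqs. 6-9 (toy sketch)."""
    g2 = sigmoid(h1 @ u2 + b2)                  # mix, via a 1x1 "conv" (Eq. 6)
    c2 = conv2d_same(h1, w)                     # horizontal excitation (Eq. 7)
    h2_cand = np.tanh(kappa * h1 + beta * c2 + omega * h1 * c2)   # candidate (Eq. 8)
    return eta_t * (h2_prev * (1.0 - g2) + h2_cand * g2)          # time-gained mix (Eq. 9)

rng = np.random.default_rng(1)
H, W_, K, S = 6, 6, 2, 3
h1 = np.tanh(rng.standard_normal((H, W_, K)))       # freshly computed H(1)[t]
h2_prev = np.tanh(rng.standard_normal((H, W_, K)))  # previous output H(2)[t-1]
h2 = excitation_stage(h1, h2_prev,
                      rng.standard_normal((S, S, K, K)),
                      rng.standard_normal((K, K)), np.zeros(K),
                      np.ones(K), np.ones(K), np.ones(K), 1.0)
```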
The learnable T-dimensional parameter \u03b7, which\nwe refer to as a time-gain, helps control unstable gradients during training. This time-gain modulates\nH(2)[t] with the scalar \u03b7t, which as we show in our experiments below improves model performance.\n\n3 The Path\ufb01nder challenge\n\nWe evaluated the limits of feedforward and recurrent architectures on the \u201cPath\ufb01nder challenge\u201d, a\nsynthetic visual task inspired by cognitive psychology [7]. The task, depicted in Fig. 1b, involves\ndetecting whether two circles are connected by a path. This is made more dif\ufb01cult by allowing\ntarget paths to curve and by introducing multiple shorter unconnected \u201cdistractor\u201d paths. The Path\ufb01nder\nchallenge involves three separate datasets, in which the length of paths and distractors is parametri-\ncally increased. This challenge therefore screens models for their effectiveness in detecting complex\nlong-range spatial relationships in cluttered scenes.\n\nStimulus design Path\ufb01nder images were generated by placing oriented \u201cpaddles\u201d on a canvas to\nform dashed paths. Each image contained two paths made of a \ufb01xed number of paddles and multiple\ndistractors made of one third as many paddles. Positive examples were generated by placing two\ncircles at the ends of a single path (Fig. 1b, left) and negative examples by placing one circle at\nthe end of each of the paths (Fig. 1b, right). The paths were curved and variably shaped, with the\npossible number of shapes exponential in the path length. The Path\ufb01nder challenge consisted of three\ndatasets, in which path and distractor length was successively increased, and with them, the overall\ntask dif\ufb01culty. These datasets had path lengths of 6, 9 and 14 paddles, and each contained 1,000,000\nunique images of 150\u00d7150 pixels. See Supp. 
Material for a detailed description of the stimulus\ngeneration procedure.\n\n3GRU gate activities are often a function of a hidden state and X[t]. Because the feedforward drive here is\nconstant w.r.t. time, we omit it from these calculations. In practice, its inclusion did not affect performance.\n\nModel implementation We performed a large-scale analysis of the effectiveness of feedforward\nand recurrent computations on the Path\ufb01nder challenge. We controlled for the effects of idiosyncratic\nmodel speci\ufb01cations by using a standard architecture, consisting of \u201cinput\u201d, \u201cfeature extraction\u201d,\nand \u201creadout\u201d processing stages. Swapping different feedforward or recurrent layers into the feature\nextraction stage let us measure the relative effectiveness of each on the challenge. All models except\nfor state-of-the-art \u201cresidual networks\u201d (ResNets) [43] and per-pixel prediction architectures were\nembedded in this architecture, and these exceptions are detailed below. See Supp. Material for a\ndetailed description of the input and readout stage. Models were trained on each Path\ufb01nder challenge\ndataset (Fig. 3d), with 90% of the images used for training (900,000) and the remainder for testing\n(100,000). We measured model performance in two ways. First, as the accuracy on test images.\nSecond, as the \u201carea under the learning curve\u201d (ALC), or mean accuracy on the test set evaluated\nafter every 1000 batches of training, which summarized the rate at which a model learned the task.\nAccuracy and ALC were taken from the model that achieved the highest accuracy across 5 separate\nruns of model training. All models were trained for two epochs except for the ResNets, which were\ntrained for four. Model training procedures are detailed in Supp. 
Material.\n\nRecurrent models We tested 6 different recurrent layers in the feature extraction stage of the\nstandard architecture: hGRUs with 8, 6, and 4 timesteps of processing; a GRU; and hGRUs with\nlesions applied to parameters controlling linear or quadratic horizontal interactions. Both the GRU\nand lesioned versions of the hGRU ran for 8 timesteps. These layers had 15\u00d715 horizontal connection\nkernels (W ) with the same number of channels as their input layer (25 channels).\nWe observed 3 overarching trends: First, each model\u2019s performance monotonically decreased, or\n\u201cstrained\u201d, as path length increased. Increasing path length reduced model accuracy (Fig. 3a), and\nincreased the number of batches it took to learn a task (Fig. 3b). Second, the 8-timestep hGRU was\nmore effective than any other recurrent model, and it outperformed each of its lesioned variants\nas well as a standard GRU. Notably, this hGRU was strained the least by the Path\ufb01nder challenge\nout of all tested models, with a negligible drop in accuracy as path length increased. This \ufb01nding\nhighlights the effectiveness of the hGRU for processing long-range spatial dependencies, and the\nimportance of the dynamics implemented by its linear and quadratic horizontal interactions. Third,\nhGRU performance monotonically decreased as processing time was reduced. This revealed a minimum number of\ntimesteps that the hGRU needed to solve each Path\ufb01nder dataset: 4 for the length-6 condition, 6 for\nthe length-9 condition, and 8 for the length-14 condition (\ufb01rst vs. second columns in Fig. 3a). Such\ntime-dependency in the Path\ufb01nder task is consistent with the accuracy-reaction-time tradeoff found\nin humans as the distance between endpoints of a curve increases [7].\n\nFeedforward models We screened an array of feedforward models on the Path\ufb01nder challenge.\nModel performance revealed the importance of kernel size vs. 
model depth, as well as of feedforward operations for incorporating additional scene context, for solving Path\ufb01nder. Model\nconstruction began by embedding the feature extraction stage of the standard model with kernels\nof one of three different sizes: 10\u00d710, 15\u00d715, or 20\u00d720. These are referred to as small, medium,\nand large kernel models (Fig. 3). To control for the effect of network capacity on performance, the\nnumber of kernels given to each model was varied so that the number of parameters in each model\ncon\ufb01guration was matched to the others and to the hGRU (36, 16, and 9 kernels, respectively). We also tested two\nother feedforward models that featured candidate operations for incorporating contextual information\ninto local convolutional activities. One version used (2-pixel) dilated convolutions, which involve\nspreading the kernel taps apart before convolving the input [44, 45], and have been found useful for\nmany computer vision problems [38, 46, 47]. The other version applied a non-local operation to\nconvolutional activities [48], which can introduce (non-recurrent) interactions between units in a\nlayer. These operations were incorporated into the \ufb01rst feature extraction layer of the medium kernel\n(15\u00d715 \ufb01lter) model described above. We also considered deeper versions of each of the above\n\u201c1-layer\u201d models (referring to the depth of the feature extraction stage), stacking them to build 3- and\n5-layer versions. This yielded a total of 15 different feedforward models.\nWithout exception, the performance of each feedforward model was signi\ufb01cantly strained by the\nPath\ufb01nder challenge. The magnitude of this straining was well predicted by model depth and size, and\noperations for incorporating additional contextual information made no discernible difference to the\noverall pattern of results. 
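For intuition about the dilated-convolution variant: a dilated kernel can be viewed as an ordinary kernel with zeros stuffed between its taps, growing the receptive field without adding parameters. A small NumPy sketch (illustrative only, not the tested implementation):

```python
import numpy as np

def dilate_kernel(w, rate):
    """Zero-stuff a square 2D kernel: insert (rate - 1) zeros between taps.

    Convolving with the zero-stuffed kernel is equivalent to a dilated
    convolution: the footprint grows from S to rate * (S - 1) + 1 pixels
    while the number of free parameters stays S * S.
    """
    S = w.shape[0]
    out = np.zeros((rate * (S - 1) + 1,) * 2)
    out[::rate, ::rate] = w
    return out

w = np.ones((3, 3))            # 9 parameters, 3x3 footprint
wd = dilate_kernel(w, 2)       # 2-pixel dilation, as in the models tested here
```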
Figure 3: The hGRU ef\ufb01ciently learns long-range spatial dependencies that otherwise strain feedfor-\nward architectures. (a) Model accuracy is plotted for the three Path\ufb01nder challenge datasets, which\nfeatured paths of 6, 9, and 14 paddles. Each panel depicts the accuracy of a different model class\nafter training on each Path\ufb01nder dataset (see Supp. Material for additional models). Only the\nhGRU and state-of-the-art models for classi\ufb01cation (the two right-most panels) approached perfect\naccuracy on each dataset. (b) Measuring the area under the learning curve (ALC) of each model\n(mean accuracy) demonstrates that the rate of learning achieved by the hGRU across the Path\ufb01nder\nchallenge is only rivaled by the U-Net architecture (far right). (c) The hGRU is signi\ufb01cantly more\nparameter-ef\ufb01cient than feedforward models at the Path\ufb01nder challenge, with its closest competitors\nneeding at least 200\u00d7 the number of parameters to match its performance. The x-axis shows the\nnumber of parameters in each model versus the hGRU, as a multiple of the latter. The y-axis depicts\nmodel accuracy on the 14-length Path\ufb01nder dataset. (d) Path\ufb01nder challenge exemplars of different\npath lengths (all are positive examples).\n\nThe 1-layer models were most effective on the 6-length Path\ufb01nder dataset,\nbut were unable to do better than chance on the remaining conditions. Increasing model capacity to 3\nlayers rescued the performance of all but the small kernel model on the 9-length Path\ufb01nder dataset,\nbut even then did little to improve performance on the 14-length dataset. Of the 5-layer models, only\nthe large kernel con\ufb01guration came close to solving the 14-length dataset. The ALC of this model,\nhowever, demonstrates that its rate of learning was slow, especially compared to the hGRU (Fig. 
3b).\nThe failures of these feedforward models are all the more striking when considering that each had\nbetween 1\u00d7 and 10\u00d7 as many parameters as the hGRU (Fig. 3c, compare the red and green\nmarkers).\n\nResidual networks We reasoned that if the performance of feedforward models on the Path\ufb01nder\nchallenge is a function of model depth, then state-of-the-art networks for object recognition with\nmany times the number of layers should easily solve the challenge. We tested this possibility by\ntraining ResNets with 18, 50, and 152 layers on the Path\ufb01nder challenge. Each model was trained\n\u201cfrom scratch\u201d with standard weight initialization [49], and given additional epochs of training (4)\nto learn the task because of their large number of parameters. However, even with this additional\ntraining, only the deepest 152-layer ResNet was able to solve the challenge (Fig. 3a). Even so, the\n152-layer ResNet was less ef\ufb01cient at learning the 14-length dataset than the hGRU (Fig. 3b), and\nachieved its performance with nearly 1000\u00d7 as many parameters (Fig. 3c; see Supp. Material for\nadditional ResNet experiments).\n\nPer-pixel prediction models We considered the possibility that CNN architectures for per-pixel\nprediction tasks, such as contour detection and segmentation, might be better suited to the Path\ufb01nder\nchallenge than those designed for classi\ufb01cation. 
We therefore tested three representative per-pixel\nprediction models: the fully-convolutional network (FCN), the skip-connection U-Net, and the\nunpooling SegNet. These models used an encoder/decoder-style architecture, which was followed by\nthe readout processing stage of the standard architecture described above to make them suitable for\nPath\ufb01nder classi\ufb01cation. Encoders were based on the VGG16 [50], and each model was trained from scratch\nwith Xavier-initialized weights. Like the ResNets, these models were given 4 epochs of training to\naccommodate their large number of parameters.\nThe fully-convolutional network (FCN) architecture is one of the \ufb01rst successful uses of CNNs for\nper-pixel prediction [3, 44, 51, 52]. Decoders in these models use \u201c1\u00d71\u201d convolutions to combine\nupsampled activity maps from several layers of the encoder. We created an FCN model which applied\nthis procedure to the last layer of each of the 5 VGG16-convolution blocks. These activity maps were\nupsampled by learnable kernels, which were initialized with weights for bilinear interpolation. In\ncontrast to the feedforward models discussed above, the FCN successfully learned all conditions in\nthe Path\ufb01nder challenge (Fig. 3a, purple circle). It did so less ef\ufb01ciently than the hGRU, however,\nwith a lower ALC score on the 14-length dataset and 200\u00d7 as many free parameters (Fig. 3b).\nAnother approach to per-pixel prediction uses \u201cskip connections\u201d to connect speci\ufb01c layers of a\nmodel\u2019s encoder to its decoder. This approach was \ufb01rst described in [44] as a method for more\neffectively merging coarse-layer information into a model\u2019s decoder, and later extended to the U-\nNet [53]. We implemented a version of the U-Net architecture that had a VGG16 encoder and a\ndecoder. 
The decoder consisted of 5 randomly initialized and learned upsampling layers, which had additive connections to the final convolutional layer in each of the encoder's VGG16 blocks. Using standard VGG16 nomenclature to define one of these connections, this meant that "conv4_3" activity from the encoder was added to the second upsampled activity map in the decoder. The U-Net was on par with the hGRU and the FCN at solving the Pathfinder challenge. It was also nearly as efficient as the hGRU in doing so (Fig. 3b), but used over 350× as many parameters as the hGRU (Fig. 3c; see Supp. Materials for additional U-Net experiments).

Unpooling models eliminate the need for feature-map upsampling by routing decoded activities to the locations of the winning max-pooling units recorded in the encoder. Unpooling is also a leading approach for a variety of dense per-pixel prediction tasks, including segmentation, and is exemplified by SegNet [54]. We tested a SegNet on the Pathfinder challenge. This model has a decoder that mirrors its encoder, with unpooling operations replacing its pooling operations. The SegNet achieved high accuracy on each of the Pathfinder datasets, but was less efficient at learning them than the hGRU, with worse ALC scores across the challenge (Fig. 3b). The SegNet also had the second-most parameters of any model tested, 400× more than the hGRU.

4 Explaining biological horizontal connections with the hGRU

Statistical image analysis studies have suggested that cortical patterns of horizontal connections, commonly referred to as "association fields", may reflect the geometric regularities of oriented elements present in natural scenes [55]. Because the hGRU is designed to capture such spatial regularities, we investigated whether it learned patterns of horizontal connections that resemble these association fields.
We visualized the horizontal kernels the hGRU learned to solve these tasks (Fig. S5). When trained on the Pathfinder challenge, hGRU kernels resembled the dominant patterns of horizontal connectivity in visual cortex. Prominent among these patterns are (1) the antagonistic near-excitatory vs. far-inhibitory surround organization also found in the visual cortex [56]; (2) the association field, with collinear excitation and orthogonal inhibition [8, 9]; and (3) other higher-order surround computations [57]. We also visualized these patterns after training the hGRU to detect contours in the naturalistic BSDS500 image dataset [1]. These horizontal kernels took on similar patterns of connectivity, but with far more definition and regularity, suggesting that the hGRU learns best from natural scene statistics.

How well does the hGRU explain human psychophysics data? We tested this by recreating the synthetic contour detection dataset used in [22]. This task had human participants detect a contour formed by collinearly aligned paddles in an array of randomly oriented distractors. Multiple versions of the task were created by varying the distance between paddles in the contour (5 conditions). Contour detection accuracy of the hGRU was recorded on each dataset for comparison with the participants in [22], whose responses (N=2) were digitally extracted from [22] with WebPlotDigitizer and averaged. Plotting hGRU accuracy against the reported "detection score" revealed that increasing inter-paddle distance caused similar performance straining for both (Fig. S6).

5 Discussion

The present study demonstrates that long-range spatial dependencies generally strain CNNs, with only very deep, state-of-the-art networks overcoming the visual variability introduced by long paths in the Pathfinder challenge.
Although feedforward networks are generally effective at learning and detecting relatively rigid objects shown in well-defined poses, these models tend towards a brute-force solution when tasked with the recognition of less constrained structures, such as a path connecting two distant locations. This study adds to a body of work highlighting examples of routine visual tasks where CNNs fall short of human performance [58–62].

We demonstrate a solution to the Pathfinder challenge inspired by neuroscience. The hGRU leverages computational principles of visual cortical circuits to learn complex spatial interactions between units. For the Pathfinder challenge, this translates into an ability to represent the elements forming an extended path while ignoring surrounding clutter. We find that the hGRU can reliably detect paths of any tested length or form using just a single layer. This contrasts sharply with the successful state-of-the-art feedforward alternatives, which used much deeper architectures and orders of magnitude more parameters to achieve similar success. The key mechanisms underlying the hGRU's performance are well known in computational neuroscience [10, 16–20]. However, these mechanisms have typically been overlooked in computer vision (but see [39] for a successful vision model that uses horizontal connections and has been shown to break text-based CAPTCHAs).

We also found that hGRU performance on the Pathfinder challenge is a function of the amount of time it was given for processing. This finding suggests that it concurrently expands the facilitative influence of one end of a target curve to the other while suppressing the influence of distractors. The performance of the hGRU on the Pathfinder challenge captures the iterative nature of computations used by our visual system during similar tasks [63], exhibiting a comparable tradeoff between performance and processing time [7].
Visual cortex is replete with association fields that are thought to underlie perceptual grouping [11, 64]. Theoretical models suggest that patterns of horizontal connections reflect the statistics of natural scenes, and here too we find that horizontal kernels in the hGRU learned from natural scenes resemble cortical patterns of horizontal connectivity, including association fields and the paired near-excitatory / far-inhibitory surrounds that may be responsible for many contextual illusions [20, 56]. The horizontal connections learned by the hGRU reproduce another aspect of human behavior, in which the saliency of a straight contour decreases as the distance between its paddles increases. This sheds light on a possible relationship between horizontal connections and saliency computation.

In summary, this work diagnoses a computational deficiency of feedforward networks, and introduces a biologically inspired solution that can be easily incorporated into existing deep learning architectures. The weights and patterns of behavior learned by the hGRU appear consistent with those associated with the visual cortex, demonstrating its potential for establishing novel connections between machine learning, cognitive science, and neuroscience.

Acknowledgments

This research was supported by an NSF early career award [grant number IIS-1252951] and a DARPA young faculty award [grant number YFA N66001-14-1-4037]. Additional support was provided by the Carney Institute for Brain Science and the Center for Computation and Visualization (CCV) at Brown University.

References

[1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, May 2011.

[2] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-Supervised nets.
In Proceedings of\nthe International Conference on Arti\ufb01cial Intelligence and Statistics, pages 562\u2013570, February\n2015.\n\n[3] S. Xie and Z. Tu. Holistically-Nested edge detection. International Journal of Computer Vision,\n\n125(1):3\u201318, December 2017.\n\n[4] Y. Liu, M. M. Cheng, X. Hu, K. Wang, and X. Bai. Richer convolutional features for edge\ndetection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,\npages 5872\u20135881, July 2017.\n\n[5] K.-K. Maninis, J. Pont-Tuset, P. Arbelaez, and L. Van Gool. Convolutional oriented boundaries:\nFrom image segmentation to High-Level tasks. IEEE Transactions on Pattern Analysis and\nMachine Intelligence, 40(4):819\u2013833, April 2018.\n\n[6] Y. Wang, X. Zhao, and K. Huang. Deep crisp boundaries. In Proceedings of the IEEE Conference\n\non Computer Vision and Pattern Recognition, pages 3892\u20133900, 2017.\n\n[7] R. Houtkamp and P. R. Roelfsema. Parallel and serial grouping of image elements in visual\nperception. Journal of Experimental Psychology: Human Perception and Performance, 36(6):\n1443\u20131459, December 2010.\n\n[8] D. D. Stettler, A. Das, J. Bennett, and C. D. Gilbert. Lateral connectivity and contextual\n\ninteractions in macaque primary visual cortex. Neuron, 36(4):739\u2013750, November 2002.\n\n[9] K. S. Rockland and J. S. Lund. Intrinsic laminar lattice connections in primate visual cortex.\n\nJournal of Comparative Neurology, 216(3):303\u2013318, May 1983.\n\n[10] S. Grossberg and E. Mingolla. Neural dynamics of perceptual grouping: textures, boundaries,\n\nand emergent segmentations. Perception & Psychophysics, 38(2):141\u2013171, August 1985.\n\n[11] D. J. Field, A. Hayes, and R. F. Hess. Contour integration by the human visual system: Evidence\n\nfor a local \u201cassociation \ufb01eld\u201d. Vision Research, 33(2):173\u2013193, 1993.\n\n[12] G. W. Lesher and E. Mingolla. 
The role of edges and line-ends in illusory contour formation. Vision Research, 33(16):2253–2270, November 1993.

[13] W. Li, V. Piëch, and C. D. Gilbert. Contour saliency in primary visual cortex. Neuron, 50(6):951–962, June 2006.

[14] W. Li, V. Piëch, and C. D. Gilbert. Learning to link visual contours. Neuron, 57(3):442–451, February 2008.

[15] S. W. Zucker. Stereo, shading, and surfaces: Curvature constraints couple neural computations. Proceedings of the IEEE, 102(5):812–829, May 2014.

[16] P. Series, J. Lorenceau, and Y. Frégnac. The "silent" surround of V1 receptive fields: theory and experiments. Journal of Physiology-Paris, 97:453–474, 2003.

[17] L. Zhaoping. Neural circuit models for computations in early visual cortex. Current Opinion in Neurobiology, 21(5):808–815, October 2011.

[18] S. Shushruth, P. Mangapathy, J. M. Ichida, P. C. Bressloff, L. Schwabe, and A. Angelucci. Strong recurrent networks compute the orientation tuning of surround modulation in the primate primary visual cortex. Journal of Neuroscience, 32(1):308–321, 2012.

[19] D. B. Rubin, S. D. Van Hooser, and K. D. Miller. The stabilized supralinear network: a unifying circuit motif underlying multi-input integration in sensory cortex. Neuron, 85(2):402–417, January 2015.

[20] D. A. Mély, D. Linsley, and T. Serre. Complementary surrounds explain diverse contextual phenomena across visual modalities. Psychological Review, 125(5):769–784, October 2018.

[21] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN Encoder–Decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, Stroudsburg, PA, USA, 2014.

[22] W. Li and C. D. Gilbert.
Global contour saliency and local colinear interactions. Journal of\n\nNeurophysiology, 88(5):2846\u20132856, November 2002.\n\n[23] S. Xie and Z. Tu. Holistically-Nested edge detection. International Journal of Computer Vision,\n\n125(1-3):3\u201318, December 2017.\n\n[24] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):\n\n1735\u20131780, November 1997.\n\n[25] A. Graves, S. Fern\u00e1ndez, and J. Schmidhuber. Multi-dimensional recurrent neural networks.\nIn Proceedings of the International Conference on Arti\ufb01cial Neural Networks, pages 549\u2013558,\nBerlin, Heidelberg, 2007.\n\n[26] A. Graves and J. Schmidhuber. Of\ufb02ine handwriting recognition with multidimensional recurrent\nneural networks. In Advances in Neural Information Processing Systems, pages 545\u2013552, 2009.\n\n[27] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in\ncontext with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference\non Computer Vision and Pattern Recognition, pages 2874\u20132883, 2016.\n\n[28] L. Theis and M. Bethge. Generative image modeling using spatial LSTMs. In Advances in\n\nNeural Information Processing Systems, pages 1927\u20131935, 2015.\n\n[29] A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In\nProceedings of the International Conference on Machine Learning, vol. 48, pages 1747\u20131756,\nNew York, NY, USA, 2016.\n\n[30] M. Liang and X. Hu. Recurrent convolutional neural network for object recognition.\n\nIn\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages\n3367\u20133375, 2015.\n\n[31] Q. Liao and T. Poggio. Bridging the gaps between residual learning, recurrent neural networks\n\nand visual cortex. arXiv preprint arXiv:1604.03640, April 2016.\n\n[32] J. Kim, J. K. Lee, and K. M. Lee. Deeply-Recursive convolutional network for image Super-\nResolution. 
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1637–1645, 2016.

[33] C. J. Spoerer, P. McClure, and N. Kriegeskorte. Recurrent convolutional neural networks: A better model of biological object recognition. Frontiers in Psychology, 8:1551, September 2017.

[34] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. In Proceedings of the International Conference on Learning Representations, 2017.

[35] A. R. Zamir, T.-L. Wu, L. Sun, W. Shen, B. E. Shi, J. Malik, and S. Savarese. Feedback Networks. arXiv preprint arXiv:1612.09508, December 2016.

[36] A. Nayebi, D. Bear, J. Kubilius, K. Kar, S. Ganguli, D. Sussillo, J. J. DiCarlo, and D. L. K. Yamins. Task-Driven convolutional recurrent models of the visual system. arXiv preprint arXiv:1807.00053, June 2018.

[37] I. Kokkinos. Pushing the boundaries of boundary detection using deep learning. In Proceedings of the International Conference on Learning Representations, 2016.

[38] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, April 2018.

[39] D. George, W. Lehrach, K. Kansky, M. Lázaro-Gredilla, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, H. Wang, A. Lavin, and D. S. Phoenix. A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs. Science, 358(6368), December 2017.

[40] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.

[41] T. Cooijmans, N. Ballas, C. Laurent, Ç. Gülçehre, and A. Courville. Recurrent batch normalization. In Proceedings of the International Conference on Learning Representations, 2017.

[42] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):530–549, May 2004.

[43] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, pages 630–645, 2016.

[44] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[45] F. Yu and V. Koltun. Multi-Scale context aggregation by dilated convolutions. In Proceedings of the International Conference on Learning Representations, 2016.

[46] T. Wang, M. Sun, and K. Hu. Dilated deep residual network for image denoising. In 29th IEEE International Conference on Tools with Artificial Intelligence, Boston, MA, USA, pages 1272–1279, 2017.

[47] R. Hamaguchi, A. Fujita, K. Nemoto, T. Imaizumi, and S. Hikosaka. Effective use of dilated convolutions for segmenting small object instances in remote sensing imagery. CoRR, abs/1709.00179, 2017.

[48] X. Wang, R. Girshick, A. Gupta, and K. He. Non-Local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[49] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, Washington, DC, USA, 2015.

[50] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[51] G. Papandreou, I.
Kokkinos, and P.-A. Savalle. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 390–399, June 2015.

[52] K.-K. Maninis, J. Pont-Tuset, P. Arbelaez, and L. Van Gool. Convolutional oriented boundaries: From image segmentation to High-Level tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, May 2017.

[53] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer International Publishing, 2015.

[54] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional Encoder-Decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, December 2017.

[55] O. Ben-Shahar and S. Zucker. Geometrical computations explain projection patterns of long-range horizontal connections in visual cortex. Neural Computation, 16(3):445–476, March 2004.

[56] S. Shushruth, L. Nurminen, M. Bijanzadeh, J. M. Ichida, S. Vanni, and A. Angelucci. Different orientation tuning of near- and far-surround suppression in macaque primary visual cortex mirrors their tuning in human perception. Journal of Neuroscience, 33(1):106–119, January 2013.

[57] H. Tanaka and I. Ohzawa. Surround suppression of V1 neurons mediates orientation-based representation of high-order visual features. Journal of Neurophysiology, 101(3):1444–1462, March 2009.

[58] J. Kim, M. Ricci, and T. Serre. Not-So-CLEVR: learning same-different relations strains feedforward neural networks. Interface Focus, 8(4):20180011, August 2018.

[59] C. Szegedy et al. Intriguing properties of neural networks.
In Proceedings of the International Conference on Learning Representations, 2014.

[60] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.

[61] A. Volokitin, G. Roig, and T. A. Poggio. Do deep neural networks suffer from crowding? In Advances in Neural Information Processing Systems 30, pages 5628–5638, 2017.

[62] K. Ellis, A. Solar-Lezama, and J. Tenenbaum. Unsupervised learning by program synthesis. In Advances in Neural Information Processing Systems 28, pages 973–981, 2015.

[63] P. R. Roelfsema and R. Houtkamp. Incremental grouping of image elements in vision. Attention, Perception & Psychophysics, 73(8):2542–2572, November 2011.

[64] C. D. Gilbert and W. Li. Top-down influences on visual processing. Nature Reviews Neuroscience, 14(5):350–363, May 2013.

[65] C. Tallec and Y. Ollivier. Can recurrent neural networks warp time? arXiv preprint arXiv:1804.11188, March 2018.

[66] J. W. Peirce. PsychoPy—Psychophysics software in Python. Journal of Neuroscience Methods, 162(1):8–13, May 2007.

[67] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 249–256, Sardinia, Italy, 2010.

[68] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, December 2014.

[69] M. Abadi et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the USENIX Conference on Operating Systems Design and Implementation, pages 265–283, Berkeley, CA, USA, 2016.

[70] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, February 2015.

[71] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, May 2015.