{"title": "Task-Driven Convolutional Recurrent Models of the Visual System", "book": "Advances in Neural Information Processing Systems", "page_first": 5290, "page_last": 5301, "abstract": "Feed-forward convolutional neural networks (CNNs) are currently state-of-the-art for object classification tasks such as ImageNet. Further, they are quantitatively accurate models of temporally-averaged responses of neurons in the primate brain's visual system.  However, biological visual systems have two ubiquitous architectural features not shared with typical CNNs: local recurrence within cortical areas, and long-range feedback from downstream areas to upstream areas.  Here we explored the role of recurrence in improving classification performance. We found that standard forms of recurrence (vanilla RNNs and LSTMs) do not perform well within deep CNNs on the ImageNet task. In contrast, novel cells that incorporated two structural features, bypassing and gating, were able to boost task accuracy substantially. We extended these design principles in an automated search over thousands of model architectures, which identified novel local recurrent cells and long-range feedback connections useful for object recognition. Moreover, these task-optimized ConvRNNs matched the dynamics of neural activity in the primate visual system better than feedforward networks, suggesting a role for the brain's recurrent connections in performing difficult visual behaviors.", "full_text": "Task-Driven Convolutional Recurrent Models of the\n\nVisual System\n\nAran Nayebi1,*, Daniel Bear2,*, Jonas Kubilius5,7,*, Kohitij Kar5, Surya Ganguli4,8, David\n\nSussillo8, James J. DiCarlo5,6, and Daniel L. K. 
Yamins2,3,9\n\n1Neurosciences PhD Program, Stanford University, Stanford, CA 94305\n2Department of Psychology, Stanford University, Stanford, CA 94305\n\n3Department of Computer Science, Stanford University, Stanford, CA 94305\n4Department of Applied Physics, Stanford University, Stanford, CA 94305\n\n5McGovern Institute for Brain Research, MIT, Cambridge, MA 02139\n\n6Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139\n\n7Brain and Cognition, KU Leuven, Leuven, Belgium\n\n8Google Brain, Google, Inc., Mountain View, CA 94043\n9Wu Tsai Neurosciences Institute, Stanford, CA 94305\n\n*Equal contribution. {anayebi,dbear}@stanford.edu; qbilius@mit.edu\n\nAbstract\n\nFeed-forward convolutional neural networks (CNNs) are currently state-of-the-art\nfor object classi\ufb01cation tasks such as ImageNet. Further, they are quantitatively\naccurate models of temporally-averaged responses of neurons in the primate brain\u2019s\nvisual system. However, biological visual systems have two ubiquitous architectural\nfeatures not shared with typical CNNs: local recurrence within cortical areas, and\nlong-range feedback from downstream areas to upstream areas. Here we explored\nthe role of recurrence in improving classi\ufb01cation performance. We found that\nstandard forms of recurrence (vanilla RNNs and LSTMs) do not perform well\nwithin deep CNNs on the ImageNet task. In contrast, novel cells that incorporated\ntwo structural features, bypassing and gating, were able to boost task accuracy\nsubstantially. We extended these design principles in an automated search over\nthousands of model architectures, which identi\ufb01ed novel local recurrent cells and\nlong-range feedback connections useful for object recognition. 
Moreover, these\ntask-optimized ConvRNNs matched the dynamics of neural activity in the primate\nvisual system better than feedforward networks, suggesting a role for the brain\u2019s\nrecurrent connections in performing dif\ufb01cult visual behaviors.\n\n1 Introduction\n\nThe visual system of the brain must discover meaningful patterns in a complex physical world [James,\n1890]. Animals are more likely to survive and reproduce if they reliably notice food, signs of danger,\nand memorable social partners. However, the visual appearance of these objects varies widely in\nposition, pose, contrast, background, foreground, and many other factors from one occasion to the\nnext: it is not easy to identify them from low-level image properties [Pinto et al., 2008]. In primates,\nthe visual system solves this problem by transforming the original \u201cpixel values\u201d of an image into a\nnew internal representation, in which high-level properties are more explicit [DiCarlo et al., 2012].\nThis can be modeled as an algorithm that encodes each image as a set of nonlinear features. For an\nanimal, a good encoding algorithm is one that allows behaviorally-relevant information to be decoded\nsimply from these features, such as by linear classi\ufb01cation [Hung et al., 2005, Majaj et al., 2015].\nUnderstanding this encoding is key to explaining animals\u2019 visual behavior [Rajalingham et al., 2018].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nTask-optimized, deep convolutional neural networks (CNNs) have emerged as quantitatively accurate\nmodels of encoding in primate visual cortex [Yamins et al., 2014, Khaligh-Razavi and Kriegeskorte,\n2014, G\u00fc\u00e7l\u00fc and van Gerven, 2015]. 
This is due to (1) their cortically-inspired architecture, a cascade\nof spatially-tiled linear and nonlinear operations; and (2) their being optimized to perform the same\nbehaviors that animals must perform to survive, such as object recognition [Yamins and DiCarlo,\n2016]. CNNs trained to recognize objects in the ImageNet dataset predict the time-averaged neural\nresponses of cortical neurons better than any other model class. Model units from lower, middle, and\nupper convolutional layers, respectively, provide the best-known linear predictions of time-averaged\nvisual responses in neurons of early (area V1 [Khaligh-Razavi and Kriegeskorte, 2014, Cadena et al.,\n2017]), intermediate (area V4 [Yamins et al., 2014]), and higher visual cortical areas (inferotemporal\ncortex, IT [Khaligh-Razavi and Kriegeskorte, 2014, Yamins et al., 2014]). These results are especially\nstriking given that the parameters of task-optimized CNNs are entirely determined by the high-level\ncategorization goal and never directly \ufb01t to neural data.\nThe choice to train on a high-variation, challenging, real-world task, such as object categorization on\nthe ImageNet dataset, is crucial. So far, training CNNs on easier, lower-variation object recognition\ndatasets [Hong et al., 2016] or unsupervised tasks (e.g. image autoencoding [Khaligh-Razavi and\nKriegeskorte, 2014]) has not produced accurate models of per-image neural responses, especially in\nhigher visual cortex. This may indicate that capturing the wide variety of visual stimuli experienced\nby primates is critical for building accurate models of their visual systems.\nWhile these results are promising, complete models of the visual system must explain not only\ntime-averaged responses, but also the complex temporal dynamics of neural activity. Such dynamics,\neven when evoked by static stimuli, are likely functionally relevant [Gilbert and Wu, 2013, Kar\net al., 2018, Issa et al., 2018]. 
However, it is not obvious how to extend the architecture and task-\noptimization of CNNs to the case where responses change over time. Nontrivial dynamics result\nfrom biological features not present in strictly feedforward CNNs, including synapses that facilitate\nor depress, dense local recurrent connections within each cortical region, and long-range connections\nbetween different regions, such as feedback from higher to lower visual cortex [Gilbert and Wu,\n2013]. Furthermore, the behavioral roles of recurrence and dynamics in the visual system are not\nwell understood. Several conjectures are that recurrence \u201c\ufb01lls in\u201d missing data [Spoerer et al., 2017,\nMichaelis et al., 2018, Rajaei et al., 2018, Linsley et al., 2018], such as object parts occluded by other\nobjects; that it \u201csharpens\u201d representations by top-down attentional feature re\ufb01nement, allowing for\neasier decoding of certain stimulus properties or performance of certain tasks [Gilbert and Wu, 2013,\nLindsay, 2015, McIntosh et al., 2017, Li et al., 2018, Kar et al., 2018]; that it allows the brain to\n\u201cpredict\u201d future stimuli (such as the frames of a movie) [Rao and Ballard, 1999, Lotter et al., 2017,\nIssa et al., 2018]; or that recurrence \u201cextends\u201d a feedforward computation, re\ufb02ecting the fact that an\nunrolled recurrent network is equivalent to a deeper feedforward network that conserves on neurons\n(and learnable parameters) by repeating transformations several times [Liao and Poggio, 2016, Zamir\net al., 2017, Leroux et al., 2018, Rajaei et al., 2018]. 
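The last conjecture, that unrolled recurrence is equivalent to extra feedforward depth with tied weights, can be made concrete with a toy sketch (our own illustration; the single linear-plus-ReLU layer and its shapes are hypothetical, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.1  # one shared weight matrix
x = rng.standard_normal(8)

def step(h):
    # One recurrent time step: the same learned transformation reused each step.
    return np.maximum(W @ h, 0.0)

# Unroll the recurrent layer for T = 4 time steps.
h = x
for _ in range(4):
    h = step(h)

# Equivalent 4-layer feedforward network whose layers share one parameter set.
g = x
for layer_W in [W, W, W, W]:
    g = np.maximum(layer_W @ g, 0.0)

assert np.allclose(h, g)  # identical outputs, with 4x fewer unique parameters
```

The equivalence holds by construction: each unroll step applies exactly the transformation one tied feedforward layer would apply.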
Formal computational models are needed to\ntest these hypotheses: if optimizing a model for a certain task leads to accurate predictions of neural\ndynamics, then that task may be a primary reason those dynamics occur in the brain.\nWe therefore attempted to extend the method of goal-driven modeling from solving tasks with\nfeedforward CNNs [Yamins and DiCarlo, 2016] or RNNs [Mante et al., 2013] to explain dynamics\nin the primate visual system, resulting in convolutional recurrent neural networks (ConvRNNs).\nIn particular, we hypothesized that adding recurrence and feedback to CNNs would help these\nmodels perform an ethologically-relevant task and that such an augmented network would predict the\n\ufb01ne-timescale trajectories of neural responses in the visual pathway.\nAlthough CNNs augmented with recurrence have been used to solve some simple occlusion and\nfuture-prediction tasks [Spoerer et al., 2017, Lotter et al., 2017], these models have not been shown\nto generalize to harder tasks already performed by feedforward CNNs \u2014 such as recognizing objects\nin the ImageNet dataset [Deng et al., 2009, Krizhevsky et al., 2012] \u2014 nor have they explained\nneural responses as well as ImageNet-optimized CNNs. For this reason, we focused here on building\nConvRNNs at a scale capable of performing the ImageNet object recognition task. Because of\nits natural variety, the ImageNet dataset contains many images with properties hypothesized to\nrecruit recurrent processing (e.g. heavy occlusion, the presence of multiple foreground objects, etc.).\nMoreover, some of the most effective recent solutions to ImageNet (e.g. 
the ResNet models [He\net al., 2016]) repeat the same structural motif across many layers, which suggests that they might\nperform computations similar to unrolling shallower recurrent networks over time [Liao and Poggio,\n\nFigure 1: Schematic of model architecture. Convolutional recurrent networks (ConvRNNs) have a\ncombination of local recurrent cells and long-range feedback connections added on top of a CNN\nbackbone. In our implementation, propagation along each black or red arrow takes one time step (10\nms) to mimic conduction delays between cortical layers.\n\n2016]. Although other work has used the output of CNNs as input to RNNs to solve visual tasks\nsuch as object segmentation [McIntosh et al., 2017] or action recognition [Shi et al., 2018], here we\nintegrated and optimized recurrent structures within the CNN itself.\nWe found that standard recurrent motifs (e.g. vanilla RNNs, LSTMs [Elman, 1990, Hochreiter and\nSchmidhuber, 1997]) do not increase ImageNet performance above parameter-matched feedforward\nbaselines. However, we designed novel local cell architectures, embodying structural properties\nuseful for integration into CNNs, that do. To identify even better structures within the large space\nof model architectures, we then performed an automated search over thousands of models that\nvaried in their locally recurrent and long-range feedback connections. Strikingly, this procedure\ndiscovered new patterns of recurrence not present in conventional RNNs. For instance, the most\nsuccessful models exclusively used depth-separable convolutions for their local recurrent connections.\nIn addition, a small subset of long-range feedback connections boosted task performance even as\nthe majority had a negative effect. 
Overall, this search resulted in recurrent models that matched the\nperformance of a much deeper feedforward architecture (ResNet-34) while using only \u223c75% as\nmany parameters. Finally, comparing recurrent model features to neural responses in the primate\nvisual system, we found that ImageNet-optimized ConvRNNs provide a quantitatively accurate model\nof neural dynamics at a 10 millisecond resolution across intermediate and higher visual cortex. These\nresults offer a model of how recurrent motifs might be adapted for performing object recognition.\n\n2 Methods\n\nComputational Architectures. To explore the architectural space of ConvRNNs and compare\nthese models with the primate visual system, we used the Tensor\ufb02ow library [Abadi et al., 2016] to\naugment standard CNNs with both local and long-range recurrence (Figure 1). Conduction from one\narea to another in visual cortex takes approximately 10 milliseconds (ms) [Mizuseki et al., 2009],\nwith signal from photoreceptors reaching IT cortex at the top of the ventral stream by 70-100 ms.\nNeural dynamics indicating potential recurrent connections take place over the course of 100-200\nms [Issa et al., 2018]. A single feedforward volley of responses thus cannot be treated as if it were\ninstantaneous relative to the timescale of recurrence and feedback. Hence, rather than treating each\nentire feedforward pass from input to output as one integral time step, as is normally done with\nRNNs [Spoerer et al., 2017], each time step in our models corresponds to a single feedforward layer\nprocessing its input and passing it to the next layer. This choice required an unrolling scheme different\nfrom that used in the standard Tensor\ufb02ow RNN library, the code for which (and for all of our models)\ncan be found at https://github.com/neuroailab/tnn.\nWithin each ConvRNN layer, feedback inputs from higher layers are resized to match the spatial\ndimensions of the feedforward input to that layer. 
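The per-layer unrolling scheme described above can be sketched as follows (a minimal stand-in, not the tnn library's actual API): at each global tick, every layer reads the previous tick's output of the layer below it, so activity first reaches layer l after l ticks, mimicking per-connection conduction delays.

```python
import numpy as np

def unroll(layers, image, n_steps):
    """Unroll a chain of per-layer functions so each connection costs one tick.

    layers: list of callables; layer l maps its input to its output.
    Returns history[t][l], the activation of layer l at tick t
    (None if the feedforward volley has not yet arrived at that layer).
    """
    L = len(layers)
    prev = [None] * L
    history = []
    for t in range(n_steps):
        cur = [None] * L
        for l, f in enumerate(layers):
            # The static image drives layer 0 at every tick; deeper layers
            # read the previous tick's output of the layer below.
            inp = image if l == 0 else prev[l - 1]
            if inp is not None:
                cur[l] = f(inp)
        history.append(cur)
        prev = cur
    return history

relu = lambda x: np.maximum(x, 0.0)
hist = unroll([relu, relu, relu], np.ones(4), n_steps=4)
# Layer l's output first appears at tick l (0-indexed): one tick per arrow.
assert hist[1][2] is None
assert hist[2][2] is not None
```

Long-range feedback would add further entries to each layer's input at the cost of one more tick per red arrow; this sketch shows only the feedforward delays.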
Both types of input are processed by standard\n2-D convolutions. If there is any local recurrence at that layer, the output is next passed to the\nrecurrent cell as input. Feedforward and feedback inputs are combined within the recurrent cell (see\nSupplemental Methods). Finally, the output of the cell is passed through any additional nonlinearities,\nsuch as max-pooling. The generic update rule for the discrete-time trajectory of such a network\nis thus $H^{\ell}_{t+1} = C^{\ell}\big(F^{\ell}\big(\bigoplus_{j \neq \ell} R^{j}_{t}\big), H^{\ell}_{t}\big)$ and $R^{\ell}_{t} = A^{\ell}(H^{\ell}_{t})$, where $R^{\ell}_{t}$ is the output of layer $\ell$\nat time $t$ and $H^{\ell}_{t}$ is the hidden state of the locally recurrent cell $C^{\ell}$ at time $t$. $A^{\ell}$ is the activation\nfunction and any pooling post-memory operations, and $\oplus$ denotes concatenation along the channel\ndimension with appropriate resizing to align spatial dimensions. The learned parameters of such\na network consist of $F^{\ell}$, comprising any feedforward and feedback connections coming into layer\n$\ell = 1, \ldots, L$, and any of the learned parameters associated with the local recurrent cell $C^{\ell}$.\nThe simplest form of recurrence that we consider is the \u201cTime Decay\u201d model, which has a discrete-time trajectory given by $H^{\ell}_{t+1} = F^{\ell}\big(\bigoplus_{j \neq \ell} R^{j}_{t}\big) + \tau^{\ell} H^{\ell}_{t}$, where $\tau^{\ell}$ is the learned time constant at a\ngiven layer $\ell$. This model is intended to be a control for simplicity, where the time constants could\nmodel synaptic facilitation and depression in a cortical layer.\nIn this work, all forms of recurrence add parameters to the feedforward base model. 
Because this\ncould improve task performance for reasons unrelated to recurrent computation, we trained two types\nof control model to compare to ConvRNNs: (1) Feedforward models with more convolution \ufb01lters\n(\u201cwider\u201d) or more layers (\u201cdeeper\u201d) to approximately match the number of parameters in a recurrent\nmodel; and (2) Replicas of each ConvRNN model unrolled for a minimal number of time steps,\nde\ufb01ned as the number that allows all model parameters to be used at least once. A minimally unrolled\nmodel has exactly the same number of parameters as its fully unrolled counterpart, so any increase in\nperformance from unrolling longer can be attributed to recurrent computation. Fully and minimally\nunrolled ConvRNNs were trained with identical learning hyperparameters.\n\n3 Results\n\nNovel RNN Cell Structures Improve Task Performance. We \ufb01rst tested whether augmenting\nCNNs with standard RNN cells, vanilla RNNs and LSTMs, could improve performance on ImageNet\nobject recognition (Figure 2a). We found that these cells added a small amount of accuracy when\nintroduced into an AlexNet-like 6-layer feedforward backbone (Figure 2b). However, there were two\nproblems with these recurrent architectures: \ufb01rst, these ConvRNNs did not perform substantially\nbetter than parameter-matched, minimally unrolled controls, suggesting that the observed performance\ngain was due to an increase in the number of unique parameters rather than recurrent computation.\nSecond, making the feedforward model wider or deeper yielded an even larger performance gain than\nadding these standard RNN cells, but with fewer parameters. Adding other recurrent cell structures\nfrom the literature, including the UGRNN and IntersectionRNN [Collins et al., 2017], had similar\neffects to adding LSTMs (data not shown). 
This suggested that standard RNN cell structures, although\nwell-suited for a range of temporal tasks, are less well-suited for inclusion within deep CNNs.\nWe speculated that this was because standard cells lack a combination of two key properties: (1)\nGating, in which the value of a hidden state determines how much of the bottom-up input is passed\nthrough, retained, or discarded at the next time step; and (2) Bypassing, where a zero-initialized\nhidden state allows feedforward input to pass on to the next layer unaltered, as in the identity shortcuts\nof ResNet-class architectures (Figure 2a, top left). Importantly, both of these features are thought\nto address the problem of gradients vanishing as they are backpropagated to early time steps of a\nrecurrent network or to early layers of a deep feedforward network, respectively. LSTMs employ\ngating, but no bypassing, as their inputs must pass through several nonlinearities to reach their output;\nwhereas vanilla RNNs do bypass a zero-initialized hidden state, but do not gate their input (Figure 2a).\nWe thus implemented recurrent cell structures with both features to determine whether they function\nbetter than standard cells within CNNs. One example of such a structure is the \u201cReciprocal Gated\nCell\u201d, which bypasses its zero-initialized hidden state and incorporates LSTM-like gating (Figure 2a,\nbottom right, see Supplemental Methods for the cell equations). Adding this cell to the 6-layer\nCNN improved performance substantially relative to both the feedforward baseline and minimally\nunrolled, parameter-matched control version of this model. Moreover, the Reciprocal Gated Cell\nused substantially fewer parameters than the standard cells to achieve greater accuracy (Figure 2b).\nHowever, this advantage over LSTMs could have resulted accidentally from a better choice of training\nhyperparameters for the Reciprocal Gated Cells. 
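The two properties above can be illustrated with a toy cell (our own simplified sketch; the actual Reciprocal Gated Cell equations are in the paper's Supplemental Methods): a residual path around the input provides bypassing, and a sigmoid of the hidden state provides gating.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def toy_cell_step(x, h, W_gate, W_update):
    """One step of a toy recurrent cell with bypassing and gating.

    Bypassing: the output is a residual around the input, and the recurrent
    term vanishes when h == 0, so a zero-initialized hidden state passes the
    feedforward input through unaltered on the first time step.
    Gating: a sigmoid of the hidden state scales how much bottom-up input is
    written into the new hidden state versus retained from recurrence.
    """
    gate = sigmoid(W_gate @ h)  # in (0, 1)
    new_h = (1.0 - gate) * x + gate * np.tanh(W_update @ h)
    out = x + np.tanh(W_update @ h)  # equals x exactly when h == 0
    return out, new_h

rng = np.random.default_rng(1)
W_gate, W_update = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
x = np.array([1.0, -2.0, 3.0, 0.5])

out1, h1 = toy_cell_step(x, np.zeros(4), W_gate, W_update)
assert np.allclose(out1, x)              # first step: pure bypass
out2, _ = toy_cell_step(x, h1, W_gate, W_update)
assert np.linalg.norm(out2 - x) > 0      # later steps mix in recurrent state
```

As in ResNet shortcuts, the bypass keeps early unroll steps close to the identity, which also mitigates vanishing gradients through time.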
It has been shown that different RNN structures can\nsucceed or fail to perform a given task because of differences in trainability rather than differences in\n\nFigure 2: Comparison of local recurrent cell architectures. (a) Architectural differences between\nConvRNN cells. Standard ResNet blocks and vanilla RNN cells have bypassing (see text).\nThe LSTM cell has gating, denoted by \u201cT\u201d-junctions, but no bypassing. The Reciprocal Gated\nCell has both. (b) Performance of various ConvRNN and feedforward models as a function of\nnumber of parameters. Colored points incorporate the respective RNN cell into the 6-layer\nfeedforward architecture (\u201cFF\u201d). \u201cT\u201d denotes number of unroll steps. Hyperparameter-optimized\nversions of the LSTM and Reciprocal Gated Cell ConvRNNs are connected to their non-optimized\nversions by black lines.\n\ncapacity [Collins et al., 2017]. To test whether the LSTM models underperformed for this reason, we\nsearched over training hyperparameters and common structural variants of the LSTM to better adapt\nthis local structure to deep convolutional networks, using hundreds of second generation Google Cloud\nTensor Processing Units (TPUv2s). We searched over learning hyperparameters (e.g. gradient clip\nvalues, learning rate) as well as structural hyperparameters (e.g. gate convolution \ufb01lter sizes, channel\ndepth, whether or not to use peephole connections, etc.). 
While this did yield a better LSTM-based\nConvRNN, it did not eclipse the performance of the smaller Reciprocal Gated Cell-based ConvRNN.\nMoreover, applying the same hyperparameter optimization procedure to the Reciprocal Gated Cell\nmodels equally increased that architecture class\u2019s performance and further reduced its parameter\ncount (Figure 2b). Taken together, these results suggest that the choice of local recurrent structure\nstrongly affects the accuracy of a ConvRNN on a challenging object recognition task. Although\nprevious work has shown that standard RNN structures can improve performance on simpler object\nrecognition tasks, these gains have not been shown to generalize to more challenging datasets such as\nImageNet [Liao and Poggio, 2016, Zamir et al., 2017]. Differences between architectures may only\nappear on dif\ufb01cult tasks.\nHyperparameter Optimization for Deeper Recurrent Architectures. The success of Reciprocal\nGated Cells on ImageNet prompted us to search for even better-adapted recurrent structures in the\ncontext of deeper feedforward backbones. In particular, we designed a novel search space over\nConvRNN architectures that generalized that cell\u2019s use of gating and bypassing. To measure the\nquality of a large number of architectures, we installed RNN cell variants in a convolutional network\nbased on the ResNet-18 model [He et al., 2016]. Deeper ResNet models, such as ResNet-34, reach\nhigher performance on ImageNet and therefore provided a benchmark for how much recurrent\ncomputation could substitute for feedforward depth in this model class. 
Recurrent architectures varied\nin (a) how multiple inputs and hidden states were combined to produce the output of each cell; (b)\nwhich linear operations were applied to each input or hidden state; (c) which point-wise nonlinearities\nwere applied to each hidden state; and (d) the values of additive and multiplicative biases (Figure 3a).\nThe original Reciprocal Gated Cells fall within this larger hyperparameter space. We simultaneously\nsearched over all possible long-range feedback connections, which use the output of a later layer as\ninput to an earlier ConvRNN layer or cell hidden state. Finally, we varied learning hyperparameters\nof the network, such as learning rate and number of time steps to present the input image.\nSince we did not want to assume how hyperparameters would interact, we jointly optimized architec-\ntural and learning hyperparameters using a Tree-structured Parzen Estimator (TPE) [Bergstra et al.,\n2011], a type of Bayesian algorithm which searches over continuous and categorical variable con-\n\ufb01gurations (see Supplemental Methods for details). We trained hundreds of models simultaneously,\n\n5\n\n\fFigure 3: ConvRNN hyperparametrization and search results. (a) Hyperparametrization of\nlocal recurrent cells. Arrows indicate connections between cell input, hidden states, and output.\nQuestion marks denote optional connections, which may be conventional or depth-separable convolu-\ntions with a choice of kernel size. Feedforward connections between layers (`  1 out, ` in, and `\nout) are always present. Boxes with question marks indicate a choice of multiplicative gating with\nsigmoid or tanh nonlinearities, addition, or identity connections (as in ResNets). Finally, long-range\nfeedback connections from ` + k out may enter at either the local cell input, hidden state, or output.\n(b) ConvRNN search results. Each blue dot represents a model, sampled from hyperparameter\nspace, trained for 5 epochs. 
The orange line is the average performance of the last 50 models. The\nred line denotes the top performing model at that point in the search.\n\neach for 5 epochs. To reduce the amount of search time, we trained these sample models on 128 px\nimages, which yield nearly the same performance as larger images [Mishkin et al., 2016].\nMost models in this architecture space failed. In the initial TPE phase of randomly sampling\nhyperparameters, more than 80% of models either failed to improve task performance above chance\nor had exploding gradients (Figure 3b, bottom left). However, over the course of the search, an\nincreasing fraction of models reached higher performance. The median and peak sample performance\nincreased over the next \u223c7000 model samples, with the best model reaching performance >15%\nhigher than the top model from the initial phase of the search (Figure 3b). Like the contrast between\nLSTMs and Reciprocal Gated Cells, these dramatic improvements further support our hypothesis that\nsome recurrent architectures are substantially better than others for object recognition at scale.\nSampling thousands of recurrent architectures allowed us to ask which structural features contribute\nmost to task performance and which are relatively unimportant. Because TPE does not sample\nmodels according to a stationary parameter distribution, it is dif\ufb01cult to infer the causal roles of\nindividual hyperparameters. Still, several clear patterns emerged in the sequence of best-performing\narchitectures over the course of the search (Figure 3b, red line). These models used both bypassing\nand gating within their local RNN cells; depth-separable convolutions for updating hidden states,\nbut conventional convolutions for connecting bottom-up input to output (Figure 4a); and several key\nlong-range feedbacks, which were \u201cdiscovered\u201d early in the search, then subsequently used in the\nmajority of models (Figure 4b). 
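A rough back-of-envelope (our own illustration, with a hypothetical channel count, not a calculation from the paper) shows why depth-separable recurrence is attractive: a depthwise k x k filter plus a 1 x 1 pointwise mix needs far fewer parameters than a full k x k convolution over the same channels.

```python
def full_conv_params(c_in, c_out, k):
    # Conventional convolution: every output channel sees every input channel.
    return c_in * c_out * k * k

def depth_separable_params(c_in, c_out, k):
    # Depthwise k x k (one filter per input channel) + 1 x 1 pointwise mixing.
    return c_in * k * k + c_in * c_out

# Hypothetical example: a ds4x4 recurrent connection on 256 channels.
c, k = 256, 4
full = full_conv_params(c, c, k)       # 1,048,576 parameters
sep = depth_separable_params(c, c, k)  # 69,632 parameters
print(f"separable uses {sep / full:.1%} of the full-conv parameters")
```

Under these assumptions the separable form needs under 7% of the parameters, so local recurrence can be added at many layers without ballooning the model's size.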
Overall, these recurrent structures act like feedforward ResNet blocks\non their \ufb01rst time step but incorporate a wide array of operations on future time steps, including many\nof the same ones selected for in other work that optimized feedforward modules [Zoph et al., 2017].\nThus, this diversity of operations and pathways from input to output may re\ufb02ect a natural solution for\nobject recognition, whether implemented through recurrent or feedforward structures.\nIf the primate visual system uses recurrence in lieu of greater network depth to perform object\nrecognition, then a shallower recurrent model with a suitable form of recurrence should achieve\nrecognition accuracy equal to a deeper feedforward model [Liao and Poggio, 2016]. We therefore\ntested whether our search had identi\ufb01ed such well-adapted recurrent architectures by fully training a\nrepresentative ConvRNN, the model with the median \ufb01ve-epoch performance after 5000 TPE samples.\nThis median model reached a \ufb01nal Top1 ImageNet accuracy nearly equal to a ResNet-34 model\nwith nearly twice as many layers, even though the ConvRNN used only \u223c75% as many parameters\n(ResNet-34, 21.8 M parameters, 73.1% Validation Top1; Median ConvRNN, 15.5 M parameters,\n72.9% Validation Top1). 
The fully unrolled model from the random phase of the search did not\n\nFigure 4: Optimal local recurrent cell motif and global feedback connectivity. (a) RNN Cell\nstructure from the top-performing search model. Red lines indicate that this hyperparameter\nchoice (connection and \ufb01lter size) was chosen in each of the top unique models from the search (red\nline in Figure 3b). K \u00d7 K denotes a convolution and dsK \u00d7 K denotes a depth-separable convolution\nwith \ufb01lter size K \u00d7 K. (b) Long-range feedback connections from the search. (Top) Each trace\nshows the proportion of models in a 100-sample window that have a particular feedback connection.\n(Bottom) Each bar indicates the difference between the median performance of models with a given\nfeedback and the median performance of models without that feedback. Colors correspond to the\nsame feedbacks as above. 
(c) Performance of models fully trained on 224px-size ImageNet. We\ncompared the performance of an 18-layer feedforward base model (basenet) modeled after ResNet-18,\na model with trainable time constants (\u201cTime Decay\u201d) based on this basenet, the median model from\nthe search with or without global feedback connectivity, and its minimally-unrolled control (T = 12).\nThe \u201cRandom Model\u201d was selected randomly from the initial, random phase of the model search.\nParameter counts in millions are shown on top of each bar. ResNet models were trained as in [He\net al., 2016] but with the same batch size of 64 to compare to ConvRNNs.\n\nperform substantially better than the basenet, despite using more parameters \u2013 though the minimally\nunrolled control did (Figure 4c). We also considered a control model (\u201cTime Decay\u201d), described in\nSection 2, that produces temporal dynamics by learning time constants on the activations at each\nlayer, rather than by learning connectivity between units. However, this model did not perform any\nbetter than the basenet, implying that ConvRNN performance is not a trivial result of outputting\na dynamic time course of responses. Together these results demonstrate that the median model\nuses recurrent computations to perform object recognition and that particular motifs in its recurrent\narchitecture, found through hyperparameter optimization, are required for its improved accuracy.\nThus, given well-selected forms of local recurrence and long-range feedback, a ConvRNN can do the\nsame challenging visual task as a deeper feedforward CNN.\nConvRNNs Capture Neuronal Dynamics in the Primate Visual System. ConvRNNs produce a\ndynamic time series of outputs given a static input. While these recurrent dynamics could be used\nfor tasks involving time, here we optimized the ConvRNNs to perform the \u201cstatic\u201d task of object\nclassi\ufb01cation on ImageNet. 
It is possible that the primate visual system is optimized for such a task,\nbecause even static images reliably produce dynamic neural response trajectories [Issa et al., 2018].\nObjects in some images become decodable from the neuronal population later than objects in other\nimages, even though animals recognize both object sets equally well. Interestingly, late-decoding\nimages are not well classi\ufb01ed by feedforward CNNs, raising the possibility that they are encoded\nin animals through recurrent computations [Kar et al., 2018]. If this were the case, then recurrent\nnetworks trained to perform a dif\ufb01cult, static object recognition task might explain neural dynamics\nin the primate visual system, just as feedforward models explain time-averaged responses [Yamins\net al., 2014, Khaligh-Razavi and Kriegeskorte, 2014]. We therefore asked whether a task-optimized\nConvRNN could predict the \ufb01ring dynamics of primate neurons recorded during encoding of visual\nstimuli. These data comprise multi-unit array recordings from the ventral stream of rhesus macaques\nduring Rapid Serial Visual Presentation (RSVP, see Supplemental Methods [Majaj et al., 2015]).\nWe presented the fully trained median ConvRNN and its feedforward baseline with the same images\nshown to monkeys and collected the time series of features from each model layer. To decide which\nlayer should be used to predict which neural responses, we \ufb01t linear models from each feedforward\nlayer\u2019s features to the neural population and measured where explained variance on held-out images\npeaked (see Supplemental Methods). 
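The layer-assignment procedure above amounts to fitting a linear readout from each model layer and keeping the layer whose held-out explained variance peaks. Below is a toy sketch with synthetic data standing in for model features and neural recordings; it uses plain least squares, whereas the paper's exact regression is described in its Supplemental Methods:

```python
import numpy as np

rng = np.random.default_rng(0)

def held_out_r2(features, responses, n_train):
    """Fit a linear map from model features to neural responses on a
    training split; return the mean R^2 over units on held-out images."""
    W, *_ = np.linalg.lstsq(features[:n_train], responses[:n_train], rcond=None)
    pred = features[n_train:] @ W
    resid = ((responses[n_train:] - pred) ** 2).sum(axis=0)
    total = ((responses[n_train:] - responses[n_train:].mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - resid / total))

# Synthetic stand-in: the recorded units are driven by "layer 2" features.
n_images, n_units = 200, 10
layers = [rng.normal(size=(n_images, 30)) for _ in range(4)]
responses = (layers[2] @ rng.normal(size=(30, n_units))
             + 0.1 * rng.normal(size=(n_images, n_units)))

scores = [held_out_r2(f, responses, n_train=150) for f in layers]
best_layer = int(np.argmax(scores))  # the layer assigned to this population
```

In this toy setup the driving layer wins by a wide margin; in the real analysis the same comparison is what maps each recording array to its preferred model layer.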
Units recorded from distinct arrays – placed in the successive

Figure 5: Modeling primate ventral stream neural dynamics with ConvRNNs. (a) The ConvRNN model used to fit neural dynamics has local recurrent cells at layers 4 through 10 and the long-range feedbacks denoted by the red arrows. (b) Consistent with the brain's ventral hierarchy, most units from V4 are best fit by layer 6 features; pIT by layer 7; and cIT/aIT by layers 8/9. (c) Fitting model features to neural dynamics approaches the noise ceiling of these responses. The y-axis indicates the median across units of the correlation between predictions and ground-truth responses on held-out images.

V4, posterior IT, and central/anterior IT cortical areas of the macaque – were fit best by the successive 6th, 7th, and 8/9th layers of the feedforward model, respectively (Figure 5a,b). This result refines the mapping of CNN layers to defined anatomical areas within the ventral visual pathway [Yamins et al., 2014]. Finally, we measured how well the ConvRNN features from these layers predicted the dynamics of each unit. Critically, we used the same linear mapping across all times so that the task-optimized dynamics of the recurrent network had to account for the neural dynamics; each recorded unit trajectory was fit as a linear combination of model unit trajectories. On held-out images, the ConvRNN feature dynamics fit the neural response trajectories as well as or better than the feedforward baseline features for almost every 10 ms time bin (Figure 5c). Moreover, because the feedforward model has the same square-wave dynamics as its 100 ms visual input, it cannot predict neural responses after image offset plus a fixed delay, corresponding to the number of layers (Figure 5c, dashed black lines). In contrast, the activations of ConvRNN units have persistent dynamics, yielding strong predictions of the
entire neural response. This improvement is especially clear when comparing the predicted temporal responses of the feedforward and ConvRNN models on single images (Figure S1a): ConvRNN predictions have dynamic trajectories that resemble smoothed versions of the ground-truth responses, whereas feedforward predictions necessarily have the same time course for every image. Crucially, this results from the nonlinear dynamics learned on ImageNet, as both models are fit to neural data with the same form of temporally-shared linear model and the same number of parameters. Across held-out images, the ConvRNN outperforms the feedforward model in its prediction of single-image temporal responses (Figure S1b). Since the initial phase of neural dynamics was well fit by feedforward models, we asked whether the later phase could be fit by a much simpler model than the ConvRNN, namely the Time Decay model with ImageNet-trained time constants at convolutional layers. In this model, unit activations have exponential rather than immediate falloff once feedforward drive ceases. These dynamics could arise from single-neuron biophysics (e.g. synaptic depression) rather than interneuronal connections. If the Time Decay model were to explain neural data as well as ConvRNNs, it would imply that interneuronal recurrent connections are not needed to account for the observed dynamics; however, this model did not fit the late-phase dynamics of intermediate areas (V4 and pIT) or the early phase of cIT/aIT responses as well as the ConvRNN (Figure 5c). Thus, the more complex recurrence found in ConvRNNs is needed both to improve object recognition performance and to account for neural dynamics in the ventral stream, even when animals are only required to fixate on visual stimuli.

4 Discussion

Here, we show that careful introduction of recurrence into CNNs can improve performance on a challenging visual task. Even hyperparameter optimization of traditional recurrent motifs yielded ImageNet performance inferior to hand-designed motifs. Our findings raise the possibility that different locally recurrent structures within the brain may be adapted for different behaviors. 
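To make the hand-designed motifs concrete, here is a heavily simplified sketch of a recurrent cell combining gating with a bypass connection, in the spirit of Figure 4a. Matrix multiplies stand in for the paper's (depth-separable) convolutions, and the update equations are our own illustration rather than the exact cell:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_bypass_step(x, h, Wx, Wh, Wg):
    """One step of a simplified gated recurrent cell with bypassing.
    The hidden state controls a sigmoid gate on a tanh candidate update,
    while the feedforward input x bypasses the gate additively, so a
    useful feedforward signal survives even when the gate is closed."""
    gate = sigmoid(Wg @ h)              # hidden state gates the update
    update = np.tanh(Wx @ x + Wh @ h)   # candidate recurrent update
    return gate * update + x            # gated update plus input bypass

rng = np.random.default_rng(1)
d = 8
Wx, Wh, Wg = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
x = rng.normal(size=d)                  # static input, as in ImageNet unrolling
h = np.zeros(d)
for _ in range(5):                      # unroll a few time steps
    h = gated_bypass_step(x, h, Wx, Wh, Wg)
```

Note that at the first step, when h is zero, the bypass still passes x through, which is the structural property that distinguished the successful cells from vanilla RNNs and LSTMs.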
In this\ncase, the combination of locally recurrent gating, bypassing, and depth-separability was required to\nimprove ImageNet performance. Although we did not explicitly design these structures to resemble\ncortical microcircuits, the presence of reciprocal, cell type-speci\ufb01c connections between discrete\nneuronal populations is consistent with known anatomy [Costa et al., 2017]. With the proper local\nrecurrence in place, speci\ufb01c patterns of long-range feedback connections also increased performance.\nThis may indicate that long-range connections in visual cortex are important for object recognition.\nConvRNNs optimized for the ImageNet task strongly predicted neural dynamics in macaque V4 and\nIT. Speci\ufb01cally, the temporal trajectories of neuronal \ufb01ring rates were well-\ufb01t by linear regression\nfrom task-optimized ConvRNN features. Thus we extend a core result from the feedforward to the\nrecurrent regime: given an appropriate architecture class, task-driven convolutional recurrent neural\nnetworks provide the best normative models of encoding dynamics in the primate visual system.\nThe results of \ufb01tting ConvRNN features to neural dynamics are somewhat surprising in two respects.\nFirst, ConvRNN predictions are signi\ufb01cantly stronger than feedforward CNN predictions only at\ntime points when the feedforward CNNs have no visual response (Figure 5c). This implies that early\ndynamics in V4 and IT are, on average across stimuli, dominated by feedforward processing. However,\nmany neural responses to single images do not have the square-wave trajectory output by a CNN\n(Figure S1a); for this reason, single-image ConvRNN trajectories are much more highly correlated\nwith ground truth trajectories across time (Figure S1). Although some \ufb01ring rate dynamics may result\nfrom neuronal biophysics (e.g. 
synaptic depression and facilitation) rather than recurrent connections\nbetween neurons, models with task-optimized neuronal time constants at each convolutional layer did\nnot predict neural dynamics as well as ConvRNNs (Figure 5c). This simple form of \u201crecurrence\u201d also\nfailed to improve ImageNet performance. Thus, non-trivial recurrent connectivity is likely needed\nto explain the visually-evoked responses observed here. Since model predictions do not reach the\ninternal consistency of the responses, further elaborations of network architecture or the addition of\nnew tasks may yield further improvements in \ufb01tting. Alternatively, the internal consistency computed\nhere may incorporate task-irrelevant neural activity, such as from animal-speci\ufb01c anatomy or circuits\nnot functionally engaged during RSVP. These models will therefore need to be tested on neural\ndynamics of animals performing active visual tasks.\nSecond, it has often been suggested that recurrence is important for processing images with particular\nsemantic properties, such as clutter or occlusion. Here, in contrast, we have learned recurrent struc-\ntures that improve performance on the same general categorization task performed by feedforward\nnetworks, without a priori mandating a \u201crole\u201d for recurrence that is qualitatively distinct from that of\nany other model component. A specialized role for recurrence might be identi\ufb01ed by analyzing the\nlearned motifs ex post facto. However, recent electrophysiology experiments suggest this might not\nbe the most effective approach. Kar et al. [2018] have identi\ufb01ed a set of \u201cchallenge images\u201d on which\nfeedforward CNNs fail to solve object recognition tasks, but which are easily solved by humans and\nnon-human primates. 
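The internal consistency (noise ceiling) referred to above is commonly estimated by split-half reliability with a Spearman-Brown correction. The sketch below uses that standard estimator on synthetic repeated trials; the paper's exact procedure is in its Supplemental Methods:

```python
import numpy as np

def split_half_consistency(trials, rng):
    """Estimate a unit's internal consistency: correlate the mean
    responses of two random halves of the repeated trials, then apply
    the Spearman-Brown correction to estimate the reliability of the
    full trial-averaged response. `trials` is (n_repeats, n_images)."""
    n = trials.shape[0]
    idx = rng.permutation(n)
    a = trials[idx[: n // 2]].mean(axis=0)
    b = trials[idx[n // 2:]].mean(axis=0)
    r = np.corrcoef(a, b)[0, 1]
    return 2 * r / (1 + r)  # Spearman-Brown correction for doubled length

rng = np.random.default_rng(2)
signal = rng.normal(size=50)                       # true per-image response
trials = signal + 0.5 * rng.normal(size=(20, 50))  # 20 noisy repeats
ceiling = split_half_consistency(trials, rng)      # close to, but below, 1.0
```

A model's held-out correlation is then compared against this ceiling, which is why task-irrelevant but repeatable activity (e.g. animal-specific circuits) can make the ceiling unreachable for any stimulus-driven model.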
Moreover, objects in challenge images are decodable from IT cortex population responses significantly later in the neuronal timecourse than control images, suggesting that recurrent circuits are meaningfully engaged when processing the challenge images. Interestingly, however, the challenge images appear not to have any clear defining properties or language-level description, such as containing occlusion, clutter, or unusual object poses. This observation is consistent with our results, in which the non-specific "role" of recurrence in ConvRNNs is simply to help recognize objects across the wide variety of conditions present in natural scenes. Since biological neural circuits have been honed by natural selection on animal behavior, not on human intelligibility, the role of an individual circuit component may in general have no simple description.

Optimizing recurrent augmentations for more sophisticated feedforward CNNs, such as NASNet [Zoph et al., 2017], could lead to improvements to these state-of-the-art baselines. Further experiments will explore whether different tasks, such as those that do not rely on difficult-to-obtain category labels, can substitute for supervised object recognition in matching ConvRNNs to neural responses.

5 Acknowledgments

This project has received funding from the James S. McDonnell foundation (grant #220020469, D.L.K.Y.), the Simons Foundation (SCGB grants 543061 (D.L.K.Y) and 325500/542965 (J.J.D)), the US National Science Foundation (grant IIS-RI1703161, D.L.K.Y.), the European Union's Horizon 2020 research and innovation programme (agreement #705498, J.K.), the US National Eye Institute (grant R01-EY014970, J.J.D.), and the US Office of Naval Research (MURI-114407, J.J.D). We thank Google (TPUv2 team) and the NVIDIA corporation for generous donation of hardware resources.

References

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. 
Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In NIPS, 2011.

J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, and D. D. Cox. Hyperopt: a python library for model selection and hyperparameter optimization. Computational Science & Discovery, 8(1), 2015.

S. A. Cadena, G. H. Denfield, E. Y. Walker, L. A. Gatys, A. S. Tolias, M. Bethge, and A. S. Ecker. Deep convolutional models improve predictions of macaque v1 responses to natural images. bioRxiv, page 201764, 2017.

J. Collins, J. Sohl-Dickstein, and D. Sussillo. Capacity and trainability in recurrent neural networks. In ICLR, 2017.

R. Costa, I. A. Assael, B. Shillingford, N. de Freitas, and T. Vogels. Cortical microcircuits as gated-recurrent neural networks. In NIPS, 2017.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

J. J. DiCarlo, D. Zoccolan, and N. C. Rust. How does the brain solve visual object recognition? Neuron, 73(3):415–34, 2012.

J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

C. D. Gilbert and L. Wu. Top-down influences on visual processing. Nat. Rev. Neurosci., 14(5):350–363, 2013.

U. Güçlü and M. A. van Gerven. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. The Journal of Neuroscience, 35(27):10005–10014, 2015.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. doi: https://doi.org/10.1109/CVPR.2016.90.

S. Hochreiter and J. Schmidhuber. Long short-term memory. 
Neural Computation, 9(8):1735–1780, 1997.

H. Hong, D. L. Yamins, N. J. Majaj, and J. J. DiCarlo. Explicit information for category-orthogonal object properties increases along the ventral stream. Nat. Neurosci., 19(4):613–622, Apr 2016.

C. P. Hung, G. Kreiman, T. Poggio, and J. J. DiCarlo. Fast readout of object identity from macaque inferior temporal cortex. Science, 310(5749):863–6, 2005.

E. B. Issa, C. F. Cadieu, and J. J. DiCarlo. Neural dynamics at successive stages of the ventral visual stream are consistent with hierarchical error signals. bioRxiv preprint, 2018. doi: 10.1101/092551.

W. James. The principles of psychology (vol. 1). New York: Holt, 474, 1890.

K. Kar, J. Kubilius, K. M. Schmidt, E. B. Issa, and J. J. DiCarlo. Evidence that recurrent circuits are critical to the ventral stream's execution of core object recognition behavior. bioRxiv, 2018. doi: 10.1101/354753. URL https://www.biorxiv.org/content/early/2018/06/26/354753.

S.-M. Khaligh-Razavi and N. Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS computational biology, 10(11):e1003915, 2014.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

S. Leroux, P. Molchanov, P. Simoens, B. Dhoedt, T. Breuel, and J. Kautz. Iamnn: iterative and adaptive mobile neural network for efficient image classification. In ICLR Workshop 2018, 2018.

X. Li, Z. Jie, J. Feng, C. Liu, and S. Yan. Learning with rethinking: recurrently improving convolutional neural networks through feedback. Pattern Recognition, 79:183–194, 2018.

Q. Liao and T. Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640, 2016.

G. W. 
Lindsay. Feature-based attention in convolutional neural networks. arXiv preprint arXiv:1511.06408, 2015.

D. Linsley, J. Kim, V. Veerabadran, and T. Serre. Learning long-range spatial dependencies with horizontal gated-recurrent units. arXiv preprint arXiv:1805.08315, 2018.

W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. In ICLR, 2017.

N. J. Majaj, H. Hong, E. A. Solomon, and J. J. DiCarlo. Simple learned weighted sums of inferior temporal neuronal firing rates accurately predict human core object recognition performance. Journal of Neuroscience, 35(39):13402–13418, 2015.

V. Mante, D. Sussillo, K. V. Shenoy, and W. T. Newsome. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503:78–84, 2013.

L. McIntosh, N. Maheswaranathan, D. Sussillo, and J. Shlens. Recurrent segmentation for variable computational budgets. arXiv preprint arXiv:1711.10151, 2017.

C. Michaelis, M. Bethge, and A. S. Ecker. One-shot segmentation in clutter. arXiv preprint arXiv:1803.09597, 2018.

D. Mishkin, N. Sergievskiy, and J. Matas. Systematic evaluation of cnn advances on the imagenet. arXiv preprint arXiv:1606.02228, 2016.

K. Mizuseki, A. Sirota, E. Pastalkova, and G. Buzsáki. Theta oscillations provide temporal windows for local circuit computation in the entorhinal-hippocampal loop. Neuron, pages 267–280, 2009.

N. Pinto, D. D. Cox, and J. J. DiCarlo. Why is real-world visual object recognition hard? PLoS Computational Biology, 2008.

K. Rajaei, Y. Mohsenzadeh, R. Ebrahimpour, and S.-M. Khaligh-Razavi. Beyond core object recognition: Recurrent processes account for object recognition under occlusion. bioRxiv, 2018. doi: 10.1101/302034. URL https://www.biorxiv.org/content/early/2018/04/17/302034.

R. Rajalingham, E. B. Issa, P. Bashivan, K. Kar, K. Schmidt, and J. J. DiCarlo. 
Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. bioRxiv, 2018. doi: 10.1101/240614. URL https://www.biorxiv.org/content/early/2018/01/01/240614.

R. P. Rao and D. H. Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci., 2(1):79–87, Jan 1999.

J. Shi, H. Wen, Y. Zhang, K. Han, and Z. Liu. Deep recurrent neural network reveals a hierarchy of process memory during dynamic natural vision. Hum Brain Mapp., 39:2269–2282, 2018.

C. J. Spoerer, P. McClure, and N. Kriegeskorte. Recurrent convolutional neural networks: a better model of biological object recognition. Front. Psychol., 8:1–14, 2017.

D. L. Yamins and J. J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature neuroscience, 19(3):356, 2016.

D. L. K. Yamins, H. Hong, C. F. Cadieu, E. A. Solomon, D. Seibert, and J. J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014. doi: 10.1073/pnas.1403112111.

A. R. Zamir, T.-L. Wu, L. Sun, W. B. Shen, B. E. Shi, J. Malik, and S. Savarese. Feedback networks. In CVPR, 2017. doi: 10.1109/CVPR.2017.196.

B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. 
arXiv preprint arXiv:1707.07012, 2017.