{"title": "Brain-Like Object Recognition with High-Performing Shallow Recurrent ANNs", "book": "Advances in Neural Information Processing Systems", "page_first": 12805, "page_last": 12816, "abstract": "Deep convolutional artificial neural networks (ANNs) are the leading class of candidate models of the mechanisms of visual processing in the primate ventral stream. While initially inspired by brain anatomy, over the past years, these ANNs have evolved from a simple eight-layer architecture in AlexNet to extremely deep and branching architectures, demonstrating increasingly better object categorization performance, yet bringing into question how brain-like they still are. In particular, typical deep models from the machine learning community are often hard to map onto the brain's anatomy due to their vast number of layers and missing biologically-important connections, such as recurrence. Here we demonstrate that better anatomical alignment to the brain and high performance on machine learning as well as neuroscience measures do not have to be in contradiction. We developed CORnet-S, a shallow ANN with four anatomically mapped areas and recurrent connectivity, guided by Brain-Score, a new large-scale composite of neural and behavioral benchmarks for quantifying the functional fidelity of models of the primate ventral visual stream. Despite being significantly shallower than most models, CORnet-S is the top model on Brain-Score and outperforms similarly compact models on ImageNet. Moreover, our extensive analyses of CORnet-S circuitry variants reveal that recurrence is the main predictive factor of both Brain-Score and ImageNet top-1 performance. Finally, we report that the temporal evolution of the CORnet-S \"IT\" neural population resembles the actual monkey IT population dynamics. Taken together, these results establish CORnet-S, a compact, recurrent ANN, as the current best model of the primate ventral visual stream.", "full_text": "Brain-Like Object Recognition with\n\nHigh-Performing Shallow Recurrent ANNs\n\nJonas Kubilius*,1,2, Martin Schrimpf *,1,3,4,\n\nKohitij Kar1,3,4, Rishi Rajalingham1, Ha Hong5, Najib J. Majaj6, Elias B. Issa7, Pouya\nBashivan1,3, Jonathan Prescott-Roy1, Kailyn Schmidt1, Aran Nayebi8, Daniel Bear9,\n\nDaniel L. K. Yamins9,10, and James J. DiCarlo1,3,4\n\n1McGovern Institute for Brain Research, MIT, Cambridge, MA 02139\n\n2Brain and Cognition, KU Leuven, Leuven, Belgium\n\n3Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139\n\n4Center for Brains, Minds and Machines, MIT, Cambridge, MA 02139\n\n5Bay Labs Inc., San Francisco, CA 94102\n\n6Center for Neural Science, New York University, New York, NY 10003\n\n7Department of Neuroscience, Zuckerman Mind Brain Behavior Institute, Columbia University, New\n\nYork, NY 10027\n\n8Neurosciences PhD Program, Stanford University, Stanford, CA 94305\n9Department of Psychology, Stanford University, Stanford, CA 94305\n\n10Department of Computer Science, Stanford University, Stanford, CA 94305\n\n*Equal contribution\n\nAbstract\n\nDeep convolutional arti\ufb01cial neural networks (ANNs) are the leading class of\ncandidate models of the mechanisms of visual processing in the primate ventral\nstream. While initially inspired by brain anatomy, over the past years, these ANNs\nhave evolved from a simple eight-layer architecture in AlexNet to extremely deep\nand branching architectures, demonstrating increasingly better object categorization\nperformance, yet bringing into question how brain-like they still are. 
In particular,\ntypical deep models from the machine learning community are often hard to\nmap onto the brain\u2019s anatomy due to their vast number of layers and missing\nbiologically-important connections, such as recurrence. Here we demonstrate that\nbetter anatomical alignment to the brain and high performance on machine learning\nas well as neuroscience measures do not have to be in contradiction. We developed\nCORnet-S, a shallow ANN with four anatomically mapped areas and recurrent\nconnectivity, guided by Brain-Score, a new large-scale composite of neural and\nbehavioral benchmarks for quantifying the functional \ufb01delity of models of the\nprimate ventral visual stream. Despite being signi\ufb01cantly shallower than most\nmodels, CORnet-S is the top model on Brain-Score and outperforms similarly\ncompact models on ImageNet. Moreover, our extensive analyses of CORnet-S\ncircuitry variants reveal that recurrence is the main predictive factor of both Brain-\nScore and ImageNet top-1 performance. Finally, we report that the temporal\nevolution of the CORnet-S \"IT\" neural population resembles the actual monkey IT\npopulation dynamics. Taken together, these results establish CORnet-S, a compact,\nrecurrent ANN, as the current best model of the primate ventral visual stream.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Synergizing machine learning and neuroscience through Brain-Score (top). By quan-\ntifying brain-likeness of models, we can compare models of the brain and use insights gained to\ninform the next generation of models. Green dots represent popular deep neural networks while gray\ndots correspond to various exemplary small-scale models (BaseNets) that demonstrate the relationship\nbetween ImageNet performance and Brain-Score on a wider range of performances (see Section 4.1).\nCORnet-S is the current best model on Brain-Score. CORnet-S area circuitry (bottom left). The\nmodel consists of four areas which are pre-mapped to cortical areas V1, V2, V4, and IT in the ventral\nstream. V1COR is feed-forward and acts as a pre-processor to reduce the input complexity. V2COR,\nV4COR and ITCOR are recurrent (within area). See Section 2.1 for details.\n\n1\n\nIntroduction\n\nNotorious for their superior performance in object recognition tasks, arti\ufb01cial neural networks (ANNs)\nhave also witnessed a tremendous success in the neuroscience community as currently the best class of\nmodels of the neural mechanisms of visual processing. Surprisingly, after training deep feedforward\nANNs to perform the standard ImageNet categorization task [6], intermediate layers in ANNs can\npartly account for how neurons in intermediate layers of the primate visual system will respond to\nany given image, even ones that the model has never seen before [46, 48, 17, 9, 4, 47]. Moreover,\nthese networks also partly predict human and non-human primate object recognition performance\nand object similarity judgments [33, 21]. Having strong models of the brain opened up unexpected\npossibilities of noninvasive brain-machine interfaces where models are used to generate stimuli,\noptimized to elicit desired responses in primate visual system [2].\nHow can we push these models to capture brain processing even more stringently? Continued\narchitectural optimization on ImageNet alone no longer seems like a viable option. 
Indeed, more recent and deeper ANNs have not been shown to further improve on measures of brain-likeness [33], even though their ImageNet performance has vastly increased [34]. Moreover, while the initial limited number of layers could easily be assigned to the different areas of the ventral stream, the link between the handful of ventral stream areas and the several hundred layers in ResNet [10] or the complex, branching structures in Inception and NASNet [41, 27] is not obvious. Finally, high-performing models for object recognition remain feedforward, whereas recent studies have established an important functional involvement of recurrent processes in object recognition [42, 16].

We propose that aligning ANNs to neuroanatomy might lead to more compact, interpretable and, most importantly, functionally brain-like ANNs. To test this, we here demonstrate that a neuroanatomically more aligned ANN, CORnet-S, exhibits an improved match to measurements from the ventral stream while maintaining high performance on ImageNet. CORnet-S commits to a shallow recurrent anatomical structure of the ventral visual stream, and thus achieves a much more compact architecture while retaining a strong ImageNet top-1 performance of 73.1% and setting the new state of the art in predicting neural firing rates and image-by-image human behavior on Brain-Score, a novel large-scale benchmark composed of neural recordings and behavioral measurements. We identify that these results are primarily driven by recurrent connections, in line with our understanding of how the primate visual system processes visual information [42, 16]. In fact, comparing the high-level ("IT") neural representations between recurrent steps in the model and time-varying primate IT recordings, we find that CORnet-S partly captures these neural response trajectories, the first model to do so on this neural benchmark.

2 CORnet-S: Brain-driven model architecture

We developed CORnet-S based on the following criteria (following [20]):

(1) Predictivity, so that it is a mechanistic model of the brain. We are not only interested in having correct model outputs (behaviors) but also in internals that match the brain's anatomical and functional constraints. We prefer ANNs because neurons are the units of online information transmission, and models without neurons cannot be obviously mapped to neural spiking data [47].

(2) Compactness, i.e.
among models with similar scores, we prefer simpler models as they are\npotentially easier to understand and more ef\ufb01cient to experiment with. However, there are many ways\nto de\ufb01ne this simplicity. Motivated by the observation that the feedforward path from retinal input\nto IT is fairly limited in length (e.g., [43]), for the purposes of this study we use depth as a simple\nproxy to meeting the biological constraint in arti\ufb01cial neural networks. Here we de\ufb01ned depth as the\nnumber of convolutional and fully connected layers in the longest feedforward path of a model.\n(3) Recurrence: while core object recognition was originally believed to be largely feedforward\nbecause of its fast time scale [7], it has long been suspected that recurrent connections must be\nrelevant for some aspects of object perception [22, 1, 44], and recent studies have shown their role\neven at short time scales [16, 42, 35, 32, 5]. Moreover, responses in the visual system have a temporal\npro\ufb01le, so models at least should be able to produce responses over time too.\n\n2.1 CORnet-S model speci\ufb01cs\n\nCORnet-S (Fig. 1) aims to rival the best models on Brain-Score by transforming very deep feedforward\narchitectures into a shallow recurrent model. Speci\ufb01cally, CORnet-S draws inspiration from ResNets\nthat are some of the best models on our behavioral benchmark (Fig. 1; [33]) and can be thought of as\nunrolled recurrent networks [25]. Recent studies further demonstrated that weight sharing in ResNets\nwas indeed possible without a signi\ufb01cant loss in CIFAR and ImageNet performance [15, 23].\nMoreover, CORnet-S speci\ufb01cally commits to an anatomical mapping to brain areas. While for\ncomparison models we establish this mapping by searching for the layer in the model that best\nexplains responses in a given brain area, ideally such mapping would already be provided by the\nmodel, leaving no free parameters. Thus, CORnet-S has four computational areas, conceptualized as\nanalogous to the ventral visual areas V1, V2, V4, and IT, and a linear category decoder that maps\nfrom the population of neurons in the model\u2019s last visual area to its behavioral choices. This simplistic\nassumption of clearly separate regions with repeated circuitry was a \ufb01rst step for us to aim at building\nas shallow a model as possible, and we are excited about exploring less constrained mappings (such\nas just treating everything as a neuron without the distinction into regions) and more diverse circuitry\n(that might in turn improve model scores) in the future.\nEach visual area implements a particular neural circuitry with neurons performing simple canonical\ncomputations: convolution, addition, nonlinearity, response normalization or pooling over a receptive\n\ufb01eld. The circuitry is identical in each of its visual areas (except for V1COR), but we vary the total\nnumber of neurons in each area. Due to high computational demands, \ufb01rst area V1COR performs a\n7 \u00d7 7 convolution with stride 2, 3 \u00d7 3 max pooling with stride 2, and a 3 \u00d7 3 convolution. Areas\nV2COR, V4COR and ITCOR perform two 1 \u00d7 1 convolutions, a bottleneck-style 3 \u00d7 3 convolution\nwith stride 2, expanding the number of features fourfold, and a 1 \u00d7 1 convolution. To implement\nrecurrence, outputs of an area are passed through that area several times. 
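A loose sketch of one such weight-shared recurrent area is given below (a simplification for illustration only, assuming a plain bottleneck circuit; the released implementation at github.com/dicarlolab/cornet differs in details such as the gating, the per-convolution normalization, and the exact bottleneck widths):

import torch
from torch import nn

class RecurrentArea(nn.Module):
    """Loose sketch of a weight-shared CORnet-S-style area (V2/V4/IT).

    Illustrative only: layer sizes, normalization placement and gating are
    simplified relative to the released code at github.com/dicarlolab/cornet.
    """

    def __init__(self, in_channels, out_channels, times=2, expansion=4):
        super().__init__()
        self.times = times
        self.conv_input = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.skip = nn.Conv2d(out_channels, out_channels, 1, stride=2, bias=False)
        self.conv1 = nn.Conv2d(out_channels, out_channels * expansion, 1, bias=False)
        self.conv2 = nn.Conv2d(out_channels * expansion, out_channels * expansion,
                               3, padding=1, bias=False)  # stride is set per pass below
        self.conv3 = nn.Conv2d(out_channels * expansion, out_channels, 1, bias=False)
        # batch normalization is not shared over time (cf. Jastrzebski et al. [15])
        self.norms = nn.ModuleList([nn.BatchNorm2d(out_channels) for _ in range(times)])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv_input(x)
        for t in range(self.times):
            if t == 0:
                # downsample spatially only on the first pass; project the skip path
                skip, self.conv2.stride = self.skip(x), (2, 2)
            else:
                skip, self.conv2.stride = x, (1, 1)
            out = self.relu(self.conv1(x))
            out = self.relu(self.conv2(out))
            out = self.conv3(out)
            # the area's own output is fed back through the same weights
            x = self.relu(self.norms[t](out + skip))
        return x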
For instance, after V2COR\nprocessed the input once, that result is passed into V2COR again and treated as a new input (while\nthe original input is discarded, see \"gate\" in Fig. 1). V2COR and ITCOR are repeated twice, V4COR is\nrepeated four times as this results in the most minimal con\ufb01guration that produced the best model\nas determined by our scores (see Fig. 4). As in ResNet, each convolution (except the \ufb01rst 1 \u00d7 1) is\nfollowed by batch normalization [14] and ReLU nonlinearity. Batch normalization was not shared\nover time as suggested by Jastrzebski et al. [15]. There are no across-area bypass or across-area\n\n3\n\n\ffeedback connections in the current de\ufb01nition of CORnet-S and retinal and LGN processing are not\nexplicitly modeled.\nThe decoder part of a model implements a simple linear classi\ufb01er \u2013 a set of weighted linear sums\nwith one sum for each object category. To reduce the amount of neural responses projecting to this\nclassi\ufb01er, we \ufb01rst average responses over the entire receptive \ufb01eld per feature map.\n\n2.2\n\nImplementation Details\n\nWe used PyTorch 0.4.1 and trained the model using ImageNet 2012 [34]. Images were preprocessed\n(1) for training \u2013 random crop to 224 \u00d7 224 pixels and random \ufb02ipping left and right; (2) for validation\n- central crop to 224 \u00d7 224 pixels; (3) for Brain-Score \u2013 resizing to 224 \u00d7 224 pixels. In all cases, this\npreprocessing was followed by normalization by mean subtraction and division by standard deviation\nof the dataset. We used a batch size of 256 images and trained on 2 GPUs (NVIDIA Titan X / GeForce\n1080Ti) for 43 epochs. We use similar learning rate scheduling to ResNet with more variable learning\nrate updates (primarily in order to train faster): 0.1, divided by 10 every 20 epochs. For optimization,\nwe use Stochastic Gradient Descent with momentum .9, a cross-entropy loss between image labels\nand model predictions (logits).\nImageNet-pretrained CORnet-S is available at github.com/dicarlolab/cornet.\n\n2.3 Comparison to other models\n\nLiang & Hu [24] introduced a deep recurrent neural network intended for object recognition by adding\na variant of a simple recurrent cell to a shallow \ufb01ve-layer convolutional neural network backbone.\nZamir et al. [49] built a more powerful version by employing LSTM cells, and a similar approach was\nused by [38] who showed that a simple version of a recurrent net can improve network performance\non an MNIST-based task. Liao & Poggio [25] argued that ResNets can be thought of as recurrent\nneural networks unrolled over time with non-shared weights, and demonstrated the \ufb01rst working\nversion of a folded ResNet, also explored by [15].\nHowever, all of these networks were only tested on CIFAR-100 at best. As noted by Nayebi et al. [29],\nwhile many networks may do well on a simpler task, they may differentiate once the task becomes\nsuf\ufb01ciently dif\ufb01cult. Moreover, our preliminary testing indicated that non-ImageNet-trained models\ndo not appear to score high on Brain-Score, so even for practical purposes we needed models that\ncould be trained on ImageNet. Leroux et al. [23] proposed probably the \ufb01rst recurrent architecture\nthat performed well on ImageNet. In an attempt to explore the recurrent net space in a more principled\nway, Nayebi et al. 
[29] performed a large-scale search in the LSTM-based recurrent cell space by allowing the search to find the optimal combination of local and long-range recurrent connections. The best model demonstrated strong ImageNet performance while being shallower than feedforward controls. In this work, we wanted to go one step further and build a maximally compact model that would nonetheless yield top Brain-Score and outperform other recurrent networks on ImageNet.

3 Brain-Score: Comparing models to brain

To obtain quantified scores for brain-likeness, we built Brain-Score, a composite benchmark that measures how well models can predict (a) the mean neural response of each neural recording site to each and every tested naturalistic image in non-human primate visual areas V4 and IT (data from [28]); (b) the mean pooled human choices when reporting a target object for each tested naturalistic image (data from [33]); and (c) when object category is resolved in non-human primate area IT (data from [16]). To rank models on an overall score, we take the mean of the behavioral score, the V4 neural score, the IT neural score, and the neural dynamics score (explained below).

Brain-Score is open-sourced as a platform for scoring neural networks on brain data: an overview of scores is available at Brain-Score.org, and the code at github.com/brain-score.

3.1 Neural predictivity

Neural predictivity is used to evaluate how well responses to given images in a source system (e.g., a deep ANN) predict the responses in a target system (e.g., a single neuron's response in visual area IT; [48]). As inputs, this metric requires two assemblies of the form stimuli × neuroid, where neuroids can either be neural recordings or model activations.

A total of 2,560 images containing a single object pasted randomly on a natural background were presented centrally to passively fixated monkeys for 100 ms, and neural responses were obtained from 88 V4 sites and 168 IT sites. For our analyses, we used normalized time-averaged neural responses in the 70-170 ms window. For models, we reported the most predictive layer or (for CORnet-S) designated model areas and the best time point.

Source neuroids were mapped to each target neuroid linearly using a PLS regression model with 25 components. The mapping procedure was performed for each neuron using 90% of image responses and tested on the remaining 10% in a 10-fold cross-validation strategy with stratification over objects. In each run, the weights were fit to map from source neuroids to a target neuroid using training images, and then using these weights predicted responses were obtained for the held-out images. To speed up this procedure, we first reduced input dimensionality to 1,000 components using PCA. We used the neuroids from V4 and IT separately to compute these fits. The median over neurons of the Pearson's r between the predicted and actual responses constituted the final neural fit score for each visual area.
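A minimal sketch of this mapping procedure is shown below (illustrative variable names; for brevity all target sites are fit with a single multi-output regression, whereas the text above describes a per-site fit, and the maintained scoring code lives at github.com/brain-score):

import numpy as np
from scipy.stats import pearsonr
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold

def neural_predictivity(model_activations, neural_responses, object_labels, n_folds=10):
    """Median over sites of the Pearson r between predicted and actual responses.

    model_activations: (n_stimuli, n_model_units); neural_responses: (n_stimuli, n_sites);
    object_labels: (n_stimuli,) object identities used to stratify the folds.
    Sketch of Section 3.1, not the maintained Brain-Score implementation.
    """
    # speed-up: reduce model features to 1,000 components before the mapping
    features = PCA(n_components=1000).fit_transform(model_activations)
    n_sites = neural_responses.shape[1]
    scores = np.zeros((n_folds, n_sites))
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    for fold, (train, test) in enumerate(folds.split(features, object_labels)):
        mapping = PLSRegression(n_components=25, scale=False)
        mapping.fit(features[train], neural_responses[train])
        predicted = mapping.predict(features[test])
        for site in range(n_sites):
            scores[fold, site] = pearsonr(predicted[:, site],
                                          neural_responses[test, site])[0]
    # average over folds, then take the median over recording sites
    return np.median(scores.mean(axis=0))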
3.2 Behavioral predictivity

The purpose of behavioral benchmarks is to compute the similarity between source (e.g., an ANN model) and target (e.g., human or monkey) behavioral responses in any given task [33]. For core object recognition tasks, primates (both human and monkey) exhibit behavioral patterns that differ from ground-truth labels. Thus, our primary benchmark here is a behavioral response pattern metric, not an overall accuracy metric, and higher scores are obtained by ANNs that produce and predict the primate patterns of successes and failures. One consequence of this is that ANNs that achieve 100% accuracy will not achieve a perfect behavioral similarity score.

A total of 2,400 images containing a single object pasted randomly on a natural background were presented to 1,472 humans for 100 ms, and they were asked to choose from two options which object they saw. For further analyses, we used participants' response accuracies for the 240 images that had around 60 responses per object-distractor pair (~300,000 unique responses). For evaluating models, we used model responses to the 2,400 images from the layer just prior to the 1,000-value category vectors. 2,160 of those images were used to build a 24-way logistic regression decoder, where each entry of the 24-value vector is the probability that a given object is in the image. This regression was then used to estimate probabilities for the 240 held-out images.

Next, both for human and model responses, for each image, all normalized object-distractor pair probabilities were computed from the 24-way probability vector as p(truth) / (p(truth) + p(choice)). These probabilities were converted into a d′ measure: d′ = Z(Hit Rate) − Z(False Alarm Rate), where Z is the estimated z-score of responses, the Hit Rate is the accuracy of a given object-distractor pair, and the False Alarm Rate corresponds to how often the observers incorrectly reported seeing that target object in images where another object was presented. For instance, if a given image contained a dog and the distractor was a bear, the Hit Rate for the dog-bear pair for that image came straight from the 240 × 24 matrix, while in order to obtain the False Alarm Rate, all cells from that matrix that did not have dogs in the image but had a dog as a distractor were averaged, and 1 minus that value was used as the False Alarm Rate. All d′ above 5 were clipped. This transformation helped to remove bias in responses and also to diminish ceiling effects (since many primate accuracies were close to 1), but the empirically observed benefits of d′ in this dataset were small; see [33] for a thorough explanation. The resulting response matrix was further refined by subtracting the mean d′ across trials of the same object-distractor pair (e.g., for dog-bear trials, their mean was subtracted from each trial). Such normalization exposes variance unique to each image and removes global trends that may be easier for models to capture. The behavioral predictivity score was computed as the Pearson's r correlation between the actual primate behavioral choices and the model's predictions.
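The sketch below condenses this image-level d′ computation (hypothetical array names; the probabilities can come either from the pooled human choices or from the model's 24-way decoder, and the reference procedure is described in full in [33]):

import numpy as np
from scipy.stats import norm

def dprime_response_matrix(prob_24way, true_labels, clip=5.0):
    """Image x distractor d′ matrix from 24-way choice probabilities (Section 3.2 sketch).

    prob_24way: (n_images, 24) numpy array of per-object probabilities;
    true_labels: (n_images,) numpy array of integer object indices.
    """
    n_images, n_objects = prob_24way.shape
    dprime = np.full((n_images, n_objects), np.nan)
    for img in range(n_images):
        truth = true_labels[img]
        # false-alarm rate for this target: images of other objects where it was the distractor
        others = true_labels != truth
        false_alarms = np.mean(prob_24way[others, truth] /
                               (prob_24way[others, true_labels[others]] + prob_24way[others, truth]))
        for distractor in range(n_objects):
            if distractor == truth:
                continue  # the true-object column stays undefined
            # normalized probability of choosing the true object over this distractor
            hit_rate = prob_24way[img, truth] / (prob_24way[img, truth] + prob_24way[img, distractor])
            dprime[img, distractor] = norm.ppf(hit_rate) - norm.ppf(false_alarms)
    dprime = np.clip(dprime, -clip, clip)  # the text clips d′ above 5
    # subtract the mean d′ of each object-distractor pair to expose image-level variance
    for truth in np.unique(true_labels):
        rows = true_labels == truth
        dprime[rows] -= np.nanmean(dprime[rows], axis=0, keepdims=True)
    return dprime

# behavioral predictivity: Pearson r between the primate and the model matrices, e.g.
# mask = ~np.isnan(primate_dprime)
# score = scipy.stats.pearsonr(primate_dprime[mask], model_dprime[mask])[0]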
3.3 Object solution times

A total of 1,318 grayscale images, containing images from Section 3.1 and MS COCO [26], were presented centrally to behaving monkeys for 100 ms, and neural responses were obtained from 424 IT sites. Similar to [16], we fit a linear classifier on 90% of the data for each 10 ms bin of model activations between 70 and 250 ms and used it to decode the object category of each image from the non-overlapping 10% of the data. The linear classifier consisted of a fully-connected layer followed by a softmax, with Xavier initialization for the weights [8] and l2 regularization with a decay of 0.463; inputs were z-scored, and the classifier was fit with a cross-entropy loss at a learning rate of 1e-4 for 40 epochs with a training batch size of 64, stopping early if the loss fell below 1e-4. The predictions were converted to normalized d′ scores per image ("I1" in [33]) and per time bin. By linearly interpolating between these bins, we determined the exact millisecond when the prediction surpassed a threshold value defined by the monkey's behavioral output for that image, which we refer to as the "object solution time", or OST. Images for which either the model or the neural recordings did not produce an OST because the behavioral threshold was not met were ignored. We report a Spearman correlation between the model OSTs and the actual monkey OSTs (as computed in [16]).
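To make this concrete, the sketch below reads out OSTs from such per-time-bin d′ traces by linear interpolation (the bin centers, per-image thresholds, and array layout are our illustrative assumptions, not the exact reference code):

import numpy as np

def object_solution_times(dprime_per_bin, bin_centers_ms, thresholds_per_image):
    """Per-image OSTs interpolated from d′ computed in each 10 ms bin (Section 3.3 sketch).

    dprime_per_bin: (n_images, n_bins) normalized d′ of the decoder per time bin;
    bin_centers_ms: (n_bins,), e.g. np.arange(75, 255, 10);
    thresholds_per_image: (n_images,) thresholds set by the monkeys' behavioral output.
    """
    osts = np.full(len(dprime_per_bin), np.nan)
    for img, (trace, threshold) in enumerate(zip(dprime_per_bin, thresholds_per_image)):
        above = np.nonzero(trace >= threshold)[0]
        if len(above) == 0:
            continue  # threshold never reached: this image has no OST and is ignored
        i = above[0]
        if i == 0:
            osts[img] = bin_centers_ms[0]
        else:
            # linear interpolation between the last bin below and the first bin above threshold
            t0, t1 = bin_centers_ms[i - 1], bin_centers_ms[i]
            d0, d1 = trace[i - 1], trace[i]
            osts[img] = t0 + (threshold - d0) / (d1 - d0) * (t1 - t0)
    return osts

# final score: Spearman correlation over images where both model and monkey produce an OST, e.g.
# valid = ~np.isnan(model_ost) & ~np.isnan(monkey_ost)
# rho = scipy.stats.spearmanr(model_ost[valid], monkey_ost[valid]).correlation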
3.4 Generalization to new datasets

Neural: New neurons, old images. We evaluated models on an independently collected neural dataset (288 neurons, 2 monkeys, 63 trials per image; [16]) where new monkeys were presented with a subset of 640 images from the 2,560 images we used for neural predictivity.

Neural: New neurons, new images. We obtained a neural dataset from [16] for a selection of 1,600 grayscale MS COCO images [26]. These images are very dissimilar from the synthetic images we used in other tests, providing a strong means to test Brain-Score generalization. The dataset consisted of 288 neurons from 2 monkeys and 45 trials per image. Unlike our previous datasets, this one had a low internal consistency between neural responses, presumably due to the electrodes being near the end of their life and producing unreasonably high amounts of noise. We therefore only used the 86 neurons with an internal consistency of at least 0.9.

Behavioral: New images. We collected a new behavioral dataset, consisting of 200 images (20 objects × 10 images), from Amazon Mechanical Turk users (185,106 trials in total). We used the same experimental paradigm as in our original behavioral test, but none of the objects were from the same category as before.

CIFAR-100. Following the procedure described in [18], we tested how well these models generalize to the CIFAR-100 dataset by only allowing a linear classifier to be retrained for the 100-way classification task (that is, without doing any fine-tuning). As in [18], we used a scikit-learn implementation of multinomial logistic regression using L-BFGS [31], with the best C parameter found by searching a range from .0005 to .05 in 10 logarithmic steps (40,000 images from the CIFAR-100 train set were used for training and the remaining 10,000 for testing; the search range was reduced from [18] because in our earlier tests we found that all models had their optimal parameters in this range). Accuracies are reported on the 10,000 test images.
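A sketch of that CIFAR-100 transfer test is shown below, assuming frozen penultimate-layer activations have already been extracted for the train and test splits (the held-out split used to pick C is our assumption; [18] and [31] describe the reference protocol):

import numpy as np
from sklearn.linear_model import LogisticRegression

def cifar100_transfer(train_features, train_labels, test_features, test_labels):
    """Linear 100-way transfer without fine-tuning (Section 3.4 sketch).

    Features are frozen model activations; 40,000 training and 10,000 test images.
    """
    # 10 logarithmically spaced values of the inverse regularization strength C
    c_range = np.logspace(np.log10(0.0005), np.log10(0.05), 10)
    # hold out part of the training set to select C (illustrative split)
    fit_x, val_x = train_features[:30000], train_features[30000:]
    fit_y, val_y = train_labels[:30000], train_labels[30000:]
    best_c, best_acc = c_range[0], -1.0
    for c in c_range:
        clf = LogisticRegression(C=c, solver='lbfgs', max_iter=1000)  # multinomial with L-BFGS
        clf.fit(fit_x, fit_y)
        accuracy = clf.score(val_x, val_y)
        if accuracy > best_acc:
            best_c, best_acc = c, accuracy
    # refit on the full 40,000 training images with the selected C and report test accuracy
    clf = LogisticRegression(C=best_c, solver='lbfgs', max_iter=1000)
    clf.fit(train_features, train_labels)
    return clf.score(test_features, test_labels)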
4 Results

4.1 CORnet-S is the best brain-predicting model so far

We performed a large-scale model comparison using the most commonly used neural network families: AlexNet [19], VGG [37], ResNet [10], Inception [39-41], SqueezeNet [13], DenseNet [12], MobileNet [11], and (P)NASNet [50, 27]. These networks were taken from publicly available checkpoints: AlexNet, SqueezeNet, and ResNet-{18,34} from PyTorch [30]; Inception, ResNet-{50,101,152}, (P)NASNet, and MobileNet from TensorFlow-Slim [36]; and Xception, DenseNet, and VGG from Keras [3]. As such, the training procedure differs between models, and our results should be attributed to those model instantiations and not to architecture families. To further map out the space of possible architectures, we included a family of models called BaseNets: lightweight AlexNet-like architectures with six convolutional layers and a single fully-connected layer, captured at various stages of training. Various hyperparameters were varied between BaseNets, such as kernel sizes, nonlinearities, and learning rates.

Figure 1 shows how models perform on Brain-Score and ImageNet. CORnet-S outperforms alternative models by a large margin, with a Brain-Score of .471. Top ImageNet models also perform well, with the leading models stemming from the DenseNet and ResNet families. Interestingly, the models that rank highest on ImageNet performance are not the ones scoring high on brain data, suggesting a potential disconnect between ImageNet performance and fidelity to brain mechanisms. For instance, despite its superior performance of 82.9% top-1 accuracy on ImageNet, PNASNet only ranks 13th on the overall Brain-Score. Models with an ImageNet top-1 performance below 70% show a strong correlation with Brain-Score of .90, but above 70% ImageNet performance there was no significant correlation (p ≫ .05, cf. Figure 1).

Figure 2: Brain-Score generalization across datasets: (a) to neural recordings in new subjects with the same stimulus set, (b) to neural recordings in new subjects with a very different stimulus set (MS COCO), (c) to behavioral responses in new subjects with new object categories, (d) to CIFAR-100.

We further asked if Brain-Score reflects idiosyncrasies of the particular datasets that we included in this benchmark or instead, more desirably, provides an overall evaluation of how brain-like models are. To address this question, we performed four different tests with various generalization demands (Fig. 2; CORnet-S was excluded). First, we compared the scores of models predicting IT neural responses to a set of new IT neural recordings [16] where new monkeys were shown the same images as before. We observed a strong correlation between the two sets (Pearson r = .87). When compared on predicting IT responses to a very different image set (1,600 MS COCO images [26]), model rankings were still strongly correlated (Pearson r = .57). We also found a strong correlation between model scores on our original behavioral set and a newly obtained set of behavioral responses to images from 20 new categories that were not used before (200 images total; Pearson r = .85). Finally, we evaluated model feature generalization to CIFAR-100 without fine-tuning (following Kornblith et al. [18]). Again, we observed a compelling correlation to Brain-Score values (Pearson r = .64). Overall, we expect that adding more benchmarks to Brain-Score will lead scores to converge further.

4.2 CORnet-S is the best on ImageNet and CIFAR-100 among shallow models

Due to anatomical constraints imposed by the brain, CORnet-S's architecture is much more compact than the majority of deep models in computer vision (Fig. 3, middle). Compared to similar models with a depth of less than 50, CORnet-S is shallower yet better than other models in ImageNet top-1 classification accuracy. AlexNet and IamNN are even shallower (depths of 8 and 14) but suffer in classification accuracy (57.7% and 69.6% top-1, respectively); CORnet-S provides a good trade-off between the two with a depth of 15 and a top-1 accuracy of 73.1%. Several epochs later in training, top-1 accuracy actually climbed to 74.4%, but since we are optimizing for the brain, we chose the epoch with maximum Brain-Score. CORnet-S also achieves the best transfer performance among similarly shallow models (Fig. 3, right), indicating the robustness of this model.
4.3 CORnet-S mediates between compactness and high performance through recurrence

To determine which elements of the circuitry are critical to CORnet-S, we altered its block structure and recorded the changes in Brain-Score (Fig. 4). We only used V4, IT, and behavioral predictivity in this analysis in order to understand the non-temporal value of the CORnet-S structure. We found that the most important factor was the presence of at least a few steps of recurrence in each block. Having a fairly wide bottleneck (at least 4x expansion) and a skip connection were other important factors. On the other hand, adding more recurrence or having five areas in the model instead of four did not improve the model or hurt its Brain-Score. Other factors affected mostly ImageNet performance, including using two convolutions instead of three within a block, having more areas in the model, and using batch normalization per time step instead of a global group normalization [45]. The type of gating did not seem to matter. However, note that we kept training with identical hyperparameters for all these model variants. We therefore cannot rule out that the reported differences could be minimized if more optimal hyperparameters were found.

Figure 3: Depth versus (left) Brain-Score, (middle) ImageNet top-1 performance, and (right) CIFAR-100 transfer performance. Most simple models perform poorly on Brain-Score and ImageNet, and generalize less well to CIFAR-100, while the best models are very deep. CORnet-S offers the best of both worlds with the best Brain-Score, compelling ImageNet performance, the shallowest architecture we could achieve to date, and the best transfer performance to CIFAR-100 among shallow models. (Note: dots corresponding to MobileNets were slightly jittered along the x-axis to improve visibility.)

Figure 4: CORnet-S circuitry analysis. Each row indicates how ImageNet top-1 and Brain-Score change with respect to the baseline model (in bold) when a particular hyperparameter is changed. The OSTs of IT are not included in the Brain-Score here.

4.4 CORnet-S captures neural dynamics in primate IT

Feed-forward networks cannot make any dynamic predictions over time, and thus cannot capture a critical property of the primate visual system [16, 42]. By introducing recurrence, CORnet-S is capable of producing temporally-varying response trajectories in the same set of neurons. Recent experimental results [16] reveal that the linearly decodable solutions to object recognition are not all produced at the same time in the IT neural population: images that are particularly challenging for deep ANNs take longer to evolve in IT. This timing provides a strong test for the model: does it predict image-by-image temporal trajectories in IT neural responses over time? We thus estimated for each image when explicit object category information becomes available in CORnet-S, termed the "object solution time" (OST), and compared it with the same measurements obtained from monkey IT cortex [16]. Importantly, the model was never trained to predict monkey OSTs. Rather, a linear classifier was trained to decode object category from neural responses and from the model's responses at each 10 ms window (Section 3.3). The OST is defined as the time when this decoding accuracy reaches a threshold defined by monkey behavioral accuracy. We converted the two IT timesteps in CORnet-S to milliseconds by setting t0 = 0-150 ms and t1 = 150+ ms. We evaluated how well CORnet-S could capture the fine-grained temporal dynamics in primate IT cortex and report a correlation score of .25 (p < 10^-6; Figure 5). Feed-forward models cannot capture neural dynamics and thus scored 0.

5 Discussion

Figure 5: CORnet-S captures neural dynamics. A linear decoder is fit to predict object category at each 10 ms window of IT responses in model and monkey. We then tested object solution times (OST) per image, i.e., when d′ scores of model and monkey surpass the threshold of the monkey behavioral response. r is computed on the raw data, whereas the plot visualizes binned OSTs. Error bars denote s.e.m. across images.

We developed a relatively shallow recurrent model, CORnet-S, that follows neuroanatomy more closely than standard machine learning ANNs, and is among the top models on Brain-Score yet remains competitive on ImageNet and on transfer tests to CIFAR-100. As such, it combines the best of both neuroscience desiderata and machine learning engineering requirements, demonstrating that models that satisfy both communities can be developed.

While we believe that CORnet-S is a closer approximation to the anatomy of the ventral visual stream than current state-of-the-art deep ANNs because we specifically limit the number of areas and include recurrence, it is still far from complete in many ways. From a neuroscientist's point of view, on top of the lack of biologically-plausible learning mechanisms (self-supervised or unsupervised), a better model of the ventral visual pathway would include more anatomical and circuitry-level details, such as the retina or the lateral geniculate nucleus.
Similarly, adding a skip\nconnection was not informed by cortical circuitry\nproperties but rather proposed by He et al. [10] as a\nmeans to alleviate the degradation problem in very\ndeep architectures. But we note that not just any ar-\nchitectural choices work. We have tested hundreds of architectures before \ufb01nding CORnet-S type of\ncircuitries (Figure 4).\nA critical component in establishing that models such as CORnet-S are strong candidate models for\nthe brain is Brain-Score, a framework for quantitatively comparing any arti\ufb01cial neural network to the\nbrain\u2019s neural network for visual processing. Even with the relatively few brain benchmarks that we\nhave included so far, the framework already reveals interesting patterns. First, it extends prior work\nshowing that performance correlates with brain similarity. However, adding recurrence allows us to\nbreak from this trend and achieve much better alignment to the brain. Even when the OST measure\nis not included in Brain-Score, CORnet-S remains one of the top models, indicating its general\nutility. On the other hand, we also \ufb01nd a potential disconnect between ImageNet performance and\nBrain-Score with PNASNet, a state-of-the-art model on ImageNet used in our comparisons, that is not\nperforming well on brain measures, whereas even small networks with poor ImageNet performance\nachieve reasonable scores. We further observed that models that score high on Brain-Score also tend\nto score high on other datasets, supporting the idea that Brain-Score re\ufb02ects how good a model is\noverall, not just on the four particular neural and behavioral benchmarks that we used.\nHowever, it is possible that the observed lack of correlation is only speci\ufb01c to the way models were\ntrained, as reported recently by Kornblith et al. [18]. For instance, they found that the presence\nof auxiliary classi\ufb01ers or label smoothing does not affect ImageNet performance too much but\nsigni\ufb01cantly decreases transfer performance, in particular affecting Inception and NASNet family of\nmodels, i.e., the ones that performed worse on Brain-Score than their ImageNet performance would\nimply. Kornblith et al. [18] reported that retraining these models with optimal settings markedly\nimproved transfer accuracy. Since Brain-Score is also a transfer learning task, we cannot rule out that\nBrain-Score might change if we retrained the affected models classes. Thus, we reserve our claims\nonly about the speci\ufb01c pre-trained models rather than the whole architecture classes.\nMore broadly, we suggest that models of brain processing are a promising opportunity for collab-\noration between neuroscience and machine learning. These models ought to be compared through\nquanti\ufb01ed scores on how brain-like they are, which we here evaluate with a composite of many\nneural and behavioral benchmarks in Brain-Score. 
With CORnet-S, we showed that neuroanatomical\nalignment to the brain in terms of compactness and recurrence can better capture brain processing by\npredicting neural \ufb01ring rates, image-by-image behavior, and even neural dynamics, while simultane-\nously maintaining high ImageNet performance and outperforming similarly compact models.\n\n9\n\n\fAcknowledgments\n\nWe thank Simon Kornblith for helping to conduct transfer tests to CIFAR, and Maryann Rui and\nHarry Bleyan for the initial prototyping of the CORnet family.\nThis project has received funding from the European Union\u2019s Horizon 2020 research and innovation\nprogramme under grant agreement No 705498 (J.K.), US National Eye Institute (R01-EY014970,\nJ.J.D.), Of\ufb01ce of Naval Research (MURI-114407, J.J.D), the Simons Foundation (SCGB [325500,\n542965], J.J.D; 543061, D.L.K.Y), the James S. McDonnell foundation (220020469, D.L.K.Y.) and\nthe US National Science Foundation (iis-ri1703161, D.L.K.Y.). This work was also supported in part\nby the Semiconductor Research Corporation (SRC) and DARPA. The computational resources and\nservices used in this work were provided in part by the VSC (Flemish Supercomputer Center), funded\nby the Research Foundation - Flanders (FWO) and the Flemish Government \u2013 department EWI.\n\nReferences\n[1] Moshe Bar, Karim S Kassam, Avniel Singh Ghuman, Jasmine Boshyan, Annette M Schmid, Anders M\nDale, Matti S H\u00e4m\u00e4l\u00e4inen, Ksenija Marinkovic, Daniel L Schacter, Bruce R Rosen, et al. Top-down\nfacilitation of visual recognition. Proceedings of the national academy of sciences, 103(2):449\u2013454, 2006.\n\n[2] Pouya Bashivan, Kohitij Kar, and James J DiCarlo. Neural population control via deep image synthesis.\n\nScience, 364(6439), 2019.\n\n[3] Fran\u00e7ois Chollet et al. Keras. https://keras.io, 2015.\n\n[4] Radoslaw M Cichy, Aditya Khosla, Dimitrios Pantazis, Antonio Torralba, and Aude Oliva. Deep neural\nnetworks predict hierarchical spatio-temporal cortical dynamics of human visual object recognition. arXiv\npreprint arXiv:1601.02970, 2016.\n\n[5] Alex Clarke, Barry J Devereux, and Lorraine K Tyler. Oscillatory dynamics of perceptual to conceptual\ntransformations in the ventral visual pathway. Journal of cognitive neuroscience, 30(11):1590\u20131605, 2018.\n\n[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical\nimage database. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 248\u2013255. IEEE,\n2009. ISBN 978-1-4244-3992-8. doi: 10.1109/CVPR.2009.5206848.\n\n[7] James J DiCarlo, Davide Zoccolan, and Nicole C Rust. How does the brain solve visual object recognition?\n\nNeuron, 73(3):415\u2013434, 2012.\n\n[8] Xavier Glorot and Yoshua Bengio. Understanding the dif\ufb01culty of training deep feedforward neural\nnetworks. In International Conference on Arti\ufb01cial Intelligence and Statistics (AISTATS), pp. 249\u2013256,\n2010.\n\n[9] Umut G\u00fc\u00e7l\u00fc and Marcel AJ van Gerven. Deep neural networks reveal a gradient in the complexity of\n\nneural representations across the ventral stream. Journal of Neuroscience, 35(27):10005\u201310014, 2015.\n\n[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\n\nIn Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770\u2013778, 2016.\n\n[11] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand,\nMarco Andreetto, and Hartwig Adam. 
MobileNets: Ef\ufb01cient Convolutional Neural Networks for Mobile\nVision Applications. arXiv preprint arXiv:1704.04861, 2017.\n\n[12] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected\n\nconvolutional networks. In CVPR, 2017.\n\n[13] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer.\nSqueezeNet: AlexNet-level accuracy with 50x fewer parameters and. arXiv preprint arXiv:1602.07360,\n2016.\n\n[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing\n\ninternal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\n[15] Stanis\u0142aw Jastrzebski, Devansh Arpit, Nicolas Ballas, Vikas Verma, Tong Che, and Yoshua Bengio.\n\nResidual connections encourage iterative inference. arXiv preprint arXiv:1710.04773, 2017.\n\n10\n\n\f[16] Kohitij Kar, Jonas Kubilius, Kailyn Schmidt, Elias B Issa, and James J DiCarlo. Evidence that recurrent cir-\ncuits are critical to the ventral stream\u2019s execution of core object recognition behavior. Nature neuroscience,\npp. 1, 2019.\n\n[17] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. Deep supervised, but not unsupervised, models\n\nmay explain it cortical representation. PLoS computational biology, 10(11):e1003915, 2014.\n\n[18] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do Better ImageNet Models Transfer Better? arXiv\n\npreprint arXiv:1805.08974, 2018.\n\n[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep convolutional\n\nneural networks. In Advances in neural information processing systems, pp. 1097\u20131105, 2012.\n\n[20] Jonas Kubilius. Predict, then simplify. NeuroImage, 180:110 \u2013 111, 2018.\n\n[21] Jonas Kubilius, Stefania Bracci, and Hans P Op de Beeck. Deep neural networks as a computational model\n\nfor human shape sensitivity. PLoS computational biology, 12(4):e1004896, 2016.\n\n[22] Victor AF Lamme and Pieter R Roelfsema. The distinct modes of vision offered by feedforward and\n\nrecurrent processing. Trends in neurosciences, 23(11):571\u2013579, 2000.\n\n[23] Sam Leroux, Pavlo Molchanov, Pieter Simoens, Bart Dhoedt, Thomas Breuel, and Jan Kautz. Iamnn: Itera-\ntive and adaptive mobile neural network for ef\ufb01cient image classi\ufb01cation. arXiv preprint arXiv:1804.10123,\n2018.\n\n[24] Ming Liang and Xiaolin Hu. Recurrent convolutional neural network for object recognition. In The IEEE\n\nConference on Computer Vision and Pattern Recognition (CVPR), June 2015.\n\n[25] Qianli Liao and Tomaso Poggio. Bridging the gaps between residual learning, recurrent neural networks\n\nand visual cortex. arXiv preprint arXiv:1604.03640, 2016.\n\n[26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll\u00e1r,\nand C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on\ncomputer vision, pp. 740\u2013755. Springer, 2014.\n\n[27] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille,\n\nJonathan Huang, and Kevin Murphy. Progressive Neural Architecture Search. arXiv preprint, 2017.\n\n[28] Najib J Majaj, Ha Hong, Ethan A Solomon, and James J DiCarlo. Simple learned weighted sums of inferior\ntemporal neuronal \ufb01ring rates accurately predict human core object recognition performance. 
Journal of\nNeuroscience, 35(39):13402\u201313418, 2015.\n\n[29] Aran Nayebi, Daniel Bear, Jonas Kubilius, Kohitij Kar, Surya Ganguli, David Sussillo, James J DiCarlo,\nand Daniel LK Yamins. Task-driven convolutional recurrent models of the visual system. arXiv preprint\narXiv:1807.00053, 2018.\n\n[30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming\n\nLin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.\n\n[31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,\nR. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay.\nScikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825\u20132830, 2011.\n\n[32] Karim Rajaei, Yalda Mohsenzadeh, Reza Ebrahimpour, and Seyed-Mahdi Khaligh-Razavi. Beyond core\n\nobject recognition: Recurrent processes account for object recognition under occlusion. bioRxiv, 2018.\n\n[33] Rishi Rajalingham, Elias B Issa, Pouya Bashivan, Kohitij Kar, Kailyn Schmidt, and James J DiCarlo.\nLarge-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys,\nand state-of-the-art deep arti\ufb01cial neural networks. Journal of Neuroscience, pp. 7255\u20137269, 2018.\n\n[34] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,\nAndrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large\nScale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211\u2013252,\n2015.\n\n[35] Martin Schrimpf. Brain-inspired recurrent neural algorithms for advanced object recognition. Master\u2019s\n\nthesis, Technical University Munich, LMU Munich, University of Augsburg, 2017.\n\n[36] N. Silberman and S. Guadarrama. Tensor\ufb02ow-slim image classi\ufb01cation model library. https://github.\n\ncom/tensorflow/models/tree/master/research/slim, 2016.\n\n11\n\n\f[37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-\n\ntion. arXiv preprint arXiv:1409.1556, 2014.\n\n[38] Courtney J Spoerer, Patrick McClure, and Nikolaus Kriegeskorte. Recurrent convolutional neural networks:\n\na better model of biological object recognition. Frontiers in psychology, 8:1551, 2017.\n\n[39] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru\nErhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of\nthe IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), sep 2015.\nISBN 9781467369640. doi: 10.1109/CVPR.2015.7298594.\n\n[40] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking\n\nthe Inception Architecture for Computer Vision. arXiv preprint, 2015.\n\n[41] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-\n\nresnet and the impact of residual connections on learning. In AAAI, volume 4, pp. 12, 2017.\n\n[42] Hanlin Tang, Martin Schrimpf, William Lotter, Charlotte Moerman, Ana Paredes, J.O. Josue Ortega Caro,\nWalter Hardesty, David Cox, and Gabriel Kreiman. Recurrent computations for visual pattern completion.\nProceedings of the National Academy of Sciences (PNAS), 115(35):8835\u20138840, 2018.\n\n[43] Martin J Tov\u00e9e. Neuronal processing: How fast is the speed of thought? 
Current Biology, 4(12):1125\u20131127,\n\n1994.\n\n[44] Johan Wagemans, James H Elder, Michael Kubovy, Stephen E Palmer, Mary A Peterson, Manish Singh,\nand R\u00fcdiger von der Heydt. A century of gestalt psychology in visual perception: I. perceptual grouping\nand \ufb01gure\u2013ground organization. Psychological bulletin, 138(6):1172, 2012.\n\n[45] Yuxin Wu and Kaiming He. Group normalization. arXiv preprint arXiv:1803.08494, 2018.\n\n[46] Daniel L Yamins, Ha Hong, Charles Cadieu, and James J DiCarlo. Hierarchical modular optimization\nof convolutional networks achieves representations similar to macaque it and human ventral stream. In\nAdvances in neural information processing systems, pp. 3093\u20133101, 2013.\n\n[47] Daniel LK Yamins and James J DiCarlo. Using goal-driven deep learning models to understand sensory\n\ncortex. Nature neuroscience, 19(3):356, 2016.\n\n[48] Daniel LK Yamins, Ha Hong, Charles F Cadieu, Ethan A Solomon, Darren Seibert, and James J DiCarlo.\nPerformance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings\nof the National Academy of Sciences, 111(23):8619\u20138624, 2014.\n\n[49] Amir R Zamir, Te-Lin Wu, Lin Sun, William B Shen, Bertram E Shi, Jitendra Malik, and Silvio Savarese.\nFeedback networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp.\n1808\u20131817. IEEE, 2017.\n\n[50] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning Transferable Architectures for\n\nScalable Image Recognition. arXiv preprint, 2017.\n\n12\n\n\f", "award": [], "sourceid": 6969, "authors": [{"given_name": "Jonas", "family_name": "Kubilius", "institution": "MIT, KU Leuven, Three Thirds"}, {"given_name": "Martin", "family_name": "Schrimpf", "institution": "MIT"}, {"given_name": "Kohitij", "family_name": "Kar", "institution": "MIT"}, {"given_name": "Rishi", "family_name": "Rajalingham", "institution": "MIT"}, {"given_name": "Ha", "family_name": "Hong", "institution": "Bay Labs Inc."}, {"given_name": "Najib", "family_name": "Majaj", "institution": "NYU"}, {"given_name": "Elias", "family_name": "Issa", "institution": "Columbia University"}, {"given_name": "Pouya", "family_name": "Bashivan", "institution": "MIT"}, {"given_name": "Jonathan", "family_name": "Prescott-Roy", "institution": "MIT"}, {"given_name": "Kailyn", "family_name": "Schmidt", "institution": "MIT"}, {"given_name": "Aran", "family_name": "Nayebi", "institution": "Stanford University"}, {"given_name": "Daniel", "family_name": "Bear", "institution": "Stanford University"}, {"given_name": "Daniel", "family_name": "Yamins", "institution": "Stanford University"}, {"given_name": "James", "family_name": "DiCarlo", "institution": "Massachusetts Institute of Technology"}]}