Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Although the idea of a functional correspondence between ANN components and brain regions is a key motivating idea in this paper, it has to be noted that brain “areas” can be functionally and anatomically heterogeneous. Therefore, the one-to-one mapping between model components and brain regions may be somewhat arbitrary and simplistic. Can we really say for sure that it should be four areas, and not five or six? Moreover, the assumption that the circuitry does not differ across regions seems simplistic.

Lines 89-101: How were these architecture details decided on? For example, “V2_COR and IT_COR are repeated twice, V4_COR is repeated four times”. As far as I can see, there is no a priori justification or explanation for many of these choices. If there has been an extensive preliminary architecture search, then this should be stated clearly. (The circuitry analysis in Fig. 5 does mitigate this point somewhat.)

Although the paper is generally well written, there are a number of places where the text doesn’t really flow (perhaps as a result of extensive editing and re-organisation). For example, “Feedforward Simplicity” is mentioned on line 59 without any definition or explanation. (Personally, I find the phrase a bit strange, since the whole point of the models is that they are not feedforward; perhaps “information flow simplicity” or just “model simplicity” would be better?)

Line 164: Counting recurrent paths only once seems a little strange: shouldn’t a model with recurrence be measured as more complex than one without recurrence, all else being equal?

Line 206: I am not sure it is clear what “category” means here.

Line 253: I realise space is a limiting factor, but I think it would be good to have more detailed information about the object recognition task the monkey is performing.

The evolution of representations in recurrent neural network models of object vision over time, and their relation to neural data over time (e.g. line 76), is tested in Clarke et al. 2018 (JoCN), in relation to human MEG data.

One general comment I have is that it is not always clear what this paper’s novel contribution is in relation to other preprints published by what I suspect is the same lab. There are a number of interrelated arXiv preprints on this work, and it is not always clear what is novel in each. For example, is the current paper to be taken as the first published presentation of the “Brain-Score” benchmark? If so, it should be described much more extensively than it is. If not, it should not be regarded as a novel contribution of this paper (just reference the earlier preprint).
UPDATE AFTER AUTHOR RESPONSE

Thank you for the clarifications. I had only one real reservation, regarding the rationale for feedforward simplicity, and this has been clarified. It was a pleasure to review this.

------------------------------

Summary: The paper develops a multi-faceted metric that summarizes how similar a deep network's representations and decisions are to those of the primate ventral visual stream. This builds on a body of recent work that has identified parallels between deep networks and the ventral stream. The past work suggested that deep networks are good but incomplete models of the ventral stream, hinting that better models (and thus richer understanding) might be achieved by somehow modifying deep networks. To make clear progress in this direction, an objective and reasonably comprehensive measure of brain similarity is needed, and this submission provides it. The authors also present a new network model that establishes a new state of the art for this metric. In contrast with established deep networks (some of which score fairly well), this one has structural parallels with the ventral stream, including analogous areas and recurrent connections. The authors also show that while object recognition performance is correlated with brain similarity, this correlation is weak for the recent best-performing networks in the deep learning literature, suggesting a divergence from the brain, which the new metric provides a way to avoid.

Originality: Each of the above contributions is highly original.

Quality: This is thorough, field-leading work. In addition to the main contributions, the authors showed that the results generalized well to new images and monkeys (Fig. 2), and reported the effects of numerous variations on the model (Fig. 5).

Clarity: The text is well written and the figures are nicely done. The supplementary material provides rich additional detail.
A minor limitation is that, since the paper covers a lot of ground, some of the details go by quickly. For example, the description of the network might be expanded slightly. I am not sure the small square labelled "conv / stride 2" in Fig. 1 was explained in the text, or the kind of gating (except in Fig. 5).

Significance: Each of the above contributions is highly significant and likely to inform other researchers' future work in this area.
In this manuscript, the authors design a brain-inspired convolutional recurrent neural network, CORnet-S, that maps the visual ventral pathway to different layers in the network. The CORnet-S model achieves competitive accuracy on ImageNet; more importantly, it achieves state-of-the-art performance on Brain-Score, a comprehensive benchmark that assesses a model on neural predictivity, behavioral predictivity, object solution times (OST), and feedforward simplicity. The manuscript is clearly written and is of interest to a broad audience at the conference. However, I do have several concerns about the manuscript:

* The biggest concern is the justification of the contribution. First, Brain-Score was proposed in 2018 (Schrimpf et al., 2018). Although that is a preprint, it has been widely cited, so I would question counting Brain-Score as a contribution of this manuscript. For me, a novel contribution is the proposal of the OST metric, which expands the Brain-Score benchmark from the single-frame level to the sequence level. Second, as can be seen in Table 1 in the supplemental material, the claimed state-of-the-art performance largely depends on the OST score, which is unfair to all other models without recurrent connections (as their scores are by default 0). If the OST score is not taken into consideration, the performance of CORnet-S is no longer state of the art, although still competitive. In this case, in order to better justify the contribution, evidence from other perspectives is needed, such as the number of parameters, inference speed, GPU memory usage, etc. Another option the authors may consider is to add recurrent connections to other feedforward models with similar depth or number of parameters. If CORnet-S is more efficient than deeper models or performs better than shallow recurrent models, then the contribution can still be justified. At the current stage, however, more experiments and metrics are needed.
* Since OST is a novel contribution of the manuscript, Section 4.6 and Figure 6 should be explained more clearly. For example, why do we need an 80%-10%-10% setting for CORnet-S? How does the 10 ms window map to the dots in Figure 6, and how are they measured in the network?

Minor concerns:

* Figure 2: What does the red dot mean?
* Figure 3: I cannot see the superiority of the CORnet-S model. Since the performances are really close, this figure could perhaps be omitted.

Schrimpf, M., Kubilius, J., Hong, H., Majaj, N. J., Rajalingham, R., Issa, E. B., ... & Yamins, D. L. (2018). Brain-Score: which artificial neural network for object recognition is most brain-like? bioRxiv, 407007.

===========Post-rebuttal============

My concern about the preprint has been addressed by the AC's clarification of policy, so it is no longer an issue. As a result, the Brain-Score benchmark has become a major contribution. The authors also addressed my concern regarding comparison against other shallow recurrent models (in performance, although I am still interested in memory usage, model size, etc.). As a result, my score is updated from 5 to 8.