NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID: 3743
Title: Generalisation in humans and deep neural networks

Reviewer 1

Update after rebuttal: This is a very interesting and well-executed paper. A large number of studies have claimed that DNNs yield performance that resembles that of humans and animals, but this study provides a more fine-grained comparison of the behavior and finds real differences. In particular, the study probes sensitivity to perturbed inputs, something that current DNNs handle far less capably than humans. This provides a valuable dataset and benchmark to aim for, and I expect the ML community will have algorithms that perform far better in short order. I would encourage the authors to make the comparisons of similarities and differences as fine-grained and specific as possible; to suggest potential reasons why these discrepancies might exist (clearly marking these reasons as somewhat speculative if they are not backed by direct evidence); and to expand on potential routes toward better algorithms (again marking this as speculative where appropriate).

Summary: This paper probes the generalization abilities of deep networks and human observers by undertaking an extensive psychophysical investigation of human performance on ImageNet in the presence of strong distortions of the input images. The results point to a large gap between DNNs (which can perform well when trained on a specific distortion) and human observers (who perform well across all distortions). The paper releases the psychophysical dataset and image distortions to aid future research.

Major comments: This paper undertakes an impressively controlled comparison of human performance and DNNs at a behavioral level. While previous work has shown broad similarities between DNNs and human behavior, this evaluation undertakes a much more challenging test of generalization abilities by considering extreme distortions. Additionally, in contrast to prior work that relied on Amazon Mechanical Turk to generate data, the present submission uses carefully controlled laboratory viewing conditions that remove some prior concerns (such as variation in timing, viewing distance, and how attentive AMT subjects might be). Overall, this paper contributes a rigorous psychophysical dataset that could spur a variety of follow-up work in the field.

The most striking deviation between human subjects and the DNNs investigated here is in the entropy of the response distribution. The DNNs apparently become stuck, responding to all distorted images with a single class. On the one hand, this is indicative of a clear failure of the DNNs in their vanilla form. On the other hand, one could imagine simple procedures to improve the situation. For instance, one could adjust the temperature parameter of the final softmax to attain an entropy equal to that of human subjects, and assume responses are sampled from this distribution. It is well known that the final softmax layers do not yield calibrated probabilities, so this may be a fairer comparison. Another route would be to explicitly calibrate the probability distribution in the final layer, e.g. through dropout [75].
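To make the softmax-temperature suggestion concrete, here is a minimal sketch assuming a PyTorch-style classifier that outputs raw logits; the function names and the random logits below are purely illustrative and are not taken from the paper or its code.

```python
import torch
import torch.nn.functional as F

def response_entropy(logits, temperature=1.0):
    """Mean entropy (in bits) of the temperature-scaled softmax outputs."""
    log_probs = F.log_softmax(logits / temperature, dim=-1)
    probs = log_probs.exp()
    entropy_nats = -(probs * log_probs).sum(dim=-1)
    return (entropy_nats / torch.log(torch.tensor(2.0))).mean()

def sample_responses(logits, temperature=1.0, n_samples=1):
    """Sample class responses from a temperature-scaled softmax.

    A temperature above 1 flattens the distribution and raises the
    response entropy; the temperature could be tuned so that the mean
    entropy matches the value measured for human observers.
    """
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, n_samples, replacement=True)

# Illustrative usage with random logits standing in for a DNN's outputs
# (batch of 32 images, 1000 ImageNet classes).
logits = torch.randn(32, 1000)
print(response_entropy(logits, temperature=1.0))
print(response_entropy(logits, temperature=3.0))  # higher T -> higher entropy
responses = sample_responses(logits, temperature=3.0)
```

Dividing the logits by a temperature leaves their ranking unchanged, so this only redistributes probability mass; sampling from the flattened distribution then mimics the response variability of human observers rather than forcing a single deterministic class.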

Reviewer 2

[I have read through the authors' response and modified my review accordingly. Remarks following the authors' response are included below in square brackets.]

The authors run experiments on humans and on standard DNNs to assess their robustness to different types of noise/distortion. They find that, compared to human subjects, pretrained models do not perform well on distorted images, except for color-related distortions. When models are trained on distorted images, they also perform poorly when tested on any distortion other than the one they were trained on.

The paper is clear and well written, and the rigorous human experiments are a potentially valuable contribution to a machine learning problem. However, I find the overall conclusions unsurprising. It is to be expected that DNNs will perform quite poorly on data for which they were not trained. While a close comparison of the weaknesses of humans and DNNs would be very interesting, I feel the present paper does not include much analysis beyond the observation that new types of distortion break performance. I am actually surprised that the DNNs did so well on grayscale images, where performance resembles that for undistorted images without any retraining. Further analysis of this regime could be instructive.

[The authors address this point in their response, noting that the exact points of divergence between humans and deep learning models are worthy of examination, which is absolutely true. I would be very interested in this paper providing more interpretation of the excellent data that has been gathered. What conclusions can be drawn? What explanations are possible for the generalization of deep networks to some types of noise but not to others? The authors are including a detailed, category-level comparison in the supplementary material; hopefully, this will also include interpretation of the observations.]

The kinds of noise introduced should be more clearly described; for example, phase scrambling seems never to be defined. Even more importantly, the parameters by which these types of noise vary (the x-axes in Figure 3) should be made much clearer. [The authors have addressed this point in their revision.]

Minor issues:
- ln 62: “once” should be “one”.
- ln 104: “fair” should be “far”.
- Figure 4 could be presented more clearly. In particular, the x-axis should be labeled clearly to show the correspondence between the training distortion(s) and the testing distortion.
- ln 674 (Supplementary Material): “compaired” should be “compared”.
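Since the review notes that phase scrambling is never defined, the following is a minimal sketch of what the term usually denotes (randomizing the Fourier phase spectrum while preserving the amplitude spectrum); it illustrates the standard technique only and is not necessarily the exact procedure used in the submission.

```python
import numpy as np

def phase_scramble(image, strength=1.0, rng=None):
    """Phase-scramble a grayscale image (2D array).

    The Fourier amplitude spectrum is kept intact while a random phase
    offset, scaled by `strength` in [0, 1], is added to every frequency
    component, destroying spatial structure but preserving the power
    spectrum.
    """
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.fft2(image)
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    random_phase = rng.uniform(-np.pi, np.pi, size=image.shape)
    scrambled = amplitude * np.exp(1j * (phase + strength * random_phase))
    # Taking the real part is a simple approximation; a fully faithful
    # implementation would impose conjugate symmetry on the random phase
    # so that the inverse FFT is exactly real.
    return np.real(np.fft.ifft2(scrambled))

# Illustrative usage on a random array standing in for an image.
img = np.random.rand(224, 224)
distorted = phase_scramble(img, strength=0.5)
```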

Reviewer 3

The paper compares human and deep neural network generalization using images that were distorted in a number of different ways. These comparisons hint at "behavioral differences" between humans and deep neural nets.

The study is well executed and clearly written. The number of experiments conducted is impressive (both for the deep nets and for the psychophysics experiments used as a baseline). Overall, I think the study can help to uncover systematic differences in visual generalization between humans and machines. Such empirical evidence will be needed to inspire algorithms that can counteract distortions. The paper would have been much stronger if the first elements of such algorithms had been outlined: although the empirical part is impressive and interesting, there is no theoretical contribution.

It seems to me that two relevant topics might be worth mentioning in the paper:
1) CAPTCHAs seem to exist exactly because humans can deal with distortions better than machines can; the paper could benefit from a note about that.
2) I am curious about the relation between this work and training with adversarial examples.