Review for NeurIPS paper: Mutual exclusivity as a challenge for deep neural networks

NeurIPS 2020

Review 1

Summary and Contributions: People are biased to assume that each object has a single label, which is called the “mutual exclusivity” bias. This aids inference for novel objects and in other situations. The paper examines whether single hidden layer feedforward and seq2seq neural networks have a mutual exclusivity bias. First, they examine how different variants predict the label of a novel object after learning several object-label pairings in an idealized scenario. Then they examine the extent to which mutual exclusivity exists in three real-world machine translation and two object classification data sets (as far as I can tell, a corpus analysis without model comparisons, except for the object classification). Contributions: 1. Bridges important topics in cognitive science and many domains within machine learning 2. Provides quantitative evaluation with simulations as to whether a few popular machine learning methods are biased in the same way as human learners (often researchers make claims without any simulations to back them up) 3. Corpus analyses provide some real-world context for the extent to which the mutual exclusivity bias is empirically valid

Strengths: - Paper is of broad interest to NeurIPS community, touching on several important problems across machine learning and cognitive science. - Analyzing the extent of bias towards mutual exclusivity in machine learning models and how that affects performance on real-world data sets is a significant contribution. - The corpus analyses of mutual exclusivity were the first I’ve seen and a more thorough corpus analysis for it would be a substantial contribution.

Weaknesses: - It is difficult to assess how poor the mutual exclusivity bias of the analyzed neural networks is without a comparison model that has mutual exclusivity to some extent. I point out one model that does in the prior work (Zinszer et al., 2018) and the authors reference three, but never evaluated those. Those analyses are critical for understanding the results as well as attempts in machine learning to codify the bias. - The substantial analysis of neural network models for mutual exclusivity is only on synthetic data, whereas the claims are much broader. There is some analysis for the object classification case, but from what is there, it seems it was conducted with an extremely simplified model that differs from the others. - If a corpus analysis of mutual exclusivity is desired, then it would be clearer to establish the methodology for the single language case before moving to data sets containing two languages. Can the authors complete similar analyses with CHILDES (even more interesting given it has more ecological validity) and other single language data sets? I have a more minor concern with the rigor of the statistical analyses. I appreciate that the authors explored a wide range of options for each architecture. However, for every precise setting, it appears the resulting neural network was run once. If we think of the ME score for a given fully specified neural network as a random variable, we are essentially getting a single Monte Carlo estimate of its expected value. Based on the results, it seems very likely that repeating the analyses for each fully specified network multiple times, to get a proper estimate, would yield similar results. I hope the authors consider doing this.
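The repeated-run suggestion above can be sketched as follows. This is only an illustration under stated assumptions: `simulated_me_score` is a hypothetical stand-in that draws a noisy number instead of training a network, and the seed count of 20 is arbitrary. The point is the shape of the analysis, a mean with a standard error over independent runs, not the numbers it produces.

```python
import random
import statistics


def simulated_me_score(seed: int) -> float:
    """Hypothetical stand-in for one train-and-evaluate run of a fully specified network."""
    rng = random.Random(seed)
    # Placeholder noise clipped to [0, 1]; a real analysis would train a model here.
    return min(1.0, max(0.0, rng.gauss(0.05, 0.02)))


def estimate_me_score(seeds):
    """Return the mean ME score and its standard error over independent runs."""
    scores = [simulated_me_score(s) for s in seeds]
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, sem


mean, sem = estimate_me_score(range(20))
print(f"ME score: {mean:.3f} +/- {sem:.3f} (n=20)")
```

Reporting the standard error alongside the mean would make it possible to tell whether differences between configurations exceed run-to-run variation.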

Correctness: The paper would benefit from using the same strong methodology from the synthetic studies throughout (as well as the other issues mentioned in the weaknesses section).

Clarity: The paper is generally well-written. There are many different methodologies and models used across multiple domains and it was very difficult for me to keep track of which was being used when (I hadn’t realized they didn’t use the same models from the synthetic data analyses in the next studies until a second read-through). A minor comment: The word bias has a lot of different senses in this community and it would help to clarify which one you are using in each situation. Here are three. There is bias in the strict statistical sense – the difference between the expected value of an estimator and the “true” value of the quantity it is estimating. There is bias as in all things being equal, a learner will select x if it has a bias towards x. And there’s bias in a broader sense, just prior weight towards something (and also systematic error, as in the heuristics and biases judgment and decision-making framework). I think you mean the second one usually, but I’m not certain.

Relation to Prior Work: Zinszer et al. (2018) define a Bayesian model of mutual exclusivity and evaluate its performance on data from behavioral experiments. The authors cite several machine learning methods that have already attempted to incorporate mutual exclusivity biases into deep learning methods. This suggests that the general finding (that deep learning is biased away from mutual exclusivity) is already somewhat known in the community. Examining the performance of those techniques would strengthen the paper substantially. References: Zinszer, B. D., Rolotti, S. B., Li, F., & Li, P. (2018). Bayesian Word Learning in Multiple Learning Environments. Cognitive Science, 32(2), 439-462.

Reproducibility: Yes

Additional Feedback: Reply to Author Feedback: Thank you for the careful consideration of my critique and I am glad it sounds like it was helpful for you. My general feeling about the manuscript remains the same -- it is fantastic work in progress, but not quite ready for a premier conference publication. With respect to the CHILDES/ecological data set question, thank you for the clarification. Word learning models are evaluated against CHILDES/Age of Acquisition norms quite frequently. For example, see Hills et al. (2009). Hills, T. T., Maouene, M., Maouene, J., Sheya, A., & Smith, L. (2009). Longitudinal Analysis of Early Semantic Networks: Preferential Attachment or Preferential Acquisition. Psychological Science, 20(6), 729-739. Also, in case it is helpful, there was a recent Behavior Research Methods article with nice wrappers for accessing CHILDES with R: Sanchez, A., Meylan, S. C., Braginsky, M., MacDonald, K. E., Yurovsky, D., & Frank, M. C. (2019). childes-db: A flexible and reproducible interface to the child language data exchange system. Behavior Research Methods, 51, 1928-1941. --------- Broader impacts discusses impact to machine learning research and not society.

Review 2

Summary and Contributions: This paper highlights the notion of mutual exclusivity, positioning it as a useful, relied-upon signal for human learning (e.g. of early language) and showing that it is often lacking, or worse, actively biased against, in standard deep network settings of importance. The paper suggests that keeping the tenet of ME in mind may lead to better generalizing neural networks.

Strengths: The motivation behind the work is strong, and it makes good points about the standard learning paradigms across tasks (in supervised learning in vision as well as in language) and the anti-ME bias they naturally acquire in training. The evaluation is convincing in that it quantifies this undesirable bias clearly. The paper also investigates the natural desired ME bias in toy and real scenarios and shows further that the anti-ME bias is undesirable. I believe that the topic of the paper will be of interest to the NeurIPS community in designing neural networks that are fundamentally more able to generalize, based on an elegant principle shown to be present in human learning.

Weaknesses: Of course, since the topic is quite broad and challenging, there are no real concrete suggestions made toward mending this flaw. Some regularization techniques are attempted but they seem to be more to show that simple fixes won't work as opposed to offering a solution. The scenarios considered are all supervised (e.g. Omniglot and Imagenet), and an analysis of real-world unlabeled data would be interesting as well. Overall, I also get the impression that many of the computations conducted are quite simple and trivial, with the caveat that the core notion of the paper (desiring mutual exclusivity in deep nets) seems to not be widely accepted, so perhaps even simple metrics suffice. My main concern is that the analysis could go further in investigating non-toy neural networks, even if such toy experiments are constructed by analogy to similar experiments in child language learning.

Correctness: The claims are sound (with sufficient evidence in both cognitive science and machine learning) and informative. Empirical methodology is clear and simple, and conveys the argument that ME is absent well.

Clarity: Yes, the paper is well-written and concise.

Relation to Prior Work: Yes. While previous work does touch on the challenge of mutual exclusivity, this paper differentiates itself by investigating it in the modern deep learning setting with common datasets, and by quantifying the degree to which networks struggle with this learning principle.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: This paper examines an inductive bias---called the mutual exclusivity (ME) bias---observed in human learners, wherein novel words are more likely to be mapped to novel objects than known objects with known words. For a relatively wide range of conditions, the authors show that feedforward networks subjected to a particular interpretation of a version of the learning problem do not exhibit the ME bias. They further illustrate through experiments in two domains (translation and image classification) where the ME bias could be useful if correctly instantiated with machine learning models.

Strengths: The strength of the work is mainly empirical, in particular in elucidating two domains (translation and image classification) where the ME bias may be useful. The operationalizations are well thought out, and could inspire new benchmarks for such tasks. After reading the paper, I do feel the problem is a worthy challenge to the field, similar to Lake's Omniglot dataset that inspired numerous few-shot learning models. It is indeed surprising that the machine learning community has continued to focus so strongly on unrealistic tasks where a fixed, flat set of labels is observed once.

Weaknesses: The weaknesses of the paper in my opinion are the first section on feedforward networks, and the failure to follow leads for successfully introducing the bias instead. In the first experiment section, the authors use feedforward networks to map one-hots to one-hots, where the output labels are fixed permutations of the inputs, and one category slot represents a novel item. My first thought before reading the details is echoed by the authors themselves in the discussion: "We posit it more generally affects any discriminative model class trained to maximize log-likelihood (like multi-class softmax regression, decision trees, etc.)." It's hard to believe that many readers would see this experiment as being well-suited to the model class. It's not the class I would go to in order to address such a problem. Indeed, the authors note themselves that "In a trained network, the optimal activation value for an unused output node is zero." While the authors cite more promising leads themselves (Santoro et al. [41] and Lake [42]), they do not pursue them.
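The one-hot setup described above can be reproduced in a few lines, which also shows why the outcome is expected for this model class. The sketch below is not the authors' code: it assumes a single-layer softmax regression with zero initialization, uses the identity mapping in place of a fixed permutation, and takes the ME score to be the total probability placed on unused outputs for a novel input. The function and parameter names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_and_score(n_total=10, n_seen=8, epochs=200, lr=0.5):
    """Train on seen one-hot pairs, then return the ME score for one novel input."""
    # W[j][i] is the weight from input i to output j; b[j] is the output bias.
    W = [[0.0] * n_total for _ in range(n_total)]
    b = [0.0] * n_total
    for _ in range(epochs):
        for i in range(n_seen):  # seen input i is paired with label i
            probs = softmax([W[j][i] + b[j] for j in range(n_total)])
            for j in range(n_total):
                grad = probs[j] - (1.0 if j == i else 0.0)  # cross-entropy gradient
                W[j][i] -= lr * grad
                b[j] -= lr * grad
    novel = n_seen  # an input index never active during training
    probs = softmax([W[j][novel] + b[j] for j in range(n_total)])
    return sum(probs[n_seen:])  # probability mass on the unused outputs


score = train_and_score()
chance = 2 / 10
print(f"ME score {score:.4f} vs. chance {chance:.2f}")
```

Because every training step lowers the bias of each unused output, the score for a novel input falls below the uniform baseline, which is the anti-ME behavior discussed in this review.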

Correctness: I see no major issues with the claims and methods proposed.

Clarity: The paper was very well written and easy to follow.

Relation to Prior Work: Previous work is clearly reviewed but I would have expected more discussion of successful approaches or leads since that's what the paper aims to motivate.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper tests whether two of the typical deep learning architectures (seq2seq and classification) exhibit a bias observed in child language acquisition. This bias, called the mutual exclusivity bias, suggests that children tend to assign one label to each object. In other words, when children hear a new word while observing two objects, one familiar and one novel, they will associate the new word with the novel object. The authors argue that this bias is essential for deep learning models since it enables fast generalization (learning a new word in a few examples) in children. The authors show that the configuration of models they examine does not exhibit the mutual exclusivity bias.

Strengths: The work is grounded in an observation from the child language acquisition literature. The authors design several experiments to empirically test whether deep learning models can predict this behavior. Exploring inductive biases that would help deep learning models generalize better is an important research direction for the NeurIPS community.

Weaknesses: Testing for the bias. The experiments assume that the models exhibit the bias if the probability of the new class(es) given a new word is one. It is not clear why the models are expected to assign a probability of one to the correct (new) class. When testing classification models, the correct class is the one that the model assigns the highest probability to, and this probability is often much smaller than one. This is because the sum of the prior probability over all incorrect classes is relatively large when there are many classes, even though the probability of individual classes is small. Moreover, when testing the models in a continual learning setup, the authors should stop training before the model overfits, and report the performance of the models on a held-out split. Limits of architectural search. Although the authors explore a range of parameters and optimization techniques, exploring all possible combinations of loss, priors, and architectures is not possible. Without any theoretical proof, the conclusions of this work only hold for the settings the authors have examined. There are two relevant approaches that the authors have not taken into account: adding a prior to new classes by (a) adding a bias parameter that is added to the weights, or (b) adding it when calculating the softmax in the final layer. The importance of the mutual exclusivity bias. Many words have synonyms, and though the mutual exclusivity bias helps to generalize for novel words, it will prevent the model from learning multiple correct mappings. In the translation setting, the bias prevents many-to-one or one-to-many mappings between the two languages, which are desired for many language pairs.
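Two of the points above lend themselves to a small numerical illustration: that the highest-probability class can sit far below a probability of one when there are many classes, and that a prior on new classes could be added at the softmax, as in option (b). The sketch is one possible reading of that suggestion, not a method from the paper; `novel_mass`, `novel_bias`, and the logit values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def novel_mass(logits, novel_idx, novel_bias=0.0):
    """Total probability on novel classes after adding novel_bias to their logits."""
    shifted = [z + novel_bias if j in novel_idx else z for j, z in enumerate(logits)]
    probs = softmax(shifted)
    return sum(probs[j] for j in novel_idx)


# With 100 classes, the top class can win the argmax with a small probability.
many = softmax([2.0] + [0.0] * 99)
top = max(range(len(many)), key=many.__getitem__)
print(f"argmax class {top} has probability {many[top]:.3f}")

# A constant added to the novel-class logits shifts mass toward those classes.
flat = [0.0] * 10
novel = {8, 9}
print(f"novel mass without prior: {novel_mass(flat, novel):.3f}")
print(f"novel mass with prior:    {novel_mass(flat, novel, novel_bias=2.0):.3f}")
```

How large such a prior should be, and whether it should decay as classes become familiar, is left open here.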

Correctness: The empirical methodology is mostly correct but not all the conclusions hold given the experiments. See above.

Clarity: The paper is well written.

Relation to Prior Work: Yes, the authors discuss work in both AI and cognitive science. There are some missing references in the cognitive modelling domain that are particularly relevant: http://papers.nips.cc/paper/5205-visual-concept-learning-combining-machine-vision-and-bayesian-generalization https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1551-6709.2010.01104.x

Reproducibility: No

Additional Feedback: After reading the response: Thanks for the explanations. For evaluating the mutual exclusivity bias, I think a more apples-to-apples comparison is to check if the model will choose any of the novel classes (as opposed to comparing a score). In the continual learning set-up, if you do not select a model based on a held-out set, the model can be overfitted to the training data, but this is an optimization issue (and not relevant to the inability to model a bias).
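The difference between the two evaluation criteria can be made concrete with a short sketch. It assumes the model's output for each novel input is available as a list of class probabilities; the function names and the toy probability rows are hypothetical and are not taken from the paper.

```python
def me_score(probs, novel_idx):
    """Score-based criterion: total probability placed on novel classes."""
    return sum(probs[j] for j in novel_idx)


def chooses_novel(probs, novel_idx):
    """Choice-based criterion: does the argmax land on a novel class?"""
    best = max(range(len(probs)), key=probs.__getitem__)
    return best in novel_idx


def choice_rate(rows, novel_idx):
    """Fraction of novel inputs for which the model picks a novel class."""
    return sum(chooses_novel(r, novel_idx) for r in rows) / len(rows)


novel = {2}
rows = [
    [0.3, 0.3, 0.4],  # novel class wins the argmax although its score is only 0.4
    [0.5, 0.3, 0.2],  # a familiar class wins
]
for r in rows:
    print(me_score(r, novel), chooses_novel(r, novel))
print("choice rate:", choice_rate(rows, novel))
```

Under the choice-based criterion the first row counts as a mutually exclusive response even though its score is well below one, which is the distinction drawn in the comment above.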