Review for NeurIPS paper: Learning abstract structure for drawing by efficient motor program induction

NeurIPS 2020

Review 1

Summary and Contributions: This paper compares a computational model that learns drawing programs with a survey of human subjects. It shows similarities between the human subjects and the learned computational models in terms of the abstract structure of the learned programs.

Strengths: The paper describes a thorough experiment on human subjects and analyses the results using a state-of-the-art computational model of program induction. The results are very interesting: they shed light on human intelligence, can serve as inspiration for artificial intelligence, and can drive further research on models for program induction and for learning abstract structures and concepts.

Weaknesses: There's no direct conclusion on how to improve program induction models given the presented results.

Correctness: As far as I can tell, everything is correct.

Clarity: The paper is very well written; see some minor comments below.

Relation to Prior Work: As far as I can tell, this is clearly presented, although I am not too familiar with the relevant literature.

Reproducibility: Yes

Additional Feedback: Some aspects of the computational model are not clear. 1. Lines 94-106 describe a way to bias the model towards efficient motor control without supervision from the motor data. Why isn't this used in the PI models in Table 2 ("Admissible traj., all equal probability")? 2. Most of the results use the hybrid models, which are trained using the motor data. This makes their training set different from that of the human subjects, who only see the images. How is this comparable? I think this should be discussed. Some more comments: 1. How consistent are models trained on the same data? It would be good to see a plot like 3c and 3d for models trained with different random seeds. 2. It would be useful to see the updated libraries after training. 3. Section 3.3 describes the results in Figure 6 in terms of the models HM1 and HM2, but the figure mentions PI1 and PI2. I am not sure which are the correct ones. Also, line 182 refers to Table 2; should this be Table 1? Update: I have read the authors' feedback and am happy to keep my score.

Review 2

Summary and Contributions: This is a cognitive modeling paper, combining results of a human subjects experiment on copying simple drawings with a Bayesian model of program induction. Properties of generalization were measured by using two different training sets and measuring properties on the same test set, for both people and the model. Ablated versions of the model were used as well, to argue that the results were due to the claimed factors.

Strengths: The theoretical description was clear and the empirical evaluation was performed well. Since learning is central to NeurIPS, and there is a neurosymbolic component to the model, it should definitely be of interest to this community.

Weaknesses: There are two weaknesses. The first is that this is an incremental step along a path trod by a number of other papers in PNAS, Science, etc. It is not clear how much is new here, compared to those prior efforts, including the two DreamCoder papers. That should be clarified in a revised version. The second weakness is that the method of randomly generating programs until you find something that works ("programs are sampled from a generative model...") is unlikely to scale to more complex drawing tasks. This is the same limitation that prior work in this genre has. By contrast, models like SOAR, which accumulate procedural knowledge via chunking, would become more skilled over time, following the power law of learning. Would the approach described here do that? It might, if the update of the library were handled properly, since the power law of learning falls out from any incremental knowledge accumulation method. That would be an interesting test of this kind of model.
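
For reference, the power law of learning invoked above is commonly written in the form below. The symbols T, N, a, and b are notation introduced here for illustration only and do not come from the paper under review: T(N) stands for the cost of performing the task (for example, time or search effort per drawing) after N practice trials.

```latex
% Power law of practice (illustrative notation, not taken from the paper)
T(N) = a \, N^{-b}, \qquad a > 0, \; b > 0
% Equivalent log-log form: a straight line with slope -b
\log T(N) = \log a - b \, \log N
```

Under this reading, the suggested test would amount to checking whether the model's per-task cost, plotted against the number of training tasks on log-log axes, falls approximately on a straight line with negative slope.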

Correctness: To the extent that this can be determined from a conference paper, the claims, method, and methodology are correct.

Clarity: The writing is very clear.

Relation to Prior Work: As noted above, how this differs from prior work in Tenenbaum's group has not been adequately delineated.

Reproducibility: Yes

Additional Feedback: I think this is a strong paper, in a line of research that is interesting and should be of interest to the NeurIPS community. Making it clearer how this work constitutes an advance over the prior work in this line is the major point I would like to see addressed in the revision.

Review 3

Summary and Contributions: The goal is to study how humans learn to perform a novel task via abstraction and compositionality of simpler routines. The experimental manipulation is to use two different types of training sets that are designed to lead participants to learn different inductive biases. The same test set is used for both groups. Thus, systematic differences in how subjects complete the task across groups may be attributed to differences in how they learned the task, due to the assignment of the training set. A program induction algorithm for learning the task demonstrates the same types of biases as seen in the humans. Also, the outputs of the program learned on training set 1 are more similar to the human drawing trajectories for group 1 than for group 2, and vice versa for the outputs of the program learned on training set 2.

Strengths: The experimental design controls for confounding factors by randomization. The stimulus design is quite ingenious in allowing two different types of drawing strategies to be tested on a common set. The program induction uses state-of-the-art probabilistic inference methods to learn how to draw, and qualitatively replicates important features of human performance. A second experiment with a rotated test set adds extra confirmation that humans learned the task as a composition of simpler tasks. The data displayed are remarkably clean and show high compliance of the human volunteers with the intent of the task. This study is very interesting for researchers who want to understand human learning, and it encourages the use of similar methodology in future studies of human task learning.

Weaknesses: As the authors note, the AI's agreement with human behavior is much lower than the intersubject agreement within the same group. But given the novelty of the work, such a performance gap is understandable. The authors used very strong prior knowledge (e.g. pre-defined primitives), which, while intuitively plausible, may limit the generality of the AI.

Correctness: The algorithmic and the experimental methodology appear to be correct.

Clarity: On the whole, the paper is quite readable. One issue that was not immediately clear is the terminology of image, stroke, and trajectory. The meaning of the notation "t draws I" in equation (4) is not clear. I had to refer to the supplement to understand that the trajectory did not refer to the entire image but rather to a single stroke.

Relation to Prior Work: Previous literature from two fields, (1) psychological studies of drawing and (2) AI program induction and drawing are cited. This study is novel in comparing human data to AI behavior.

Reproducibility: Yes

Additional Feedback: The paragraph "Reweighting trajectories by motor cost" could probably be made clearer. It was not clear whether all the details needed to reproduce the implementation of the program induction are given. But these are relatively minor issues, and addressing them would not change the already high score that I am giving this paper.