NeurIPS 2020

Generating Correct Answers for Progressive Matrices Intelligence Tests


Review 1

Summary and Contributions: This paper describes a method to generate the correct choice panels, i.e., the answers to RPM problems. The proposed method consists of three modules: (1) a VAE for learning the latent representation of the choice panel, (2) a CEN for capturing implicit relations in the context panels, and (3) a discriminator for improving the generated image. With properly designed loss functions, the latent representation produced by the CEN is used to sample the hidden vector for the generative process of the VAE, so the proposed method can generate the answer image directly from the given context panels. The experiments are designed to automatically verify the correctness of the generated answers.
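Read literally, the three-module pipeline wires together as follows. This is only my toy sketch of the description above, with numpy linear maps standing in for the actual networks; all names and dimensions here are mine, not the authors':

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 16          # toy panel resolution (hypothetical; the paper's panels are larger)
D = 8               # toy latent dimension

# (1) VAE: encodes a choice panel into a latent vector and decodes it back.
W_enc = rng.normal(size=(H * W, D)) * 0.1
W_dec = rng.normal(size=(D, H * W)) * 0.1

def vae_encode(panel):                    # panel (H, W) -> latent (D,)
    return panel.reshape(-1) @ W_enc

def vae_decode(z):                        # latent (D,) -> panel (H, W)
    return np.tanh(z @ W_dec).reshape(H, W)

# (2) CEN: embeds the 8 context panels into a latent vector that is
#     trained to approximate the VAE latent of the correct answer.
W_cen = rng.normal(size=(8 * H * W, D)) * 0.01

def cen_embed(context):                   # context (8, H, W) -> (D,)
    return context.reshape(-1) @ W_cen

# (3) Discriminator: scores realism of a generated panel.
w_disc = rng.normal(size=(H * W,)) * 0.1

def discriminate(panel):                  # -> scalar in (0, 1)
    return 1.0 / (1.0 + np.exp(-(panel.reshape(-1) @ w_disc)))

# Generation: the CEN latent replaces the VAE latent in the decoder,
# so the answer is produced from the context panels alone.
context = rng.normal(size=(8, H, W))
z_hat = cen_embed(context)
answer = vae_decode(z_hat)
score = discriminate(answer)
```

The point of the wiring is the substitution in the last block: at test time the decoder never sees an encoded answer, only the context embedding.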

Strengths: The proposed method does provide a different perspective compared to previous methods, which learn discriminative models instead of generative ones. The way the context panels are used to generate a hidden representation that approximates the latent representation of the VAE is interesting and novel in the RPM domain. Making the whole generation process end-to-end is a valid contribution to the community.

Weaknesses: The design of the experiments needs further illustration and justification. First of all, the models selected for recognizing the correct choice are not strong enough on the selected dataset, so their accuracy is not trustworthy in this experimental setup. Second, the accuracies of these models on the generated answers are even worse, which makes it difficult to tell whether the generated results are correct. Last and most importantly, the authors should conduct experiments on the RAVEN dataset, because there is a robust model [13] that reaches over 90% accuracy. Generating answers on RAVEN and evaluating their correctness with [13] would be a much better way to justify the proposed method empirically.

Correctness: There is some misunderstanding of the definition of a generative model. A widely accepted definition is "sample from a latent variable to generate a corresponding signal." I would recommend not using the term generative model here. The experiments need further justification.

Clarity: It is easy to read the paper.

Relation to Prior Work: The paper does provide a literature review to distinguish itself from the SOTA methods.

Reproducibility: Yes

Additional Feedback: Please address my question about the experimental setup. I am open to raising the score if the authors can provide insightful feedback. I have read the rebuttal and the comments from the other reviewers. I would like to keep my original rating because the authors' feedback did not address my concern; it looks like they misunderstood my point in the rebuttal. To be more specific, the method used to verify correctness is not trustworthy because (1) the SOTA model chosen to discriminate the answer is not strong enough, reaching only 70% accuracy; (2) in addition, the verification is not conducted on each problem individually, so it is difficult to tell whether the proposed method generates the correct answer panel for the corresponding context; and (3) given (1) and (2), the entire evaluation metric is unreliable. The authors misunderstood my comment in the initial review and gave irrelevant responses in the rebuttal. Therefore I have decided to keep my initial rating.


Review 2

Summary and Contributions: This paper focuses on the progressive matrices intelligence test, where the model generates the correct answer given some contextual images. The quality of the generated images is satisfying, and the performance on multiple-choice tests is competitive.

Strengths: S1: the performance of the proposed model is good; it is able to generate both correct and high-quality images given the abstract context images. S2: the task is interesting to me. S3: ablation studies are conducted that demonstrate how each module works.

Weaknesses: W1: What is the performance of humans on the multiple-choice test? Does the AI outperform humans? W2: It seems that the proposed model just combines previous techniques, like VAEs and GANs, which have been widely applied to image generation; the only difference is that the task here is abstract image generation instead of natural image generation. W3: Which dataset do the authors use in this paper?

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:


Review 3

Summary and Contributions: This paper proposes the task of learning to synthesize correct answers for Raven's Progressive Matrices. The authors argue that it is a more challenging task than existing RPM-style tasks, since the metric of generation is defined on a semantic space: the generated result does not have to be exactly the same as the correct answer, but it should capture the correct underlying relation. To tackle this challenge, the authors design a model with three pathways, for reconstruction, recognition, and generation. For empirical evaluation, the authors compare the generated results and the ground-truth correct results using SOTA recognition models, and find that the models tested achieve similar performance on both. The authors also provide a qualitative study showing that the learned generative model can indeed extract the abstract relations from the provided samples.

Strengths: The model is very carefully designed, even though the idea itself is intuitive. Putting aside the aesthetics of the modeling, this model must have taken quite some effort to design and train. Since the authors present their method with a fairly smooth logical flow, I can only roughly summarize their ideas here. First there is a reconstruction pathway, which is basically a VAE for the correct answer; it learns patterns at the pixel level. Then there is a recognition pathway, which combines the context provided in the sample images (the first two rows/columns of the RPM) with the remaining ones. This pathway inherits the multi-scale abstraction structure from [1], and is trained to extract contextual relations from the given images; it is later used as a context-aware discriminator to train the generator. The generator is based on the decoder of the VAE, but its latent variable is sampled from a distribution conditioned on the context. There is also an unconditioned discriminator that provides GAN-style weak supervision. The empirical results seem interesting: I am amazed the model can generate qualitatively correct images. And the ablation study is comprehensive.
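To make the supervision these pathways imply concrete, the objective might look roughly like the toy sketch below. This is my own reading of the description, not the authors' exact losses; in particular their DS-KLD regularizer and loss weights differ from these plain stand-ins:

```python
import numpy as np

def pathway_losses(x_rec, x_true, z_ctx, mu, logvar, d_fake):
    """Toy versions of the losses the three pathways suggest
    (illustrative only; not the authors' exact objective)."""
    recon = np.mean((x_rec - x_true) ** 2)          # reconstruction pathway: VAE fit
    kld = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))  # VAE prior regularizer
    match = np.mean((z_ctx - mu) ** 2)              # recognition pathway: the context
                                                    # latent should approximate the
                                                    # VAE latent of the correct answer
    adv = -np.mean(np.log(d_fake + 1e-8))           # GAN-style weak supervision
    return recon + kld + match + adv

# Sanity check with trivial inputs: every term vanishes (up to the epsilon).
z = np.zeros(8)
loss = pathway_losses(np.zeros((16, 16)), np.zeros((16, 16)),
                      z, z, np.zeros(8), np.ones(1))
```

Written this way, the `match` term is what turns the VAE decoder into a context-conditioned generator, which is the part of the design I find most interesting.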

Weaknesses: My first concern is that this model seems far from minimalism. Generating the correct answer for an RPM is an interesting task, but one of the reasons it is interesting to the current AI community is that humans can somehow generate correct results without a huge amount of training. Although this work demonstrates the possibility of a generator that shows some reasoning capability, I strongly suspect that this capability is distilled from the subnetworks for context extraction, which are trained with strong supervision. There is still a long distance between this model and the human brain; the latter is believed to be designed by nature following minimalism. And the quest for this minimal model is necessary, because the task of RPM itself is not applicable to real-life scenes: it is the general meta-priors we discover from this task that might be promising for a more complex real-world model. Therefore, the proposed model might be of great interest to communities like image translation, but it is not quite to the point of research on RPM. With that said, there are also quite a few empirical design choices that are probably not generalizable to other tasks, for example the DS-KLD regularization on the generator. Also, VAEs are often criticized for the low authenticity of their generated images. The authors claim that their evaluation scheme, which bypasses a human study and replaces it with SOTA models, is mainly motivated by the cost in human resources. However, I really doubt the actual feasibility of a human study even if adequate human resources were provided: apparently, humans can somehow figure out the blurry patterns in the generated images. Did the authors deliberately choose machines to verify their model because of this? Can the authors describe an experimental setup with hypothetically sufficient labour? Other weaknesses are listed in the correctness part below.

Correctness: In terms of correctness, the authors did not describe how they split the training, testing, and evaluation sets. Relational models are normally expected to exhibit a certain out-of-distribution capability. I would like to request the authors' comments on this.

Clarity: This paper is generally very well written. The authors offer great detail about their logical flow when introducing the model. The method section is very informative and easy to follow.

Relation to Prior Work: I think this work covers most prior work, though there seems to be a skew in the perspective on the essence of RPM-style problems.

Reproducibility: Yes

Additional Feedback: I raised my rating because the authors showed significant effort in addressing my concerns. The design of using the VAE to reconstruct the candidate images, so that blurriness is added to the wrong answers as well, is intriguing. Even though the authors do not provide figures for it, I am willing to believe the blurriness affects reconstructed answers and generated answers equally. Also, the human study looks interesting; I highly recommend the authors include it in the revision. I would also like to see a discussion of weakness W1 in the revision, to make explicit how this model leverages supervision. This could help frame the method better and provide anchors for other colleagues in the community.