NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019 at Vancouver Convention Center
Paper ID: 1744
Title: Meta-Reinforced Synthetic Data for One-Shot Fine-Grained Visual Recognition

Reviewer 1

The results seem to improve over ProtoNet, which is essentially the only baseline used in the paper, despite the related work section mentioning many data augmentation papers. However, I could not quickly find equally comparable works that use ImageNet data for meta-learning on CUB (although I am not working in this field). Reproducibility looks okay, barring the fact that BigGAN is a conditional GAN and requires class labels. Strictly speaking, labels are not mandatory if the goal is only to find the BN parameters (which depend on the class label in BigGAN), but I imagine the initialization matters. I would appreciate clarifications about this aspect of the implementation. The writing is fairly clear, although Section 4 contains quite verbose notation. The idea is not particularly original; many have tried using GANs for data augmentation. But this implementation does work, even though the improvement is not ground-breaking.
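For context on the point about BN parameters depending on the class label: BigGAN conditions its generator through class-conditional batch normalization, where the per-channel scale and shift are predicted from a class embedding rather than being free parameters. A minimal numpy sketch of that mechanism (function and weight names are hypothetical, not from the paper):

```python
import numpy as np

def conditional_batchnorm(x, class_emb, W_gamma, W_beta, eps=1e-5):
    """Class-conditional BN in the BigGAN style: the per-channel scale
    and shift are linear functions of a class embedding.
    x: (N, C) activations; class_emb: (E,); W_gamma, W_beta: (E, C)."""
    gamma = 1.0 + class_emb @ W_gamma  # per-channel scale, offset from 1
    beta = class_emb @ W_beta          # per-channel shift
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

Since the class-specific behaviour lives in `gamma` and `beta`, one could in principle fit those without an explicit label, which is presumably the reviewer's point, but the class-conditional initialization would still shape the result.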

Reviewer 2

Paper Strengths: The authors tackle the important and challenging problem of few-shot fine-grained classification. The proposed approach is simple. Experimental evaluations demonstrate the effect of modifying GAN-generated images by combining them with real images at the patch level through meta-learning.
Paper Weaknesses:
1) The proposed approach can be viewed as a straightforward combination of an off-the-shelf GAN and [7], which learns to linearly fuse two images for data augmentation. The novelty of the proposed approach is somewhat limited. In addition, the connection with [7] is not fully discussed.
2) Since [7] is directly relevant to the proposed approach, it would be more convincing to show an experimental comparison with [7].
3) The technique in the proposed approach does not seem restricted to fine-grained recognition. It would be interesting to evaluate the approach on a standard generic few-shot image benchmark such as miniImageNet.
4) How is the performance if the GAN generator is not pre-trained but instead trained in an end-to-end manner?
5) How does the performance vary with the number of generated examples?
6) Why does the proposed approach work for few-shot learning? It looks like the GAN image is similar to the original real image, so their combination does not introduce diverse examples beyond the given real images.
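To make the patch-level combination concrete, here is a minimal numpy sketch of a blockwise linear mixture of a real and a synthetic image, with one fusing weight per grid cell (in the paper these weights would be meta-learned; the function and argument names here are hypothetical):

```python
import numpy as np

def fuse_patchwise(real, synth, w_grid):
    """Blockwise linear mixture of a real and a synthetic image.
    real, synth: (H, W, C) arrays; w_grid: (gh, gw) weights in [0, 1],
    one per grid cell. H and W are assumed divisible by gh and gw."""
    H, W, _ = real.shape
    gh, gw = w_grid.shape
    # Upsample the per-cell weights to full resolution (nearest neighbour)
    w = np.repeat(np.repeat(w_grid, H // gh, axis=0), W // gw, axis=1)
    w = w[..., None]  # broadcast over the channel axis
    return w * real + (1.0 - w) * synth
```

A cell weight of 1 keeps the real patch, 0 keeps the synthetic patch, and intermediate values blend the two, which is the "grid linear mixture" Reviewer 3 refers to via [6].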

Reviewer 3

Originality: 7 / 10 This is a novel paper that is well motivated and executed. Admittedly, none of its components is novel on its own: grid linear mixture for image augmentation [6], a meta-learned generator [35], the episodic procedure, and standard few-shot classifiers. Still, the proposed pipeline itself is new and provides the insight that end-to-end image augmentation is feasible with a strong generator initialization. Also, fine-tuning a GAN towards certain modalities (or observations) has not been informatively studied before. Figure 1 and its experiments could serve as a good reference for researchers who want to study image augmentation. ---------- Quality: 8 / 10 The experiments, as well as the pilot study, are in great shape and well considered. The reported performance gains are consistent and non-trivial compared to previous works. One thing to note is that the baselines are rather old: the most recent model in the comparison table is RelationNet [29] from CVPR 2018, and I believe there are still a lot of missing numbers out there. What would make this paper even better are experiments on ImageNet itself, to see how the method scales to a large benchmark, although I am not exactly sure whether BigGAN has seen the test set during its training. ---------- Clarity: 8 / 10 The writing is good and it was a great pleasure reading it. Notation is consistent, and the very few typos do not actually interrupt the flow. It is a pity that the authors have not included the code; I would strongly suggest open-sourcing it as early as possible, since people will be following this work. ---------- Significance: 8 / 10 It would be better if there were some intuition, or even an empirical study, on the correlation between the fusing weight and the real/synthesized image pairs. ---------- Other comments: (1) How would the augmented fused image embed in feature space, and how will it affect the final decision of the classification head? Since the paper primarily focuses on the one-shot learning task, such an illustration would be both informative and feasible. (2) L180, typo, inconsistent $\mathbf{I_g}$; change it to $\mathbf{I}_g$. (3) L114, logistic regression: is it a one-versus-all classifier for each category, with the most salient probability then selected among all candidates? There should be more descriptive text. (4) L141, $\mathbf{I}_z$: missing definition; this is the first time this notation is used, without any reference.