NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 479
Title: Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge

Reviewer 1

Quality: The paper is thorough in describing the method and in supporting it with experiments.

Clarity: The paper is well written and easy to follow.

Originality & Significance: Although the method is not very novel in light of Paper 1253: RecreateGAN (see more below), the experimental exploration of different settings of the method is thorough and interesting. The idea of matching local image features to word-level embeddings and global image features to sentence-level embeddings is intuitive and makes sense.

This paper shares significant parts of its method with Paper 1253: RecreateGAN; in particular, the textual-visual embedding loss in Eq. (6) of this paper matches the pairwise loss defined in Eq. (5) of the other paper. However, this paper uses that component as part of a different method, namely for textual-visual embedding rather than for an image-similarity embedding. Additionally, the cascades of attentional generators in the two papers are very similar. Although both papers can be seen as methods that translate something into an image (in this case text; in the other paper, an image) by using similar embedding methods to condition a cascade of attentional generators, the details of the methods and tasks still differ considerably, and it would not be possible to explain both systems in one 8-page paper. I find the text-to-image paper more compelling, as the task itself makes more sense to me, and the results are state of the art. However, as there are similarities, I am repeating some comments from my review of the other paper here:

1. The rationale for the attention-weighted similarity between two vectors is unclear to me. Since the loss in (6) is symmetric in w and l (and also symmetric in s and g), why not use a symmetric similarity measure? I would also appreciate an exploration of the effect of using a cosine similarity here.

2. I don't understand why the probability of w being matched by l is calculated over the minibatch in the formulation of (3). Why is it desirable to have p(l | w) depend on the other members of the batch?
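For concreteness, the minibatch-softmax formulation in question can be sketched as follows. This is a hypothetical stand-in in the style of DAMSM-like text-image matching losses, not the authors' code; the names (`match_score`, `p_l_given_w`) and the smoothing factor `gamma` are assumptions for illustration.

```python
import numpy as np

def match_score(w, l):
    """Attention-weighted similarity between word features w (T x D)
    and local image features l (R x D); a simple stand-in for the
    similarity used in Eq. (6)."""
    sim = w @ l.T                                    # (T, R) word-region similarities
    a = np.exp(sim - sim.max(axis=1, keepdims=True)) # stable softmax over regions
    attn = a / a.sum(axis=1, keepdims=True)
    return float((attn * sim).sum())                 # attention-weighted aggregate

def p_l_given_w(w_i, batch_l, gamma=5.0):
    """Softmax of match scores over the minibatch, as in Eq. (3):
    the probability that each image l_j in the batch matches sentence i.
    Note the result depends on the other batch members, which is
    exactly the point questioned above."""
    scores = np.array([match_score(w_i, l) for l in batch_l])
    e = np.exp(gamma * (scores - scores.max()))
    return e / e.sum()
```

The cosine-similarity variant suggested above would amount to L2-normalizing the rows of `w` and `l` before computing `sim`.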

Reviewer 2

I agree with the motivation of "learn, imagine and create". The whole model is intuitively simple and easy to implement, but I have a few concerns.

First, since the text-image encoder directly embeds the sentence, how is the model able to control individual words as presented in the experiments?

Second, as far as I know, there are a great many works on text-to-image generation, yet this paper only compares with AttGAN and lacks sufficient comparison with recent works.

Third, the usage of the concept "prior knowledge" is inappropriate. Prior knowledge usually refers to symbolic knowledge that is interpretable and can explain its reasoning, but this paper simply names the encoded text-image feature "prior knowledge", which is an overclaim. The claim about "mimicking human imagination" is likewise unconvincing, as it is just a fusion strategy. If we view this paper as a practical model, it is fine, but the authors use too many overclaimed phrases to try to highlight contributions that are not there.

Finally, ablation studies validating each key component are missing, which is not enough to support the contribution of this simple model.

Reviewer 3

This paper presents a novel text-to-image (T2I) generation method called LeicaGAN, which is inspired by the process humans follow to achieve the same goal. This process is modelled by decomposing the required tasks into three phases: the knowledge-learning phase, the imagination phase, and the creation phase. Semantic consistency and visual realism are achieved using adversarial learning and cascaded attentive generators.

The paper provides a good overview of the literature in the related works and a good motivation for the proposed algorithm. Despite the many concepts required to clearly describe the model in a few pages, the well-thought-out structure of the paper makes it easy to follow, and sufficient description is given to highlight the necessary elements. The metrics used (Inception Score and R-precision as objective measures, plus subjective visual inspection and human perceptual testing) are sufficient to establish the performance of the proposed algorithm and to compare it with existing methods. I am quite satisfied and impressed by the results obtained and the method used to get there. I believe this paper makes a significant contribution to the T2I field.

Some minor comments are as follows:
* The descriptions in the Text-Image Encoder paragraph in Section 3.1, line 123, are slightly cryptic. I think rewriting it to clearly present the pairwise-similarity part would greatly help the readability of this section.
* In Equation 7, just below line 141, the second term on the right side of the equation should be log(1 - D_modal(s_n)).
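For reference, assuming Eq. (7) follows the standard two-term GAN discriminator objective (with s a matched input and s_n a mismatched one, as suggested by the comment above; the exact arguments are the paper's), the corrected form would read:

```latex
\mathcal{L}_{D_{\text{modal}}} =
  -\,\mathbb{E}\big[\log D_{\text{modal}}(s)\big]
  \;-\;\mathbb{E}\big[\log\big(1 - D_{\text{modal}}(s_n)\big)\big]
```

The key point is only the second term: the fake/mismatched branch takes the form log(1 - D_modal(s_n)), not log(D_modal(s_n)).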