Reviews: Controllable Text-to-Image Generation

Reviewer 1

Positive Aspects
1. The paper is well organized and well written, and it is easy to follow.
2. The problem considered is interesting. In particular, instead of generating a new image from the text, the authors focus on image manipulation based on a modified natural language description.
3. The visual results seem very impressive: the method manipulates parts of images accurately according to the changed description while preserving the unchanged parts.

Negative Aspects
1. For the word-level spatial and channel-wise attention driven generator:
(1) The novelty and effectiveness of the attentional generator may be limited. Specifically, the paper designs a word-level spatial and channel-wise attention driven generator, which has two attention parts (i.e. channel-wise attention and spatial attention). However, since the spatial attention is based on the method in AttnGAN [7], most of the contribution may lie in the additional channel-wise part. But instead of verifying the impact of channel-wise attention on its own, the paper reports visual attention maps (Fig.5) for the combination of channel-wise and spatial attention. Moreover, the visualized maps in Fig.5 are similar to the visual results in AttnGAN [7] and thus cannot prove the effectiveness of the proposed channel-wise attention method.
(2) In Fig.6, it seems that the image generated without the attention mechanism (SPM only) also achieves the desired visual result (i.e., the generated image only changes the colour of the bird while keeping the rest). Thus, I am not sure whether the attention mechanism is really necessary for the proposed method.
2. For the word-level discriminator:
(1) In lines 132-133, the paper mentions that “Additionally, to further reduce the negative impact of less important words, we apply a word-level self-attention mechanism”. More experiments should be conducted to verify the influence of the additional self-attention. Otherwise, it is not demonstrated whether the self-attention mechanism is necessary for the discriminator or how it affects the final results.
(2) Moreover, to prove the effectiveness of the proposed word-level discriminator, it would be better to conduct an ablation study with and without this discriminator.
3. For the semantic preservation model (SPM): The proposed SPM is an application of the perceptual loss [24], so its novelty is also slight.
4. Minor issues: Line 231, “Fig.4 also shows the …” should be “Fig.5 also shows the …”.

Final remarks: I have mixed feelings about this paper. On the positive side, the studied problem is interesting and the generated results are impressive. On the other hand, since there are not sufficient experiments to verify the impact and necessity of each proposed component, it is not clear whether each component is significant for this task. Besides, the novelty of some of the proposed parts may be limited (e.g. SPM).
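
The ablations requested above can be organized as a grid of on/off switches over the components named in this review. The following is only a minimal sketch of such a grid. The component names are hypothetical labels taken from the review text, not identifiers from the authors' code, and the constraint that the self-attention can only be enabled together with the word-level discriminator is an assumption based on how the review describes it.

```python
from itertools import product

# Hypothetical labels for the components discussed in the review;
# these are NOT identifiers from the paper's implementation.
COMPONENTS = (
    "spatial_attention",
    "channel_attention",
    "word_level_discriminator",
    "discriminator_self_attention",
    "spm",
)


def ablation_configs():
    """Yield one dict of on/off flags per valid combination of components."""
    for flags in product((False, True), repeat=len(COMPONENTS)):
        config = dict(zip(COMPONENTS, flags))
        # Assumption: the self-attention is part of the word-level
        # discriminator, so it cannot be on when the discriminator is off.
        if config["discriminator_self_attention"] and not config["word_level_discriminator"]:
            continue
        yield config


def describe(config):
    """Return a short label listing the enabled components."""
    enabled = [name for name in COMPONENTS if config[name]]
    return "+".join(enabled) if enabled else "baseline"


configs = list(ablation_configs())
for config in configs:
    print(describe(config))
```

In practice a small subset of this grid would probably be enough, for example the full model plus one run with each component removed in turn.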

Reviewer 2

- How does the channel-wise attention module help in addition to the word-spatial attention? An ablation here would be useful to validate this architecture choice.
- In Equation 4, what is the interpretation of \gamma_i? Since the word embeddings come from a pre-trained bidirectional RNN, it is not clear why this number should correspond to a notion of the “importance” of a word in the sentence, and it is not clear to me what importance means here. Is my understanding correct that the objective in Equation 5 is for the model to maximize the correlation between words depicted in the image and the important words in the sentence?
- When the semantic preservation model is used, how does this affect the diversity of the model samples?

To my knowledge, the proposed word-level discriminator and channel-wise attention are novel. The paper is well written despite the few unclear points mentioned above. However, the significance is difficult to judge because the CUB dataset is somewhat saturated, the MS-COCO samples are not realistic enough to easily assess the controllability, and the text queries do not seem far enough out of sample to test the limits of the text controllability.

Reviewer 3

Paper summary: This paper proposes a novel text-to-image synthesis method focused on learning a representation that is sensitive to different visual attributes and offers better spatial control from the text. To achieve this goal, a novel generator with a word-level spatial and channel-wise attention model is adopted, based on the correlation between words and feature channels. To further encourage word-level control, a novel correlation loss defined between words and image regions is explicitly enforced during model learning. Experimental evaluations are conducted on benchmark datasets including CUB and COCO.

Overall, this paper is an interesting extension to text-to-image synthesis and multi-modal representation learning. In general, it is clearly written, with a reasonable formulation and cool qualitative results. However, the reviewer does have a few concerns regarding the current draft.
- The current title is too ambitious or not very precise, as text-to-image generation is often coupled with a learned controllable representation.
- What is the shape of the visual features v? As far as the reviewer can see, in Equation (1), v indicates the feature map of a single image channel; in Equation (2), v indicates the feature map within an image sub-region. It would be great if this could be clarified in the rebuttal.
- Equation (6): j is defined but not used.
- Table 1 and Figure 3: While the reviewer is impressed by the results on CUB, the results on COCO don’t look good. Also, it seems like the model only learns to be sensitive to color but not to shape or other factors.
- Figure 4: On CUB, two images generated from similar text (with one concept difference) look almost identical (e.g., same shape, similar background color). On COCO, it is clear that two images generated from similar text (with one concept difference) look a bit different (see the last four columns). The reviewer would like to know which loss function enforces this property. Why does the same loss completely fail on COCO?
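
On the question about the shape of v, one reading consistent with the description above is that v is a single C x N matrix (C channels, N = H*W regions) that is indexed by row in the channel-wise attention and by column in the spatial attention. The sketch below illustrates that reading only. It is not a reproduction of the paper's Equations (1) and (2); the function names, the assumption that word features are already projected to the matching dimension (`words_c`, `words_n`), and the toy values are all hypothetical.

```python
import math


def matmul(a, b):
    """Multiply an (n x k) list-of-lists by a (k x m) list-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Apply a softmax to each row, so that every row sums to 1."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def shape(a):
    return (len(a), len(a[0]))


def spatial_attention(v, words_c):
    """v: C x N, words_c: L x C (assumed projection of words to C dims).
    Each region, i.e. each COLUMN of v, attends over the L words."""
    regions = transpose(v)                          # N x C
    scores = matmul(regions, transpose(words_c))    # N x L
    attn = softmax_rows(scores)                     # N x L
    context = matmul(attn, words_c)                 # N x C
    return transpose(context), attn                 # C x N, N x L


def channel_attention(v, words_n):
    """v: C x N, words_n: L x N (assumed projection of words to N dims).
    Each channel, i.e. each ROW of v, attends over the L words."""
    scores = matmul(v, transpose(words_n))          # C x L
    attn = softmax_rows(scores)                     # C x L
    context = matmul(attn, words_n)                 # C x N
    return context, attn                            # C x N, C x L


# Toy sizes: C = 3 channels, N = 4 regions (a 2 x 2 map), L = 2 words.
C, N, L = 3, 4, 2
v = [[0.1 * (c + n) for n in range(N)] for c in range(C)]
words_c = [[0.5, -0.5, 0.0], [-0.5, 0.5, 1.0]]
words_n = [[1.0, 0.0, -1.0, 0.5], [0.0, 1.0, 0.5, -1.0]]

spatial_out, spatial_attn = spatial_attention(v, words_c)
channel_out, channel_attn = channel_attention(v, words_n)
print(shape(spatial_out), shape(spatial_attn))
print(shape(channel_out), shape(channel_attn))
```

Under this reading both branches return a C x N map, which is why they could be combined, while the attention weights themselves have different shapes (one row per region versus one row per channel).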

Paper ID:	1221
Title:	Controllable Text-to-Image Generation
