NIPS 2018
Sun Dec 2nd through Sat the 8th, 2018 at Palais des Congrès de Montréal
Paper ID:210
Title:Image Inpainting via Generative Multi-column Convolutional Neural Networks

Reviewer 1

Summary
This submission tackles the problem of image inpainting. It adapts the idea of multi-scale prediction by running three branches that predict features at different scales; these are concatenated before two convolutions produce the final prediction. The loss consists of three components: the relatively common L1 and local+global discriminator losses, and a novel implicitly diversified MRF (ID-MRF) loss that correlates feature patches between the ground truth and the prediction. The method is evaluated on five diverse datasets and with several ablation studies analyzing the influence of the different parts.

Strengths
- The ablation study shows the contribution of the different components well. I am quite surprised by the added detail from the increased receptive field.
- The method shows impressive image quality on a wide range of datasets.
- The paper is technically sound, well structured, and easy to follow.
- The fact that the expensive MRF optimization can be pushed to training and is not needed for evaluation is elegant and makes the model much more usable during inference.

Weaknesses
- Since it is difficult to evaluate image generation tasks quantitatively, a human study could be performed where users select their preferred generated image. This evaluation could help show the increased visual quality with respect to other methods.
- Since Equation 5 is applied multiple times, it should contain some notion of iterations (e.g., M_w^{(i)}); otherwise M_w appears on both sides of the equation.
- An interesting evaluation would be correlating reconstruction quality with the RS measure. This could show that the loss actually helps in finding the right correspondences between patches.

Minor Comments
- L. 82: {1, 2, …, n}
- The type of upsampling used for the feature maps before concatenation is not mentioned.
- Since the ID-MRF is only used during training, it would be interesting to report training times, or even time per image during training compared to inference.
- Potential typo in Equation (4): L_M(conv4_2) appears twice.
- The intuitive example for L_M from the supplementary material could be included in the paper. I found it easier to follow than the explanation in lines 117-128.
- L. 224: "our method is still difficult to deal with large-scale" -> "still has difficulties dealing with […]"

Rating & Evaluation
Given the above strengths and weaknesses, I am in favor of accepting this submission. A user study and a more in-depth analysis of the proposed MRF loss could make this paper even stronger.

--

After reading the other reviews and the rebuttal, I find most of my concerns addressed. The authors provide several additional quantitative results, including a user study with a convincing outcome. I vote in favor of accepting the paper.
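To make the RS discussion above concrete, here is a rough NumPy sketch of the relative-similarity idea behind the ID-MRF loss: each generated feature patch is scored against the ground-truth patches by cosine similarity, and each score is normalized by the strongest competing match so that correspondences are implicitly diversified. The function name, shapes, and bandwidth `h` are assumptions for illustration; the paper's exact formulation differs in details (e.g., excluding the matched patch itself from the normalizing maximum).

```python
import numpy as np

def relative_similarity(gen_patches, gt_patches, h=0.5, eps=1e-5):
    """Sketch of an ID-MRF-style relative similarity (RS) between feature
    patches. gen_patches: (n_gen, dim), gt_patches: (n_gt, dim)."""
    # cosine similarity between every generated and every ground-truth patch
    g = gen_patches / (np.linalg.norm(gen_patches, axis=1, keepdims=True) + eps)
    t = gt_patches / (np.linalg.norm(gt_patches, axis=1, keepdims=True) + eps)
    mu = g @ t.T                              # raw similarities, (n_gen, n_gt)
    # divide each similarity by the strongest competing claim on that
    # ground-truth patch, then exponentiate with bandwidth h
    best_alt = mu.max(axis=0, keepdims=True)  # (1, n_gt)
    rs = np.exp((mu / (best_alt + eps)) / h)
    # normalize over ground-truth patches so each row is a soft assignment
    return rs / rs.sum(axis=1, keepdims=True)
```

Correlating this quantity with reconstruction quality, as suggested above, would test whether high RS values indeed coincide with correct patch correspondences.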

Reviewer 2

The paper proposes a model and training regime for the task of image inpainting. Experiments are run on various datasets, with limited qualitative and quantitative results presented. There are a number of issues with the paper that result in my vote to reject it.

First, the experimental results are insufficient. The qualitative results are too few for the reader to make an informed assessment of the model (this is a consequence of limited space, but nevertheless a problem). In addition, quantitative results are only included for the full model on one dataset. Table 1 should be expanded to include all datasets as well as numerical results from the ablation analysis.

Second, the change to a multi-column model is incremental, and it is unclear whether it provides an improvement. The ablation analysis does not provide any quantitative results, nor is there sufficient exploration of alternative model architectures to draw a conclusion.

Third, it is unclear to me why the nearest-neighbor-based regularization term is called an MRF. I don't see where the MRF comes into things, given that in Equation (1), for any pairwise term, one of the terms is always observed.

Last, the writing is very disjoint and often unclear. For example, what does it mean to "keep multi-modality" (L37)? And what is Y on L87 (presumably it is the original image, i.e., X = M * Y where * is the Hadamard product)?

Other suggestions for improving the writing/correcting typos:
1. "patch-wised" -> "patch-wise"
2. "Exemplar" (L51) -> "Examples", since you use "Exemplar-based" inpainting to mean something completely different
3. "zero value is filled" -> "Unknown regions of image X are filled with zeros."
4. "Wasserstain" -> "Wasserstein"
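The masking convention queried above (X = M * Y with unknown pixels zero-filled) can be sketched in a few lines of NumPy. The shapes and mask region here are hypothetical; only the Hadamard-product convention is the point.

```python
import numpy as np

Y = np.random.rand(4, 4, 3)   # original (complete) image
M = np.ones((4, 4, 1))        # binary mask: 1 = known pixel, 0 = missing
M[1:3, 1:3, :] = 0            # carve out an unknown region to inpaint
X = M * Y                     # network input: Hadamard product, holes are zeros
```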

Reviewer 3

The paper proposes a deep architecture for image inpainting that incorporates parallel convolutional branches and is trained using an adversarial loss, a proposed spatially-weighted reconstruction loss, and an MRF regularization term. Qualitatively, the results seem clearly better than the baselines.

As usual with generative models, it is difficult to evaluate the results quantitatively. There is an ablation study, which would normally be very useful, but since the evaluation is purely qualitative, it is difficult to draw conclusions.

The authors put forward the advantages of the proposed parallel-branches architecture. It may be useful to also compare with an inception-style model, where the branches are concatenated after every module and not, as here, only at the end. How many parameters does the network have? Is there any particular reason for using 3 branches?
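The wiring difference suggested above can be sketched as follows. This is a wiring-only illustration: `module` is a hypothetical stand-in (a channel-preserving moving average with window `k`) for a real conv branch, and the kernel sizes and depth are assumptions, not the paper's configuration.

```python
import numpy as np

def module(x, k):
    """Stand-in for one branch module with receptive field k.
    x: (channels, width); returns the same shape."""
    pad = np.pad(x, ((0, 0), (k // 2, k // 2)), mode="edge")
    return np.stack([pad[:, i:i + x.shape[1]] for i in range(k)]).mean(axis=0)

def multi_column(x, kernels=(3, 5, 7)):
    # Wiring as in the reviewed paper: branches run independently
    # and are concatenated once, at the end.
    return np.concatenate([module(x, k) for k in kernels], axis=0)

def inception_style(x, kernels=(3, 5, 7), depth=3):
    # Inception-style wiring: branch outputs are merged after every module,
    # so each stage sees the mixed multi-scale features of the previous one
    # (channel count grows by len(kernels) per stage in this toy version).
    for _ in range(depth):
        x = np.concatenate([module(x, k) for k in kernels], axis=0)
    return x
```

A real comparison would of course use learned conv layers and channel-reducing projections after each merge; the sketch only shows where the concatenation happens in each design.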