Summary and Contributions: Method for training a visual dialog system with only VQA data and not dialog data. Experiments confirm effectiveness of approach.
Strengths: The idea of the DwD task is interesting, novel, and compelling. The novel architecture encourages appropriate generalization from VQA training to dialog adaptation without dialog training data. Experiments are quite comprehensive, including a variety of automated metrics, human evaluation, and ablations of the method.
Weaknesses: The model seems quite complicated, perhaps overly so, but the ablation experiments illustrate the importance of several aspects of this complex model. Why use the two-year-old bottom-up top-down VQA model? It seems quite dated now; a more recent multimodal transformer such as ViLBERT, LXMERT, or UNITER would be more appropriate. Wouldn't a more modern model help produce better results overall?
Correctness: The paper seems technically sound, and the experimental methodology is reasonable, including a variety of automated and human evaluation metrics and ablations. My biggest question about the paper is: why are there no experiments directly evaluating Q-Bot's ability to engage an actual human user in a complete dialog, i.e., its ability to find a hidden image given to the user? This sort of full user study on the end-to-end task would be the most compelling evaluation and would greatly improve the quality of the paper. It is not clear why it is missing.
Clarity: The paper is fairly clear and well written, but it tries to cram a lot of information into a short paper, leaving many details for the appendix. I feel a paper that focuses the evaluation more on a final human user study, as outlined above, would be better, leaving some of the other experiments for the appendix.
Relation to Prior Work: Good coverage of related work.
Summary and Contributions: The paper addresses the problem of learning one type of task and adapting it to another task. Specifically, the paper focuses on learning to perform visual dialogue without training on dialogue data. Using VQA data, the model learns to ask discriminative questions; question intent is then used to generate new questions in the dialogue. The evaluation uses automatic metrics for both task success and language drift, and human evaluation is also performed to check language drift.
Strengths:
+ The paper is well written and understandable.
+ Using VQA data to pre-train the dialogue model is an interesting approach.
+ Having a "Not Relevant" answer in A-Bot could help Q-Bot's planner filter out non-relevant questions.
Weaknesses:
+ The main problem with the paper is the game design. In visual dialogue, i.e., the GuessWhich game, Q-Bot does not have access to the image; it has to build up a visual representation from the caption and the dialogue. That is why the caption is important for the GuessWhich game (L69). In the proposed game, by contrast, Q-Bot has constant access to the images, so it only needs to ask questions that distinguish one image from the others: first questions that narrow down the type of image (bird vs. car), then questions that differentiate among similar images of the same type (say, birds). This can be observed in the questions generated in Figure 3.
+ Since the model is trained on VQA data to distinguish images, it only learns to ask discriminative questions. However, dialogue is not always discriminative; it can also involve clarifications, follow-up questions, etc. The model is therefore not learning these skills, focusing on only one type of skill.
+ The way distractor images are selected is not systematic. Because of random selection, some games might be very easy to perform, requiring only the ability to distinguish between contrasting VQA images. For VQA (i.e., MS-COCO) images, object categories, super-categories, etc. could be used to create systematic and challenging distractor sets. See "Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog" for some suggestions.
+ As we can see from Table 1 (B3 vs. C3), having random distractors significantly improves performance, so having challenging distractors should have an even greater effect on performance.
+ A deeper analysis of the distractors would help; e.g., in the case of CUB, does having bird images as targets make the task more challenging?
+ Results are not clearly explained in the text. For example, L266 claims "Longer dialogs (more rounds) achieve better accuracy", but this is not clearly supported. It would have been better to show results for 9 distractors with different numbers of dialogue rounds across all measures, e.g., as a graph.
+ Make it clear how the target image is selected for A-Bot (L62-63).
+ Dataset splits are missing: is the paper using the same splits as GuessWhich? Please provide details.
+ For language analysis, see "The Devil is in the Details: A Magnifying Glass for the GuessWhich Visual Dialogue Game," which measures lexical diversity (similar to language-diversity), question diversity, etc.
+ The Table 1 caption states "Our method strikes a balance between guessing game performance and interpretability." Please clarify how interpretability is improved.
+ In Fig. 3, provide the Q4 answer and the image guessed by each model.
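The category-based distractor selection suggested above could look roughly like the following. This is a minimal hypothetical sketch, assuming COCO-style images carry a `category` field; the function name, keys, and `hard_frac` parameter are all illustrative, not from the paper.

```python
import random

def sample_distractors(target, pool, n, hard_frac=0.5, rng=random):
    """Sample n distractors for a target image (a dict with a 'category' key).

    A fraction `hard_frac` is drawn from the target's own category
    (hard distractors); the rest come from other categories (easy ones).
    """
    same = [img for img in pool
            if img is not target and img["category"] == target["category"]]
    diff = [img for img in pool if img["category"] != target["category"]]
    n_hard = min(int(n * hard_frac), len(same))
    picks = rng.sample(same, n_hard) + rng.sample(diff, n - n_hard)
    rng.shuffle(picks)
    return picks
```

Raising `hard_frac` would yield systematically harder games, which would let the authors quantify the Table 1 effect the reviewer predicts.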
Correctness: Looks okay.
Clarity: Well written paper. Need some improvement in result presentation and explanation.
Relation to Prior Work: Yes
Summary and Contributions: This paper presents a factorized training approach to training a question-generating agent for a GuessWhich?-style image guessing game that first pretrains on single question-answer pairs and then fine-tunes on visual dialog data. The proposed approach maintains higher language fluency and relevance than standard fine-tuning (which exhibits language drift towards agent-agent 'neuralese') while achieving higher accuracy than zero-shot transfer (which has high fluency but does not know how to sequence questions for dialog). Interestingly, the method also outperforms typical transfer on task accuracy in most tested settings across two datasets and various game and dialog parameter settings. **Post-rebuttal:** the rebuttal addresses questions and concerns from other reviewers and suggests the additional space on acceptance can be partially used to round out details currently relegated to the supp. I continue to feel the paper should be accepted.
Strengths: The use of human evaluations to decisively measure language intelligibility is a major strength of the paper's evaluation. It would be great to see it elevated into a row of a results table.
Weaknesses: The paper leans heavily on prior work in VQA, GuessWhich?, etc., and so many implementation details are referred to as "standard" or detailed only in the supplement. This may limit the accessibility of the contribution, and limits its scope without the inclusion of a broader discussion of where this kind of approach could succeed outside of the small game setting presented.
Correctness: Yes, the evaluation is thorough.
Clarity: Yes, the paper is easy to follow and the contribution is clear.
Relation to Prior Work: Yes, the paper situates itself well with respect to existing work in VQA and image selection games.
Additional Feedback: For adding a "not relevant" prediction, from my understanding, a single randomly sampled question per image is added to the training set with "not relevant" as the answer. Doesn't this add a strange latent prior on the frequency of "not relevant" that is a function of the average number of training questions available per image? Anyway, it's neat that this seemed to just work out of the box like that. It's interesting that this method tends to have higher accuracy (Table 1) than typical transfer. Is the transfer baseline crippled somehow, or is this just explained by better generalization to the test data? Seeing final training numbers here might help tease that out, either in the main table or in the supplement with a small explanation. Line 337 makes a good point in a really hand-wavy fashion. Being more concrete here would be helpful, since as written it sort of excuses the models and doesn't explain why these issues arise (e.g., sample bias versus color-space representation bias).
Nits:
- Line 3: missing word "a"
- Figure 3: misalignment in row 1, box "Ours", on the final A3/P3
- Line 271: lowercased "a-bot"
- Line 273: the n-dash should be a comma or the word "and"
- Citations mix arXiv-style and CoRR-style entries for arXiv papers. Some citations are listed as arXiv-only when they have since been published (e.g., one appeared at ACL'20, which happened after this article was submitted, so references need to be updated accordingly; another appeared at ICML'18, but that isn't reflected here; etc.).
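The "not relevant" augmentation as I understand it can be sketched as below. This is a hypothetical reconstruction of the recipe described in my first paragraph, not the paper's exact code; all names are illustrative. Note how the share of "not relevant" answers per image is 1/(q+1) where q is that image's question count, which is the latent prior I am asking about.

```python
import random

def add_not_relevant(qa_pairs, rng=random):
    """Augment VQA triples with one 'not relevant' example per image.

    qa_pairs: list of (image_id, question, answer) tuples.
    For each image, one question sampled from a *different* image is
    paired with the answer "not relevant".
    """
    images = sorted({img for img, _, _ in qa_pairs})
    augmented = list(qa_pairs)
    for img in images:
        # questions that were asked about some other image
        foreign = [q for other, q, _ in qa_pairs if other != img]
        if foreign:
            augmented.append((img, rng.choice(foreign), "not relevant"))
    return augmented
```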
Summary and Contributions: This paper tackles the problem of language drift when optimizing chat bots in task-oriented dialogues based on task completion scores. The proposed approach decouples text generation from policy learning. Specifically, the policy is a variational auto-encoder with discrete states. The text generator converts discrete states to natural language; it is pre-trained and fixed during policy learning to ensure that the language does not drift away from natural language. The approach is tested on an image guessing game. ========================== Thanks for the response and comment on VAE with discrete spaces.
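A toy sketch of the decoupling described above, as I read it: the policy picks a discrete latent state, a pre-trained and frozen generator maps that state to text, and reward only reshapes the state distribution, so the emitted language cannot drift. Everything here (the tabular policy, the lookup-table generator) is a deliberate over-simplification for illustration, not the paper's model.

```python
import math
import random

FROZEN_GENERATOR = {  # pre-trained state -> question mapping, never updated
    0: "what color is it?",
    1: "is it an animal?",
    2: "is it indoors?",
}

class DiscretePolicy:
    def __init__(self, n_states):
        self.prefs = [0.0] * n_states  # preference score per discrete state

    def act(self, rng=random):
        # sample a state with probability proportional to exp(preference)
        weights = [math.exp(p) for p in self.prefs]
        return rng.choices(range(len(weights)), weights=weights)[0]

    def update(self, state, reward, lr=0.1):
        # reward updates the policy over states only; the generator is untouched
        self.prefs[state] += lr * reward

def speak(policy, rng=random):
    z = policy.act(rng)
    return z, FROZEN_GENERATOR[z]
```

Since `FROZEN_GENERATOR` never changes, every utterance stays in the pre-trained language no matter how the reward shapes `prefs`, which is the core of the drift argument.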
Strengths: This work tackles a challenging problem in RL-based text generation. Typically, when we optimize the reward, the objective does not necessarily limit the generation to natural language, which may produce a model that speaks in "bot language". I like the proposed approach in that it decouples text generation from policy learning in an elegant way through discrete policy states and the pre-trained text generator, such that the text generation part can be left untouched during policy learning.
Weaknesses: I think the proposed approach might have two potential problems in practice, and I would like to see more discussion on how they are handled in the image guessing setting. First, VAEs often have the mode collapse problem, where the latent structure is ignored. Did that happen here? If not, do the learned states have any task-specific meaning? Second, during policy learning the latent states are likely to deviate from their pre-training distribution, which may cause problems for the text generator (which is fixed during policy learning). Is the text generator robust to distribution shift of the latent states?
Correctness: Yes, the approach makes sense to me, although I'd like to see more explanation on potential pitfalls in practice.
Clarity: The paper is quite clear and easy to read.
Relation to Prior Work: Yes.
Additional Feedback: I'm surprised to see that Typical Transfer has low accuracy in quite a few settings. It is prone to language drift, but it should still be able to optimize the reward, right? I also think an important baseline is missing. One standard way to alleviate language drift in policy learning is to interpolate it with MLE training; here, Typical Transfer could be interpolated with MLE training on VQA (the same dataset used in pre-training). There will be a trade-off between optimizing the reward and avoiding language drift, but this is commonly used in practice and should be compared against.
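The interpolated baseline I am requesting amounts to a mixed objective along these lines. A minimal sketch; the function name and the weighting scheme are my own, hypothetical choices.

```python
def interpolated_loss(rl_loss, mle_loss, alpha=0.5):
    """Mix the policy-gradient objective with supervised MLE.

    Keeping a weighted MLE term on the pre-training data (here, VQA)
    in the loss anchors the generator to natural language while the
    reward is optimized. `alpha` controls the trade-off: alpha=1.0
    recovers pure reward optimization, alpha=0.0 pure MLE.
    """
    return alpha * rl_loss + (1.0 - alpha) * mle_loss
```

Sweeping `alpha` would trace the trade-off curve between task accuracy and language drift, giving a direct point of comparison for the proposed method.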