NeurIPS 2020

RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning


Review 1

Summary and Contributions: This paper takes a systematic look at continual learning of LSTM-based models for image captioning. It adapts continual learning approaches based on weight regularization and knowledge distillation to image captioning. In addition, it proposes an attention-based approach called Recurrent Attention to Transient Tasks (RATT). The approach is evaluated on incremental image captioning using two new continual learning benchmarks based on MS-COCO and Flickr30k, showing no forgetting of previously learned tasks.

Strengths:
1. This paper is the first work to focus on continual learning of recurrent models applied to image captioning.
2. The proposed approach, RATT, is able to sequentially learn five captioning tasks while incurring no forgetting of previously learned ones, which is supported by experiments on two new continual learning benchmarks defined on the MS-COCO and Flickr30k datasets.
3. The experimental results in Tables 2 and 3 show that the method is effective; the forgetting rate drops significantly.

Weaknesses:
1. Overall, the paper lacks sufficient innovation; the method mainly adapts HAT [1] to recurrent networks for image captioning.
2. There are many excellent image captioning models, but the baseline used with the method is quite simple, so the experimental results are not fully convincing.

[1] Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International Conference on Machine Learning (ICML), 2018.

Correctness: Yes. The authors provide detailed comparative experiments.

Clarity: The paper is mostly well written, although it contains many trivial formulas, such as (6)-(11), which are the standard LSTM equations. The structure could also be improved: the related work section, which is closely related to Section 2, should come before Section 2.

Relation to Prior Work: Mostly yes. The related work on catastrophic forgetting is well covered, but the related work on image captioning is very limited.

Reproducibility: Yes

Additional Feedback:


Review 2

Summary and Contributions: The paper studies continual learning of recurrent networks on the image captioning task. It proposes the novel Recurrent Attention to Transient Tasks (RATT) method, inspired by previous attention-based continual learning approaches. Experiments on the COCO and Flickr30K datasets show impressive results, with RATT achieving almost zero (or even negative) forgetting.

Strengths: The paper is one of the first to study continual learning in recurrent settings and shows promising performance on the image captioning task. It proposes RATT, a novel approach for recurrent continual learning based on attentional masking, inspired by the earlier HAT method. The proposed method introduces three masks (a_x, a_h, and a_s) applied to the word embedding, the hidden state, and the vocabulary, and the ablation study shows that all three components contribute to the final continual learning performance. In addition to the proposed approach, the paper also explores adapting weight regularization and knowledge distillation-based approaches to the recurrent continual learning problem. In its experiments, the paper shows strong results, largely outperforming simple baselines (such as fine-tuning) and previous regularization- or distillation-based approaches (EWC and LwF). It achieves almost zero forgetting on COCO and negative forgetting on Flickr30K, which is very impressive.
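To make the masking idea discussed above concrete, here is a minimal sketch of how task-conditioned attention masks in the spirit of HAT/RATT could gate an LSTM captioner's input embedding, hidden state, and vocabulary logits. This is not the authors' implementation: the class name `TaskMaskedLSTMCell`, the layer sizes, and the wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TaskMaskedLSTMCell(nn.Module):
    """Illustrative sketch: task-conditioned masks a_x, a_h, a_s gate the
    word embedding, the LSTM hidden state, and the vocabulary logits.
    Hyperparameters and wiring are assumptions, not the paper's exact setup."""

    def __init__(self, n_tasks, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.LSTMCell(emb_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)
        # Learnable task embeddings from which the three masks are derived.
        self.e_x = nn.Embedding(n_tasks, emb_dim)
        self.e_h = nn.Embedding(n_tasks, hid_dim)
        self.e_s = nn.Embedding(n_tasks, vocab_size)

    def forward(self, word_ids, state, task_id, s=400.0):
        # Gated sigmoid masks; s is a scale that would be annealed as in HAT.
        t = torch.tensor([task_id])
        a_x = torch.sigmoid(s * self.e_x(t))   # mask on the input embedding
        a_h = torch.sigmoid(s * self.e_h(t))   # mask on the hidden state
        a_s = torch.sigmoid(s * self.e_s(t))   # mask on the vocabulary
        x = self.embed(word_ids) * a_x
        h, c = self.cell(x, state)
        h = h * a_h
        logits = self.out(h) * a_s
        return logits, (h, c)
```

As in HAT, training on task t would additionally block gradient flow through weights already claimed by earlier tasks in proportion to the accumulated masks, and the scale s would be annealed within each epoch; those details are omitted from this sketch.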

Weaknesses: I do not think the paper has any major weaknesses. However, it is widely known that automatic metrics like BLEU and CIDEr often do not align well with human judgments of caption quality, so it would be better if the paper could include a human evaluation of the generated samples. In addition, the captioning model used in this work is very simple compared to the state of the art in image captioning. It would be better if the paper could show that the proposed RATT method generalizes well to more recent captioning methods such as AoA [A].

[A] Huang, Lun, et al. Attention on attention for image captioning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.

Correctness: The paper's claims and its method are correct.

Clarity: The paper is well-written.

Relation to Prior Work: Yes, the paper clearly discusses (and distinguishes itself from) previous work on continual learning.

Reproducibility: Yes

Additional Feedback: After reading the authors' response, I am convinced by the authors' claims and would recommend accepting the paper. Regarding the novelty of this paper, first, since there is no previous work on continual learning in the image captioning domain, explorations in this direction should constitute a novelty by itself. Second, while the proposed methodology is inspired by HAT, it involves non-trivial adaptation of HAT to the image captioning task, such as introducing three different masks (a_x, a_h, and a_s) to embedding, hidden state, and vocabulary. The paper also provided detailed analyses on these aspects. Hence, I believe the paper is novel in both the problem it addresses and its methodology.


Review 3

Summary and Contributions: This paper presents a new method, RATT, for continual learning of LSTM-based image captioning models. It defines two new continual learning benchmarks and achieves promising results on them.

Strengths:
1. This paper is the first to extend the traditional continual learning problem to image captioning and creates two new task splits based on COCO and Flickr30K with thorough evaluation.
2. The empirical evaluation shows that RATT is effective at preventing catastrophic forgetting compared to other recurrent continual learning methods, even achieving zero or near-zero forgetting.
3. The paper has a clear structure and is well organized.

Weaknesses:
1. For the task split on COCO, can the authors provide the percentage of overlapping words between different tasks? It is meaningful to know how disjoint the tasks are (a minimal sketch of how such a statistic could be computed is given after this list).
2. In Table 2, the proposed RATT achieves nearly zero forgetting. Since the tasks share common words, can the authors explain more clearly why training on a new task does not influence the old tasks (are the trainable weights for the parts shared between tasks fixed?)?
3. What are the limitations or failure cases of the proposed method?
4. The baseline (Show and Tell) was proposed about five years ago and its performance is far behind current SOTA LSTM-based models, which deserves a mention or comparison, e.g., [a, b, c]. Moreover, more and more methods use Transformers [c, d] instead of LSTMs for captioning. The related work on image captioning should be more complete and up to date.

[a] Bottom-up and top-down attention for image captioning and visual question answering. CVPR, 2018.
[b] Regularizing RNNs for caption generation by reconstructing the past with the present. CVPR, 2018.
[c] Reflective Decoding Network for Image Captioning. ICCV, 2019.
[d] Meshed-Memory Transformer for Image Captioning. CVPR, 2020.
[e] Image captioning: Transforming objects into words. NeurIPS, 2019.
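As an illustration of the overlap statistic requested in point 1, the following is a minimal sketch; the function name `vocab_overlap` and the dictionary layout are hypothetical, and real vocabularies would be extracted from the COCO/Flickr30K captions of each task split.

```python
def vocab_overlap(task_vocabs):
    """Pairwise percentage of shared words between task vocabularies.

    task_vocabs: dict mapping task name -> set of words (illustrative layout).
    Returns a dict mapping (task_a, task_b) -> overlap percentage relative
    to the smaller of the two vocabularies.
    """
    overlaps = {}
    names = sorted(task_vocabs)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = task_vocabs[a] & task_vocabs[b]
            denom = min(len(task_vocabs[a]), len(task_vocabs[b]))
            overlaps[(a, b)] = 100.0 * len(shared) / max(denom, 1)
    return overlaps

# Toy example (real splits would come from the benchmark's caption vocabularies):
print(vocab_overlap({
    "transport": {"a", "bus", "on", "road"},
    "animals":   {"a", "dog", "on", "grass"},
    "sports":    {"a", "man", "playing", "tennis"},
}))
```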

Correctness: The method based on attention masks is reasonable, and the empirical analysis is adequate.

Clarity: Yes. This paper is well organized and well written.

Relation to Prior Work: Yes, the catastrophic forgetting part of the related work section provides a clear discussion.

Reproducibility: Yes

Additional Feedback:
1. Please consider a clearer analysis/explanation of the zero forgetting, and of why LwF performs much better on Flickr30K than on COCO.
2. Please consider mentioning or discussing recent SOTA captioning methods to complete the related work section. It would be better if the authors could report the performance of their method with more recent captioning models that also use LSTMs.
3. The best Interior score on CIDEr in Table 2 belongs to FT; should the word "increase" in line 280 be replaced with "decrease"?