NeurIPS 2020

Look-ahead Meta Learning for Continual Learning

Review 1

Summary and Contributions: The authors propose an extension of MAML for continual learning. Inspired by lookahead search, the proposed method meta-learns a learning rate associated with each model parameter. The proposed method is evaluated on two image classification datasets and achieves noticeable performance improvements compared with prior work. Update: After reading the rebuttal, I still recommend acceptance. However, I still suggest that the authors include more recent meta-learning baselines and improve the figures (e.g. font sizes and visualizations).
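To make the idea of a meta-learned, per-parameter learning rate concrete, here is a minimal sketch on a toy quadratic loss. It is not the authors' algorithm: the function names, the single inner step, the quadratic losses, and the step size `eta` are all assumptions chosen for illustration. It only shows that, for one inner step, the gradient of the meta loss with respect to each learning rate is the negative product of the meta gradient and the inner gradient for that parameter.

```python
# Toy sketch (assumed setup, not the paper's method): one inner step with
# per-parameter learning rates, then a gradient step on the learning rates.

def grad_quadratic(theta, target):
    """Gradient of 0.5 * sum((theta - target)^2) with respect to theta."""
    return [t - c for t, c in zip(theta, target)]


def inner_step(theta, alpha, grad):
    """One gradient step where parameter i uses its own learning rate alpha[i]."""
    return [t - a * g for t, a, g in zip(theta, alpha, grad)]


def lr_hypergradient(meta_grad, inner_grad):
    """Gradient of the meta loss with respect to each alpha[i], one inner step.

    Because theta_fast[i] = theta[i] - alpha[i] * inner_grad[i], the chain
    rule gives d(meta loss)/d(alpha[i]) = -meta_grad[i] * inner_grad[i].
    """
    return [-m * g for m, g in zip(meta_grad, inner_grad)]


def meta_update_alpha(theta, alpha, inner_target, meta_target, eta):
    """Return learning rates after one hypergradient step of size eta."""
    g_in = grad_quadratic(theta, inner_target)         # gradient on the inner loss
    theta_fast = inner_step(theta, alpha, g_in)        # parameters after the inner step
    g_meta = grad_quadratic(theta_fast, meta_target)   # gradient on the meta loss
    h = lr_hypergradient(g_meta, g_in)
    return [a - eta * hi for a, hi in zip(alpha, h)]


new_alpha = meta_update_alpha(
    theta=[0.0, 0.0],
    alpha=[0.1, 0.1],
    inner_target=[1.0, 1.0],    # the inner loss pulls both parameters up
    meta_target=[1.0, -1.0],    # the meta loss agrees on parameter 0 only
    eta=0.5,
)
```

In this toy case the learning rate of the first parameter grows, because its inner and meta gradients agree, while the learning rate of the second parameter shrinks, because they conflict.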

Strengths: - The idea of incorporating the modulation of learning rates in meta-learning is reasonable and interesting. - Experimental results for both single-pass and multiple-pass scenarios are reported, and the proposed method achieves performance improvements in both setups. - Comparisons with prior work are thoroughly discussed in the main paper and the supplementary material. - The paper is clearly written and easy to follow.

Weaknesses: - It would be better to also evaluate how the model's performance changes as it sees more tasks during training. - Other, more recent meta-learning baselines such as PEARL should also be included in the experiments. - I am curious to see whether this kind of method can be applied to reinforcement learning problems, where training can be less robust and stable, and still achieve reasonable performance. - In Table 3, the evaluation is conducted for only 3 different seeds. It would be better to have larger-scale experiments.
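The first point above, tracking performance as more tasks are seen, is commonly reported as a curve of average accuracy over the tasks seen so far. The following is a small sketch of that bookkeeping. The function name and the accuracy matrix layout are assumptions, and the numbers are made up purely for illustration; they are not results from the paper.

```python
def average_accuracy_curve(acc):
    """Mean accuracy over seen tasks, measured after each training stage.

    acc[i][j] is the accuracy on task j after training on task i.
    Only entries with j <= i (tasks seen so far) are averaged.
    """
    curve = []
    for i, row in enumerate(acc):
        seen = row[: i + 1]
        curve.append(sum(seen) / len(seen))
    return curve


# Made-up illustrative numbers for three tasks (not from the paper).
toy_acc = [
    [0.9, 0.0, 0.0],
    [0.8, 0.9, 0.0],
    [0.6, 0.7, 0.9],
]
curve = average_accuracy_curve(toy_acc)
```

A curve that drops steeply as tasks are added indicates forgetting, which a single end-of-training number cannot show.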

Correctness: Yes. No obvious problems were found.

Clarity: The paper is clearly written and easy to follow. The connection between the proposed method and lookahead search could be explained better, since the term appears in both the title and the algorithm name.

Relation to Prior Work: Yes. Comparisons with prior work are thoroughly discussed in the main paper and the supplementary material.

Reproducibility: Yes

Additional Feedback: It would be better to compare with more recent meta-learning baselines and compare with baselines in the reinforcement learning setup.

Review 2

Summary and Contributions: This paper proposes Look-ahead MAML (La-MAML) for online continual learning. This method uses per-parameter learning rates for meta-learning updates and can achieve better performance than previous methods such as EWC, GEM, and MER.

Strengths: The proposed method seems novel and relevant to the NeurIPS community.

Weaknesses: I think one limitation of this work is the relatively low accuracy reported in Table 1 compared to previous work. For example: 1. In [Ref1, Ref2], the authors reported an accuracy of 84-97% for EWC and 80-93% for GEM on the MNIST Permutation benchmark. The numbers reported in this paper are much lower, at only 62% for EWC and 55% for GEM. 2. For MER, according to [Ref3], the RA is 85.50% on MNIST Permutation, which is much higher than the 73.46% reported here. 3. For GEM, according to [Ref4], the accuracy on MNIST Rotations is 86% even when restricted to only 1 epoch per task. The number reported in Table 1 here is only 67.38%. There seem to be systematic differences between previously reported results and the results in this paper. The authors should explain why there are such differences. [Ref1] Nguyen et al. Variational Continual Learning. 2018. [Ref2] Swaroop et al. Improving and Understanding Variational Continual Learning. 2018. [Ref3] Riemer et al. Learning to Learn without Forgetting by Maximizing Transfer and Minimizing Interference. 2019. [Ref4] Lopez-Paz et al. Gradient Episodic Memory for Continual Learning. 2017.

Correctness: Yes, but there is a discrepancy between the results in this paper and previous results, as discussed above.

Clarity: The paper is mostly well-written. However, there are some issues: 1. The RHS of Eq (1) does not seem to be equal to the LHS. 2. In the experiments, the paper does not specify which models are used. 3. There should be more details about the Many Permutations benchmark. 4. There should be appropriate punctuation after the equations (for example, Eq 3, 6, 7, etc).

Relation to Prior Work: Yes.

Reproducibility: No

Additional Feedback: Post-rebuttal: The rebuttal has addressed my concerns regarding the discrepancy between the results in this paper and previous results. I have increased the score and hope the authors will highlight the differences in experimental settings in the final version.

Review 3

Summary and Contributions: The authors propose a novel Meta Learning-based Continual Learning approach that expands on previously existing works (OML, MER) by reorganizing the way data is handled in meta-optimization and incorporating a learnable learning rate adjustment framework. An analysis and comparison of existing meta-learning CL methods is also provided and experiments are conducted to compare the proposed method with the state of the art.

Strengths: - The presented results are very encouraging in the task-incremental setting as defined by Hsu et al. (see below). - The critical review of existing Meta-Learning CL methods is useful for forming a comprehensive picture of them. It is particularly useful to establish an equivalence framework, since these works are not always easy to follow (due to complex notation).

Weaknesses: I found some issues with the experiments, which I list below. - Line 215 states that experiments refer to “task incremental settings”. This term has a specific meaning in CL literature [3,4]: it usually means “multihead”, i.e. task labels are given at inference time. I understand that this is the setting that is featured in section 5.2. Recent literature [1, 2, 5, 6] argues that this setting is trivial and that the Single-head/Class-Incremental setting (i.e. no task labels at test time) should be preferred. Providing Class-IL results could therefore be of great help to understand how La-MAML performs in a more challenging setting. - I find the running time comparison in Table 2 not entirely satisfactory. MER is arguably among the slowest methods in CL literature. While I see the point of comparing La-MAML with its strongest competitor w.r.t. Table 1, other competitors (e.g. ER, iCaRL, EWC) are much, much faster. I believe it would be fairer to show a comparison with at least one of these methods, to let the reader understand what increase in time complexity corresponds to the increase in performance brought by La-MAML. - The backbone network used in section 5.2 (see appendix F, lines 506 and following) is an unusual choice and I do not know of any other work adopting it. There appear to be very few convolutional layers compared to ResNet (ResNet-based models are commonly used for the related datasets [1, 6, 7, 8 below; 6, 17, 21 in the paper's references]). In addition, the fully connected layers are very large. With a backbone design this unfamiliar, I believe that the experimental results are difficult to interpret. [1] R. Aljundi et al. Gradient based sample selection for online continual learning. NIPS 2019. [2] De Lange et al. A continual learning survey: Defying forgetting in classification tasks. arXiv preprint arXiv:1909.08383, 2019. [3] G. M. van de Ven and A. S. Tolias. Three continual learning scenarios. NIPS Continual Learning Workshop, 2018. [4] YC Hsu et al. Re-evaluating continual learning scenarios: A categorization and case for strong baselines. NIPS Continual Learning Workshop, 2018. [5] S. Farquhar and Y. Gal. Towards robust evaluations of continual learning. arXiv preprint arXiv:1805.09733, 2018. [6] R. Aljundi et al. Online continual learning with maximal interfered retrieval. NIPS 2019. [7] Wu et al. Large Scale Incremental Learning. CVPR 2019. [8] Hou et al. Learning a Unified Classifier Incrementally via Rebalancing. CVPR 2019.
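The difference between the task-incremental (multi-head) and class-incremental (single-head) settings discussed in this review comes down to whether the task label is available at test time. The sketch below illustrates this for a single example; the function name, the logits, and the class-to-task assignment are hypothetical and serve only as an illustration.

```python
def predict(logits, task_classes=None):
    """Return the predicted class index for one example.

    task_classes given: task-incremental (multi-head) evaluation, where the
        task label restricts the prediction to that task's classes.
    task_classes None: class-incremental (single-head) evaluation, where the
        prediction is taken over all classes seen so far.
    """
    candidates = range(len(logits)) if task_classes is None else task_classes
    return max(candidates, key=lambda c: logits[c])


# Hypothetical logits over four classes; task 0 owns classes {0, 1},
# task 1 owns classes {2, 3}. The true class of this example is 3.
logits = [0.1, 2.0, 0.3, 1.5]

multi_head = predict(logits, task_classes=[2, 3])  # task label provided
single_head = predict(logits)                      # no task label
```

With the task label the example is classified correctly as class 3, while without it the highest logit belongs to class 1 from the other task, which is why single-head evaluation is the harder setting.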

Correctness: Claims and method seem correct. I find the empirical methodology correct overall, although I have listed some criticisms of the experimental settings.

Clarity: The paper is remarkably clear, I congratulate the authors for that.

Relation to Prior Work: Yes, there are two related sections that cover both loosely-related works and specific differences with other Meta-Learning approaches.

Reproducibility: Yes

Additional Feedback: I am happy with the author rebuttal. They answered my request for more details about the experimental setting. I hope the authors commit to revising the paper accordingly. It is fine to have a different experimental setup, but it must be crystal clear in the manuscript. Given the rebuttal, I raised my score by one tick.

Review 4

Summary and Contributions: The authors propose a new meta-learning based method for online continual learning.

Strengths: Their base method looks incredibly simple, i.e. aligning on average instead of pairwise gradients, yet seems to improve significantly on existing methods for online learning. The analysis is intuitive but sound. Their subsequent La-MAML is also intuitive and simple, based on an investigation of the correlation between LR gradients, resulting in a straightforward algorithm. Although incremental, the work has a certain novelty and is highly relevant to the NeurIPS community.

Weaknesses: I can find few flaws. Perhaps the use of symbols and their super- and sub-scripts could be clearer and their meanings better explained.

Correctness: All appear sound, though I have not fully examined the appendices.

Clarity: The paper is well written indeed.

Relation to Prior Work: Fairly clear.

Reproducibility: Yes

Additional Feedback: