Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
[Originality] While modifying and extending PathNet is exciting, most concepts of PathNet are reused in the submission. The submission is rather incremental in the sense that the framework is practically a combination of useful existing techniques: parameter sharing/reuse, knowledge distillation and retrospection, and model plasticity. The novel part of the submission is the implementation details of the framework and the corresponding training methodology that make the framework powerful.
[Quality] The submission is technically sound, and the framework is comprehensive and flexible, incorporating many existing techniques. The experimental results clearly show that the proposed method outperforms previous works. But some detailed analysis is missing. For instance, how does varying the number of exemplars affect the performance? Also, the submission claims that the framework effectively reuses modules, but no comparison of model sizes with previous works is provided in Table 1. Last, how important is the augmentation mentioned in L223? As many other methods do not use an augmentation strategy, the effect of the augmentation should be made explicit.
[Clarity] The details of the paper are not easy to follow, especially in Section 3. Reproducing the path selection algorithm is not straightforward, and working out the detailed algorithm from the descriptions in the submission alone is difficult.
[Significance] The empirical results are strong, and the framework is clearly one of the top-performing methods in continual learning. The novelty, however, is weak, and the framework is a composition of existing components. The paper is not clearly written, and more analysis is necessary.
[After Rebuttal] The authors have addressed many of my concerns, and I have raised the overall score after the rebuttal.
***** I thank the authors for the response. I have also read author reviews. My concerns have been partly addressed and I will change my score to reflect this. My major remaining concern is the incremental nature of the work, which should be discussed more in the text, i.e., how each component of the algorithm/architecture has been used before. That said, there is value in combining existing works and showing that the combination works better. It would also be great if claims that are not empirically supported were removed from the paper. *****
Originality: As far as I know, the combination of several continual learning strategies proposed in this submission is new. However, it is not clear to me that each of these strategies on its own is novel. For example, a special case of the objective function has been used by iCaRL (as noted on line 255). The modification in this paper is the phi term that balances the regularisation term and the prediction loss. I feel this is hardly a "novel controller". The use of core-set points or exemplar replay is not new -- this is however not discussed in the main methodology. The paper also suggests that previous methods fail in some continual learning aspects, but note that these previous methods can also be combined as is done in this paper.
Clarity: I think the paper is generally well written. There are however hyperparameters in the proposed approach that should be discussed.
Quality:
+ The paper states that the proposed random search is better than reinforcement learning or genetic algorithm based methods, but does not provide any experiments on this. Maybe in terms of parallelism and time efficiency, this makes sense. But running multiple training routines in parallel is also expensive.
+ As stated above, it would be good to understand which component of the proposed method contributes to the performance gain, e.g. maybe coreset points are already sufficient for all the gain seen.
+ Network size and capacity: it would be good to have a comparison to previous approaches that dynamically add capacity to the network. The random path selection strategy here does this when M is large, so it would not be fair to criticise other methods (line 98). As seen in Figure 4, increasing the network size improves the performance significantly.
+ I think the stability-plasticity dilemma is well known in the continual learning community and is somewhat over-emphasized in this submission. Most existing continual learning algorithms are actually trying to balance these two aspects of learning and provide either an objective function or a network architecture, or both, to deal with this issue. The proposed method falls into this category.
+ The regularisation term in eqn 5: Is this a valid KL divergence (due to the log)? How is the temperature t_e chosen?
Clarity: The paper is well written in general. The experiments seem extensive and thorough. Line 127: remarkably -> unsurprisingly?
Significance: The paper attempts to address an important and challenging problem. The proposed methodology seems interesting and the results show that it works well in the class-incremental learning setting in continual learning. However, it is a hybrid approach that involves many moving parts, and it is unclear which of them should be adopted by the community. Whilst the experiments show that the approach works, it would be good to know whether it also works in the non-class-incremental learning settings that are typically tested in other continual learning work, and whether all contributions claimed in the intro are supported by empirical evidence.
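The question about eqn 5 turns on how a temperature-scaled distillation term relates to a proper KL divergence. The sketch below is a generic illustration only, not a reproduction of the paper's eqn 5: the function names, the example logits, and the temperature value are all assumptions made for the example. It shows that a KL divergence between two softened distributions is non-negative and zero only when they match, and that the cross-entropy form often used for distillation differs from it by the entropy of the target distribution, which is constant with respect to the new model.

```python
import numpy as np

def softmax_t(logits, t=1.0):
    """Temperature-scaled softmax; a larger t gives a flatter distribution."""
    z = np.asarray(logits, dtype=float) / t
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * (log p_i - log q_i), averaged over any batch axis."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log q_i, the form often used as a distillation loss."""
    q = np.clip(q, eps, 1.0)
    return float((-np.sum(p * np.log(q), axis=-1)).mean())

def entropy(p, eps=1e-12):
    """H(p) = -sum_i p_i * log p_i."""
    p = np.clip(p, eps, 1.0)
    return float((-np.sum(p * np.log(p), axis=-1)).mean())

# Hypothetical logits from the previous model (target) and the current model.
old_logits = [2.0, 0.5, -1.0]
new_logits = [1.5, 1.0, -0.5]
t_e = 2.0  # assumed temperature, chosen only for illustration

p = softmax_t(old_logits, t_e)
q = softmax_t(new_logits, t_e)
print("KL(p || q):", kl_divergence(p, q))
print("H(p, q) - H(p):", cross_entropy(p, q) - entropy(p))
```

Under these assumptions the two printed quantities agree, so whether a given regulariser counts as a "valid KL divergence" depends on whether both arguments are normalised distributions and whether the entropy term is included or dropped.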
[UPDATE] I will maintain my score (7) because I believe the paper makes a clear and substantial contribution to the CL literature. I still recommend acceptance and I am willing to trust that the authors will improve the writing. Below I quickly discuss other concerns. I agree with R2 and R3 that more analyses were necessary, but I am satisfied with the authors' response for both the components of the loss (figure (a)) and the number of exemplars (figure (c)). I thank my colleague reviewers for insisting. If you follow figure (a) closely, the contributions of each component of the loss increase as more classes are added, which I find quite interesting. I think this really adds to the paper, and the authors are most likely going to discuss this in the final manuscript. Overall I am satisfied with the new amount of analysis. While the conceptual contribution may seem relatively low vis-a-vis PathNets, I would like to argue that using very high capacity models for such problems is actually a novelty on its own which is not covered by PathNets. In fact, I can't think of any CL work out there which managed to generalize using such large networks on the datasets typically used in the field; I think it's safe to say that this paper uses at least 10x more parameters and 10x more computation compared to previous methods. But I believe the paper does a great job of showing that previous CL methods actually do work better at this novel scale and using residual architectures. I believe that not only is the combination of previous methods new, but none of those methods were proven to still be relevant at this scale before. While the writing seems rushed at times, I am satisfied with the clarifications I received. I also agree with R2 and R3 that some of the claims were not well supported, but those claims were not central to the paper, imho. It's a simple matter to remove claims like superiority to genetic algorithms, which is weak anyway.
I will not increase my score to 8 because the other reviewers are correct that the novelty is not earth shattering, but I believe it's substantial enough for acceptance. The combination of existing methods and the creative use of ResNets at unprecedented scale for continual learning experiments are nicely complemented by strong technical execution. This is an exciting new ground for continual and incremental deep learning.
[OLD REVIEW]
Originality: The paper spends excessive amounts of space trying to differentiate itself from previous work at the conceptual level. This is not necessary imho, as the paper is valuable on its technical innovations alone. The novelty comes from a thorough and judicious investigation of much larger and more expensive network architectures and from specifically hand-engineering an incremental learning algorithm for the sequential classifier learning case with a (small) persistent data store. The model is highly over-parameterized compared to previous works or to the capacity that is required for good generalization in the regular multi-task training case. While, in a sense, the claims of bounded computation and memory footprint are true, they are only true because all the needed capacity is allocated beforehand, so model growth is not needed in the use cases considered; this is however no guarantee for the more general case of sufficiently different tasks, or simply for more tasks in the incremental learning sequence. Claims of computational efficiency are similarly overblown. The model is indeed able to benefit from independent exploration, fully in parallel, of training N different novel paths. But this is a trivial claim: so can every other model, e.g. by exploring different hyperparameters and choosing the best in validation. The claims of difference to genetic algorithms and reinforcement learning are also overblown. Random path selection *is* a basic type of genetic algorithm.
Furthermore, the sequential, balanced task classifier learning setup is one of the simplest examples of continual learning problems; it is not at all clear that random path selection would work as well for much more complex cases, such as multi-agent reinforcement learning or GAN training, both of which are much less well-behaved continual learning problems. The claims of minimal computational overhead over competing approaches are simply ridiculous. Minimal overhead over huge cost still adds up to huge costs, since the proposed architecture is very expensive compared to even recent works. Not only are many paths explored simultaneously, but even a single path is much more expensive to train and evaluate compared to most other approaches in the table. But this is not necessarily an issue; we want to increase performance, and computation is becoming more abundant and cheap. However, such misleading claims are an issue.
Quality: The experiments look thorough, and the improvements look large enough to be significant, although it would be nice to have some error bars. Ablation and FLOPS calculations are clearly interesting. The discussion of backwards transfer is particularly shallow. It’s not at all clear to me that the improvements don’t really come from training more on the stored samples rather than from some backwards transfer effect of learning new tasks on different data. Some analysis and targeted experiments would be required.
Clarity: The writing is a bit silly here and there, especially when over-hyping and comparing to previous work. This only detracts from the otherwise good quality of the paper.
Significance: Perhaps significant beyond incremental classifier learning, although the paper does little to convince us of it.
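The old review's remark that random path selection is a basic type of genetic algorithm can be made concrete with a small sketch. Everything below is an assumption made for illustration and is not taken from the submission: a path is modelled as one module index per layer, `toy_score` is a hypothetical stand-in for "train the path, then measure validation accuracy", and the function and parameter names are invented here. Read this way, sampling N random paths and keeping the best one looks like a single generation of a genetic algorithm with population size N, selection only, and no mutation or crossover.

```python
import random

def sample_path(num_layers, num_modules, rng):
    """Pick one module index per layer, uniformly at random (assumed path encoding)."""
    return tuple(rng.randrange(num_modules) for _ in range(num_layers))

def random_path_selection(num_layers, num_modules, num_candidates, score_fn, seed=0):
    """Sample candidate paths and return the one with the highest score.

    The candidates are scored independently of each other, so in practice they
    could be evaluated in parallel; here they are scored sequentially.
    """
    rng = random.Random(seed)
    candidates = [
        sample_path(num_layers, num_modules, rng) for _ in range(num_candidates)
    ]
    return max(candidates, key=score_fn)

def toy_score(path):
    """Hypothetical score: stands in for validation accuracy of a trained path."""
    return -sum(path)

best = random_path_selection(
    num_layers=4, num_modules=3, num_candidates=8, score_fn=toy_score, seed=0
)
print("selected path:", best)
```

Swapping `toy_score` for an expensive train-and-validate routine also makes the cost argument visible: the search itself is trivial, and the total cost is the number of candidates times the cost of training one path.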