NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 1101
Title: Catastrophic Forgetting Meets Negative Transfer: Batch Spectral Shrinkage for Safe Transfer Learning

Reviewer 1


The method proposed in this paper is novel and the experiments are very solid. In Table 2 the authors report results on 5 datasets, at 4 different sampling rates, for 3 different methods combined with the new regularizer, and the proposed method performs very well in all settings.

However, I am not convinced by the explanation given in the paper. First of all, I do not understand what "transfer" in "negative transfer" means here. Does it mean fine-tuning from a pre-trained model, using the features of a pre-trained model, or using a technique like L2-SP to preserve the previous features? The authors conclude in Section 3.2 and Fig. 1(a) that negative transfer does exist, but from my perspective Fig. 1(a) only shows that L2-SP performs worse than L2. Fig. 1(b) shows that lower layers are more transferable than higher layers and that transferability is somehow indicated by the singular values. Fig. 1(c)(d) shows that with more training data the singular values become smaller, especially in the smaller half of the spectrum. These are all very interesting results, but I cannot see how they relate to negative transfer.
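For readers tracing the Fig. 1(c)(d) analysis the reviewer refers to, here is a minimal sketch of how such a singular value spectrum can be computed from a batch of deep features. This is illustrative PyTorch, not the authors' code; the backbone and the function name are assumptions.

    import torch

    @torch.no_grad()
    def feature_spectrum(backbone, images):
        """Singular value spectrum of a batch feature matrix F,
        in the spirit of the paper's Fig. 1(c)(d) analysis."""
        feats = backbone(images)            # (batch_size, ...) deep features
        feats = feats.flatten(start_dim=1)  # flatten any spatial dimensions
        return torch.linalg.svdvals(feats)  # singular values, descending order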

Reviewer 2


This paper investigates whether negative transfer exists when fine-tuning deep networks pre-trained on a source domain, and how to deal with it. Batch Spectral Shrinkage (BSS) is proposed as a regularization that suppresses the features which potentially cause negative transfer, indicated by the small singular values of SVD(F), where F is the batch feature matrix (a minimal sketch follows this review). The main problem addressed by the paper is vivid and well motivated. The proposed regularization is intuitive, built on the familiar singular value decomposition. The empirical evaluations show the effectiveness of BSS over a range of datasets: models fine-tuned with BSS perform on par with or better than those without it, especially with a limited number of fine-tuning examples.

Improvements:
- In my opinion, the fine-tuning step is a special case of continual learning (CL) with only one additional step. It would be interesting if BSS could be incorporated into existing CL methods such as EWC (Kirkpatrick et al., 2017) and/or LwF (Li & Hoiem, 2016) -- is that even possible?
- It would be great to evaluate a BSS fine-tuning use case other than visual recognition, e.g., text classification with pre-trained word embeddings.

==============
After the rebuttal: I thank the authors for responding to my concerns by reporting additional experimental outcomes with EWC and on text classification. I have raised my final score by one level. Please incorporate the new experimental results into the manuscript / supplemental materials.
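For concreteness, the BSS penalty the review summarizes is the sum of squares of the k smallest singular values of the batch feature matrix. A minimal PyTorch sketch follows; the function name, the default k = 1, and the trade-off weight eta are illustrative assumptions, not the authors' implementation.

    import torch

    def bss_penalty(features: torch.Tensor, k: int = 1) -> torch.Tensor:
        """Batch Spectral Shrinkage: penalize the k smallest singular
        values of the batch feature matrix F (batch_size x feat_dim)."""
        sigma = torch.linalg.svdvals(features)  # sorted in descending order
        return (sigma[-k:] ** 2).sum()

    # During fine-tuning the penalty is added to the task loss, e.g.:
    #   loss = cross_entropy(logits, labels) + eta * bss_penalty(feats)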

Reviewer 3


1. The reviewer thinks that the novelty of this paper is insufficient. The title of this paper is "Catastrophic Forgetting Meets Negative Transfer", yet the part that deals with catastrophic forgetting only reuses previous methods, and the formulation merely adds the proposed BSS regularization to those methods. There are also no ablation studies verifying the effectiveness of the two parts separately, i.e., the catastrophic forgetting part and the negative transfer part.
2. Line 167 mentions that "in the higher layers, only eigenvectors corresponding to relatively larger singular values produce small relative angles. So aligning all weight parameters indiscriminately to the initial pre-trained values is risky to negative transfer." Then why not re-initialize all the higher-layer parameters and train them again? Part of the transfer learning literature, e.g., "A Survey on Transfer Learning", transfers only the parameters of the lower layers. Are there any experiments verifying the pros and cons of this approach?
3. The paper analyzes the influence of both network parameters and feature representations on negative transfer. Why use feature regularization instead of parameter regularization (see the sketch after this review)? Are there any experiments verifying this choice?
4. The paper mainly addresses the negative transfer phenomenon in fine-tuning, but the comparison methods all target catastrophic forgetting; no negative transfer method is included. Why not compare with state-of-the-art negative transfer methods, e.g., "Characterizing and Avoiding Negative Transfer" (2018), "Deep CORAL: Correlation Alignment for Deep Domain Adaptation" (2016), "Adapting Visual Category Models to New Domains" (ECCV 2010), and "Adversarial Discriminative Domain Adaptation" (CVPR 2017)?
5. Some statements in the paper are repeated, and the formatting of the references is very inconsistent.
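As a reference point for item 3, the parameter-regularization alternative the reviewer has in mind is exemplified by L2-SP (Li et al., 2018), which pulls shared weights toward their pre-trained values rather than regularizing features. A minimal sketch, with illustrative names and default coefficients:

    import torch

    def l2_sp_penalty(model: torch.nn.Module,
                      pretrained: dict,
                      alpha: float = 0.1,
                      beta: float = 0.01) -> torch.Tensor:
        """L2-SP: pull parameters shared with the pre-trained model toward
        their starting values; plain L2 decay on the new task head."""
        shared, novel = 0.0, 0.0
        for name, param in model.named_parameters():
            if name in pretrained:  # detached copies of the starting weights
                shared = shared + ((param - pretrained[name]) ** 2).sum()
            else:                   # newly added task-specific parameters
                novel = novel + (param ** 2).sum()
        return alpha / 2 * shared + beta / 2 * novel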