NeurIPS 2020

On the Theory of Transfer Learning: The Importance of Task Diversity


Review 1

Summary and Contributions: 1. This work proves a new generalization bound for transfer learning in the realizable setting via first learning a common representation and then training a classification head for the new task. 2. Examples are given for logistic regression, deep neural network regression, and robust regression for single-index models. 3. A novel chain rule for Gaussian complexities is derived as an additional technical contribution.
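For concreteness, here is a minimal sketch of the two-stage procedure in contribution 1 (a shared linear representation with per-task linear heads under the squared loss; the alternating gradient updates and the function names are illustrative assumptions, not the authors' exact algorithm):

import numpy as np

def learn_representation(Xs, ys, r, iters=500, lr=0.05):
    """Stage 1: ERM over t source tasks to fit a shared linear representation
    B (d x r) together with per-task linear heads, via alternating gradient steps."""
    d, t = Xs[0].shape[1], len(Xs)
    rng = np.random.default_rng(0)
    B = rng.normal(size=(d, r)) / np.sqrt(d)
    Ws = [np.zeros(r) for _ in range(t)]
    for _ in range(iters):
        grad_B = np.zeros_like(B)
        for j in range(t):
            Z = Xs[j] @ B                      # n_j x r task features
            resid = Z @ Ws[j] - ys[j]          # residuals under squared loss
            Ws[j] = Ws[j] - lr * (Z.T @ resid) / len(ys[j])          # head update
            grad_B += np.outer(Xs[j].T @ resid, Ws[j]) / len(ys[j])  # representation gradient
        B = B - lr * grad_B / t
    return B

def fit_target_head(B, X_new, y_new):
    """Stage 2: freeze the learned representation and fit only a new head
    on the (small) target-task sample."""
    w, *_ = np.linalg.lstsq(X_new @ B, y_new, rcond=None)
    return w

Freezing B in stage 2 is what lets the target task pay only for the complexity of the head class F rather than of the full composite class.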

Strengths: 1. The framework generalizes several previous works, and the bound works for general losses and model families. 2. Theoretical derivations are thorough and rigorous. Interpretations of theorems are clearly stated.

Weaknesses: 1. The framework assumes realizability. Perhaps the authors could discuss what happens under moderate model misspecification. 2. The training tasks are assumed to be homogeneous in sample size and complexity. In realistic scenarios, practitioners pre-train on only one or two complex tasks (such as ImageNet) before transferring to many downstream tasks. The theory does not explain why transfer learning works when the training tasks are not diverse. 3. In all three examples, the 'classifier head' hypothesis class F is linear. I wonder what task-diversity constants (Definition 3) can be derived for more complex families F, such as a multi-layer neural network. 4. Question: Does the analysis for neural networks only work for the squared loss? How about the logistic loss, or classification? 5. Question: Can more refined bounds than [1] be applied to deep neural networks? For example, how about data-dependent bounds, such as margin bounds, or bounds based on layer-wise Lipschitzness of the network on the training data [2]? 6. More experiments could be done to explore the applicability of the theory and the tightness of the bound. [1] Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. "Size-independent sample complexity of neural networks." Conference on Learning Theory. 2018. [2] Wei, Colin, and Tengyu Ma. "Improved sample complexities for deep neural networks and robust classification via an all-layer margin." International Conference on Learning Representations. 2019.

Correctness: The theoretical derivations seem correct, although I did not fully read the proof in the supplementary materials.

Clarity: Yes. The eigenvalue notation in Assumption 3 is not defined.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Update after author feedback: I thank the authors for their response to my questions. I think it would be interesting to explain why transfer learning works when the training tasks are not diverse (e.g., pre-training on one hard task can lead to good performance on many easier tasks).


Review 2

Summary and Contributions: The paper studies the problem of transfer learning to a task with few samples via learning a shared representation from many related tasks. The authors propose a two-stage ERM procedure and show that, under certain assumptions, the performance of the final predictor on the new task benefits from the representation learning step. Among these assumptions is a notion of task diversity that the authors propose, which captures the idea that the excess statistical risk decreases as the tasks used to learn the representation become more diverse. The authors use this general result to instantiate bounds for logistic regression, deep neural networks, and multi-index models. Towards proving these results, the authors develop a new chain rule for computing the Gaussian complexity of classes with a composite function structure. --------------- Post author feedback comments --------------- I thank the authors for the response. I encourage the authors to add the discussion on task diversity provided in the feedback to the paper itself. Based on this, I increase my score from 6 to 7.

Strengths: 1. The paper studies a problem/method which in recent years has been empirically very successful and is widely used, but has little theory. 2. The results are good, in the sense that the final bounds the authors obtain are intuitively expected and indeed showcase improvements via the representation learning step (unlike previous work). 3. The paper is mostly well written, the related work is discussed well, and the proofs are described rigorously. 4. The authors develop a new technical tool, a chain rule for Gaussian complexity, which could be applicable/useful in such composite function learning settings.

Weaknesses: To me, the theory felt abstract, in the sense that the quantities seemed conveniently defined so that the theoretical results follow. If the authors think that these are indeed natural quantities, then (in my opinion) their intuitive meaning is not sufficiently discussed. For example, the title and abstract of the paper suggest that a key contribution is the notion of task diversity. This quantity first appears on page 6 (Section 3.3, Definition 3), and besides explaining the formula very generally in a few lines, there is hardly any discussion of what it means. Furthermore, in the applications considered, the diversity assumption takes specific forms (like Assumption 3), which are also not (sufficiently) discussed.

Correctness: Yes, I skimmed through the proofs, which are mostly based on standard arguments, and look sound.

Clarity: Mostly well-written. Some more discussion on the proposed task diversity notion could be useful.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: 1. Why state results in terms of Gaussian complexity rather than Rademacher complexity, which is more common in the learning theory literature? Especially since the proofs in the paper start from standard Rademacher complexity bounds and then switch to Gaussian complexity. Is there a technical reason (I do not see any place in the proofs that crucially uses some fact about Gaussian random variables), or is it an arbitrary choice? 2. Typos: 1. In line 480 of the supplementary, what is $\hat f$? (It does not seem to be defined anywhere before.) 2. In line 626 of the supplementary, "we claim the set..", shouldn't the union be over $h \in C_{F_{h(x)}}^{\times t}$? 3. In line 629 of the supplementary, "By construction..", shouldn't it be $(f',h')$ instead of $(f,h')$?


Review 3

Summary and Contributions: This work presents new statistical guarantees in the context of transfer learning via shared representations. The decomposition of the Gaussian complexity of the end-to-end transfer learning pipeline into the Gaussian complexities of the representation class H and the task-specific class F is based on a novel chain rule for Gaussian complexities contributed by this work. A fundamental task-diversity measure, (ν, ε)-diversity, is also defined so that the statistical guarantees can be stated in terms of the diversity of a family of tasks.

Strengths: The new guarantees are stated in terms of decoupled representation complexity and task-specific function complexity. This enables a better understanding of the dependence on the sample size of the pre-training tasks, the sample size of the target task, the complexity of the representation model, and the complexity of the task-specific model. Indeed, we can see from Theorem 3 that the Gaussian complexity of the representation class must be greater than that of the task-specific class for the transfer learning risk to scale more slowly than the risk of the naive algorithm that learns each task in isolation, provided the sample size of the pre-training tasks is large enough. To provide these statistical guarantees, the work develops a new chain rule for Gaussian complexities and a task-diversity definition more general than those of previous works.
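To make the Theorem 3 comparison explicit (a schematic reading only; constants, logarithmic factors, and the dependence on the diversity parameter $\nu$ are omitted), the transfer learning risk scales roughly as
\[
\sqrt{\frac{\mathcal{C}(\mathcal{H}) + t\,\mathcal{C}(\mathcal{F})}{n t}} \;+\; \sqrt{\frac{\mathcal{C}(\mathcal{F})}{m}},
\]
whereas learning the target task in isolation scales as $\sqrt{(\mathcal{C}(\mathcal{H}) + \mathcal{C}(\mathcal{F}))/m}$. Here $\mathcal{C}(\cdot)$ denotes a complexity measure of the class, $n$ is the per-task sample size over the $t$ pre-training tasks, and $m$ is the target-task sample size. The transfer bound is the smaller one precisely when $\mathcal{C}(\mathcal{H})$ dominates $\mathcal{C}(\mathcal{F})$ and $nt$ is large, which is the regime described above.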

Weaknesses: The Gaussian complexities for H and F likely yield loose bounds. I do not see this as a strong limitation, but rather as future work.

Correctness: As stated earlier, the assumptions behind the claims are reasonable. I did not verify the proofs in the appendix.

Clarity: As someone with limited familiarity with theoretical work on statistical guarantees, I found the paper written well enough to follow and understand overall.

Relation to Prior Work: The related work section gives a great overview of the topic and clear comparisons with the current work.

Reproducibility: Yes

Additional Feedback: I am no expert in theoretical work and therefore could only provide limited feedback. I apologize for this.