NeurIPS 2020

Agree to Disagree: Adaptive Ensemble Knowledge Distillation in Gradient Space


Review 1

Summary and Contributions: This paper mainly investigates knowledge distillation (KD) from an ensemble composed of multiple teacher networks. To perform ensemble KD, most methods simply follow the average rule and neglect the diversity or differences among teachers. In contrast, this paper breaks with this routine and proposes to adaptively weight each teacher in the ensemble so that the knowledge can be leveraged more flexibly. From the perspective of multi-objective optimization, the paper introduces a slack variable to model the disagreement among teachers and proposes a gradient descent method accordingly. The proposed method is also well explained in the case of typical KD, where integrating the ensemble amounts to "logits alignment". Extensive experimental results on the CIFAR10/100 and ImageNet datasets validate the effectiveness of the proposed method.
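
To make the gradient-space weighting concrete, here is a minimal NumPy/SciPy sketch of the idea as I understand it; it is my own illustration, not the authors' exact algorithm, and the helper name teacher_weights and the cap C are purely illustrative.

```python
# Minimal sketch: choose teacher weights alpha that minimize the norm of the
# combined KD gradient, with a cap C standing in for the tolerated disagreement
# among teachers. Illustrative only, not the paper's exact solver.
import numpy as np
from scipy.optimize import minimize

def teacher_weights(kd_gradients, C):
    """kd_gradients: (M, D) array, one flattened KD gradient per teacher.
    Returns alpha (M,) with sum(alpha) = 1 and 0 <= alpha_m <= C."""
    M = kd_gradients.shape[0]
    G = kd_gradients @ kd_gradients.T            # Gram matrix of teacher gradients

    objective = lambda a: a @ G @ a              # || sum_m a_m g_m ||^2
    constraints = [{"type": "eq", "fun": lambda a: a.sum() - 1.0}]
    bounds = [(0.0, C)] * M                      # the cap keeps any one teacher from dominating

    a0 = np.full(M, 1.0 / M)                     # start from the uniform average (AVER)
    return minimize(objective, a0, bounds=bounds, constraints=constraints).x

# The student would then be updated with sum_m alpha_m * g_m rather than the plain average.
print(teacher_weights(np.random.default_rng(0).normal(size=(3, 8)), C=0.6))
```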

Strengths: 1. The intuition of this paper is good. Instead of following the routine average rule, the paper adaptively weights the teachers in the ensemble via a gradient-based algorithm that models the disagreement among teachers well. 2. The proposed method has a nice explanation for both logit-based and feature-based knowledge distillation. The logits-alignment perspective seems interesting to me. 3. Elaborate and corroborative mathematical proofs are provided to verify the correctness of the algorithm. 4. Extensive experimental results on the CIFAR10/100 and ImageNet datasets validate the effectiveness of the proposed method.

Weaknesses: More analysis of the disagreement parameter should be provided to make the ablation studies more comprehensive. More references could be included to give a fuller background on KD.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Please address the comments accordingly. Post-rebuttal: I have read the responses and the other reviewers' comments. The authors have addressed most concerns, and I am satisfied to accept this paper.


Review 2

Summary and Contributions: This paper concentrates on ensemble learning of teachers for better transferring knowledge to the student. The authors cast ensemble knowledge distillation as a multi-objective optimization problem that obtains a set of weights for the different teachers, based on their gradients, to guide the learning procedure of the student. Both logit-based and feature-based KD are considered in the experiments to explore the applicability of the proposed method.

Strengths: 1. The authors mathematically analyze the equivalence between averaging the KD losses and averaging the softened outputs in AVER. 2. Ensemble knowledge distillation is formulated as a multi-objective optimization problem, and an optimization upper bound is derived by means of Lagrange multipliers and the KKT conditions. 3. The paper is well written and organized.
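
Regarding point 1, the equivalence is easy to check numerically when the KD loss takes the soft cross-entropy form (for the KL form the two views differ only by a term that does not depend on the student, so the gradients coincide). The toy check below is my own illustration with arbitrary logits and temperature.

```python
# Numerical check: the average of per-teacher soft cross-entropy losses equals
# the single loss computed against the averaged softened teacher outputs.
import numpy as np

def softmax(z, T=4.0):
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(3, 10))        # M = 3 teachers, 10 classes
student_probs = softmax(rng.normal(size=10))
teacher_probs = np.array([softmax(z) for z in teacher_logits])

avg_of_losses = np.mean([-(p * np.log(student_probs)).sum() for p in teacher_probs])
loss_of_avg = -(teacher_probs.mean(axis=0) * np.log(student_probs)).sum()

print(np.isclose(avg_of_losses, loss_of_avg))    # True: AVER's two views coincide
```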

Weaknesses: 1. The motivation of AE-KD is to encourage the optimization direction of the student to be guided equally by all the teachers. However, if there are some weak teachers (with low generalization accuracy) in the ensemble teacher pool, why should these weak teachers be treated equally with the strong teachers in the gradient space? Intuitively, the guidance of the student should favor the strong teachers and keep away from the weak ones. 2. The proposed AE-KD seems similar to the previous work OKDDip [1], where dynamic attentions are learned from gradients to adaptively weight the teachers' logits. What is the difference between them? 3. How are the weights \alpha_m in Eq. (11) optimized? Are they optimized end-to-end together with the student? If yes, how is \alpha_m kept below C during optimization? 4. The teacher resnet56 is trained for 240 epochs. Why is the student resnet20 trained for 350 epochs? 5. In ensemble learning, it is valuable that the ensemble networks have various architectures with different accuracies. How does AE-KD perform in this situation? [1] Online Knowledge Distillation with Diverse Peers. AAAI, 2020.

Correctness: The formulations are correct and easy to understand. But the motivation confuses me. Please refer to the above comments.

Clarity: The paper is well-written and organized.

Relation to Prior Work: The proposed AE-KD seems similar to the previous work OKDDip. The authors should discuss it and clarify the differences between OKDDip and AE-KD.

Reproducibility: No

Additional Feedback: The authors' feedback partially resolves my concerns. I have decided to increase my overall score.


Review 3

Summary and Contributions: This work regards ensemble knowledge distillation as a multi-objective optimization problem and proposes a novel gradient-based algorithm that computes a set of weights for the different teachers based on their gradients and decides how the student learns from the teachers in each minibatch. This method adaptively combines the knowledge from multiple teachers and thus provides the student with better guidance than the traditional averaged loss.
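
For concreteness, below is a self-contained PyTorch-style sketch of one such minibatch step, simplified to two teachers so that the min-norm weight has a closed form; the temperature, the cap C, and the overall loss balance are my own assumptions, not the authors' recipe.

```python
# Hedged sketch of a per-minibatch distillation step with gradient-based teacher
# weighting, restricted to two teachers. Illustrative only.
import torch
import torch.nn.functional as F

def kd_loss(s_logits, t_logits, T=4.0):
    # Soft-target KD loss with temperature T (standard Hinton-style scaling).
    return F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * T * T

def distill_step(student, teacher_a, teacher_b, x, y, optimizer, C=0.7):
    optimizer.zero_grad()
    s_logits = student(x)
    with torch.no_grad():
        la, lb = teacher_a(x), teacher_b(x)

    loss_a, loss_b = kd_loss(s_logits, la), kd_loss(s_logits, lb)
    ga = torch.cat([g.flatten() for g in
                    torch.autograd.grad(loss_a, student.parameters(), retain_graph=True)])
    gb = torch.cat([g.flatten() for g in
                    torch.autograd.grad(loss_b, student.parameters(), retain_graph=True)])

    # Closed-form min-norm weight for two objectives, clipped by the cap C
    # (assumed here to lie in (1/2, 1)) so neither teacher fully dominates.
    alpha = ((gb - ga) @ gb / (ga - gb).pow(2).sum().clamp_min(1e-12)).clamp(1 - C, C)

    # Weighted KD losses plus the usual hard-label loss drive the student update.
    loss = F.cross_entropy(s_logits, y) + alpha * loss_a + (1 - alpha) * loss_b
    loss.backward()
    optimizer.step()
```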

Strengths: 1. This paper proposes a new viewpoint for ensemble knowledge distillation. 2. The method sounds reasonable and the empirical results are promising.

Weaknesses: While the analysis in Section 2 is interesting, it is not very relevant to the main focus of this paper; it might be better to remove this part. Several minor issues about the experiments: 1. It would be better to conduct multiple runs and report the mean and variance. 2. AEKD* is better than AEKD on CIFAR10 but worse on CIFAR100. Any explanation? 3. Figure 2 seems to suggest that with more teacher models, AVER will catch up with the proposed method and even outperform it. I would like to see results with more teachers. 4. The teacher models are very weak and far from SoTA models. It might be better to conduct experiments with stronger teacher models. 5. Line 244: C \in [1:M]. Should it be C \in (1/M, 1)?

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: