NeurIPS 2020

Model Fusion via Optimal Transport


Review 1

Summary and Contributions: The main contribution of this work is to introduce a layer-wise approach to fusing the neurons and weights of neural networks. The key idea is a barycenter-based fusion of the networks.

Strengths: The main strength of the work is demonstrating improved efficacy for model fusion over typical approaches like averaging.

Weaknesses: See Additional feedback.

Correctness: The claims and method seem appropriate.

Clarity: The paper is well written.

Relation to Prior Work: The prior work is sufficiently discussed.

Reproducibility: Yes

Additional Feedback: I have some questions regarding the continual fusion of N > 2 models and catastrophic forgetting. Although only a few models are fused in this work, I am wondering what the net effect would be of continually fusing models in this way. The reason for this inquiry is that barycentric fusion of multiple empirical distributions (in this work the networks are, in a sense, simply empirical distributions) will ultimately "blur" the distributions. Similarly, I can see fusing multiple models as overall "losing" representational capacity with respect to the individual tasks each network was trained on, unless there is some fundamental similarity between the underlying tasks (in which case no loss, or even an improvement, may be observed). Do you see this as being the case, and can you provide an intuition for your understanding, perhaps from a regularization perspective?

==================

There is definitely scope for further development of this work, for which it seems an acceptable and interesting baseline. The additional responses and discussion in the rebuttal provide extra clarity, so I increase my score to Accept. A thought comes to mind now that eluded me before: I would have actually loved to see how this performs on a very simple model rather than a complex one. Furthering this comment, I believe it would be illuminating for the authors to discuss (in future work, of course) how this method "scales" to non-deep-learning algorithms. Given the generality of the title, such a strong focus on neural networks seems somewhat inappropriate in this light.


Review 2

Summary and Contributions: This paper proposes a layer-wise fusion algorithm for neural networks based on optimal transport of the parameters in each layer. For various applications, the proposed algorithm shows superior performance over vanilla averaging.

Strengths: Fusing several models into a single model using only their parameters is interesting and has multiple applications such as federated learning. For various settings, the proposed fusion algorithm showed superior performance over the vanilla baseline. The potential for federated or decentralized learning seems to be an interesting point.

Weaknesses: The paper is not well written; e.g., the algorithm part is not clearly organized and presented. The only compared methods are 'prediction ensembling' and 'vanilla averaging' across all experiments, which is neither convincing nor sufficient. For example, in the pruning experiments, there are published structured pruning methods [25-27]. Only very small datasets (MNIST, CIFAR-10) are used in the experiments. It is not clear whether the network and model will overfit on such small-scale datasets.

Correctness: The claim is correct. The methodology is correct.

Clarity: No.

Relation to Prior Work: Yes. In the related work, the authors addressed the relations and differences to previous work.

Reproducibility: Yes

Additional Feedback: In my understanding, "acts" needs to make use of additional unlabeled examples and a forward pass through each of the K individual models. Under this setting, what is the advantage compared to the "prediction averaging" method?

Some typos: In line 126, "For e.g." In line 231, "e.g." lacks commas. In line 250, "Thus, ".
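To make the data requirement concrete, the following is a minimal sketch (not the paper's implementation; the function and variable names are hypothetical) of how an activation-based cost matrix could be formed: each pair of models is run forward on the same small batch of unlabeled examples, and neurons are compared by their activation vectors over that batch.

    import numpy as np

    def activation_cost_matrix(acts_a, acts_b):
        """acts_a, acts_b: (num_samples, num_neurons) activations of one layer
        of model A and model B on the same batch of unlabeled examples."""
        # cost[i, j] = squared Euclidean distance between the activation profile
        # of neuron i in model A and neuron j in model B
        diff = acts_a.T[:, None, :] - acts_b.T[None, :, :]
        return (diff ** 2).sum(axis=-1)

This is what makes the activation-based variant need unlabeled data and forward passes, whereas a purely weight-based alignment would need only the parameters.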


Review 3

Summary and Contributions: The paper proposes to use the formulation of Optimal Transport (OT) to align the channels/neurons of two or more different models, and then perform weight fusion by averaging. Such weight fusion is beneficial in cases like special/general multi-tasking, pruning, federated learning, and ensembling. In each case, the experiments show that OT fusion outperforms vanilla averaging of weights.
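As an illustration of this pipeline, here is a minimal sketch (not the authors' code) of layer-wise alignment followed by averaging for two fully connected networks of identical architecture. It uses a hard one-to-one assignment via the Hungarian solver, which is the special case of optimal transport with uniform marginals and equal layer widths; the paper's formulation also covers soft transport plans, activation-based costs, and layers of different sizes. Biases are omitted for brevity.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def fuse_two_models(weights_a, weights_b):
        """weights_a, weights_b: lists of (out_dim, in_dim) weight matrices, one per layer."""
        fused = []
        perm = np.arange(weights_b[0].shape[1])   # identity permutation for the inputs
        for l, (wa, wb) in enumerate(zip(weights_a, weights_b)):
            wb = wb[:, perm]                      # undo the reordering applied to the previous layer
            if l == len(weights_a) - 1:
                perm = np.arange(wa.shape[0])     # output neurons (logits) keep their order
            else:
                # cost of matching neuron i of model A with neuron j of model B:
                # squared distance between their incoming weight vectors
                cost = ((wa[:, None, :] - wb[None, :, :]) ** 2).sum(axis=-1)
                _, perm = linear_sum_assignment(cost)
            fused.append(0.5 * (wa + wb[perm]))   # align B's neurons to A's, then average
        return fused

Replacing perm with the identity at every layer recovers vanilla averaging, which is the baseline the paper compares against.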

Strengths: Using the optimal transport problem to match the order of channels/neurons is an intuitive application of a traditional algorithm to deep learning, and it is shown to outperform vanilla averaging, which ignores order. The paper lists many use cases such as special and general model fusion, federated learning, pruning, and ensembling. The appendix contains many detailed experiments and results, which helps interested readers learn more.

Weaknesses:
1. The special and general models A and B, which focus on 1 class and 9 classes of MNIST respectively, seem a bit artificial to me. The other constraints introduced in the paper, such as no fine-tuning allowed and no joint training allowed (due to data privacy), are also a bit strange, as these approaches are widely used in common scenarios like pruning or multi-task learning. The method's usefulness seems to exist only under these strict and sometimes artificial assumptions. Under the most straightforward application (ensembling), the method does not bring improvement without fine-tuning, and even with fine-tuning, the improvement over vanilla averaging is very marginal; I would consider it to be within the error bar of CIFAR-10 classification (0.3%). The paper could benefit from running these experiments with multiple seeds and reporting means and standard deviations.
2. Also, in my opinion, "vanilla averaging" of weights does not form a strong baseline. Normally people do not average the weights of two identical architectures element-wise, which is unlikely to produce meaningful performance, as also shown in the paper. The reason, which the authors try to address using OT, is that all positions along the channel dimension of a convolutional or linear layer are equivalent, so averaging channels 1 and 2 from models A and B respectively makes no more sense than swapping them. The main baseline of the work should be vanilla ensembling, which the method only outperforms with fine-tuning, though. Vanilla averaging can serve as an illustration that the method works, but it is not competitive by itself. Even under the constraint that we want only one model and no training examples are given, there could possibly be more competitive baselines.
3. In the case of pruning, I would intuitively imagine that the channels that get aligned with the large model correspond to the channels that survived the pruning, which are essentially the same channels as in the small model. To what degree is this true? If it is largely true, why would a model benefit from fusion with (almost) itself? If not, why not?
4. In Figure S9(i), the caption says "all", which I suppose indicates that all layers are pruned together, but the legend says "conv_9". Which one is the case? In addition, I wonder whether the difference between vanilla and OT fusion still exists when fine-tuning is enabled, as is often the case in pruning.
5. It seems the paper did not specify the reason for using weight-based or activation-based alignment in each of the experiments.

=======================Post Rebuttal===========================
The rebuttal addresses many of my concerns, e.g., about the application cases and the prior practice of averaging, and my opinion has changed towards acceptance.

Correctness: Yes

Clarity: Largely. The math part is a bit abstract.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 4

Summary and Contributions: The authors propose to fuse neural network models by aligning weights or activations using optimal transport (via the Wasserstein barycenter) instead of vanilla averaging of corresponding weights. This approach outperforms vanilla averaging by a big margin and can be used as an alternative to model ensembling in constrained settings. It is suitable for federated learning and does not require layers of the same size.

Strengths: The proposed approach seems to work very well on the tested datasets and has the potential for great impact across a large set of applications. The comparison with ensembling methods is very appealing, as is the application to structured pruning.

Weaknesses:
- The authors only compare against vanilla averaging of weights, but there are similar proposed approaches that are important to consider. I would really like to see a performance comparison with [1], for example.
- The datasets and models on which the idea was tested are good enough to get the point across but quite small compared to modern models and applications.

[1] Wang, Hongyi, et al. "Federated learning with matched averaging." ICLR 2020.

Correctness: The methodology followed in the paper seems adequate.

Clarity: The paper is very well written, cohesive, and easy to follow.

Relation to Prior Work: The paper mentions the most relevant related work, but I would like to see more details on how it is substantially different from FedMA.

Reproducibility: Yes

Additional Feedback: This work has the potential for great impact across many applications.
- I would like more details on the matching of layers of different sizes. Does the Wasserstein barycenter approach still apply in this setting?
- I know you mention [1] in the related work, but the work is so similar that it is still not obvious to the reader what the difference is beyond the approach taken to obtain the optimal transport.

[1] Wang, Hongyi, et al. "Federated learning with matched averaging." ICLR 2020.