Review for NeurIPS paper: Personalized Federated Learning with Moreau Envelopes

NeurIPS 2020

Personalized Federated Learning with Moreau Envelopes

Review 1

Summary and Contributions: This paper presents a nice formulation of a federated learning with personalization objective, where a the average of the Moreau envelopes of the local objectives is minimized, and the argmin of the Moreau envelope / prox is used as the personalized weights for each local objective. This formulation has advantages over previously considered ones, and a straightforward FedAvg analogue is analyzed and shown to converge in a certain sense (see below). ------- update after author response ---------- Thank you for addressing many of my comments. I understand that you can't separate the (lambda / 2)\| theta_i - w \|^2 terms from the rest in the objective, just like with L2 regularization. The point I was making was that you should separate between the algorithm and the goal. For instance, ridge regression is an algorithm which you might use for least squares, but at the end of the day, you would report the mean squared error and you would *not* report the mean squared error plus the regularization term. The regularization was just part of the algorithm, not the goal. It seems to me that the (lambda / 2)\| theta_i - w \|^2 is analogous to a regularization term here, and is not really the thing you want to make small itself. It is true that you can't choose lambda=0 for your method, but you can choose lambda=exp(-100000) and your theory does say that smaller lambda is better. However, I suspect that there is some perspective on the problem where choosing lambda>>0 can be shown to be advantageous (just as choosing lambda>>0 can be for ridge regression). I think it would be valuable to identify this perspective.

Strengths: I like the formulation of the objective, which seems like a good conceptual balance between fully personalizing each local model and still leveraging information from other machines at the same time. The algorithm is simple and natural, and the experimental results show that this method appears to have significant practical benefits.

Weaknesses: I am not completely convinced by the theoretical results. Specifically, I am not sure that Theorems 1 and 2 prove the right notion of convergence. Conceptually, I think you want to show that sum_i f_i(theta_i) is small and/or sum_i f_i(w) is small. The theorem statements, I suppose, upper bound sum_i f_i(theta_i), but \| w - theta_i \| is involved, and I don't see why that quantity should be relevant. I don't really see any reason to care about \| w - theta_i \|; so what if you need to move far away from w to optimally personalize for one particular objective? In short, I think that the proposed objective pFedMe makes sense as a *training/surrogate objective*, but does not make as much sense as a criterion for evaluating a model, if that makes sense. I also think that an important consideration, which is missing here, is figuring out to what extent you need to leverage information from other clients in order to achieve optimal performance. On the one hand, choosing lambda=0 means just setting theta_i to the ERM on the local objective, which might do alright, but does not leverage the data / similarity with other clients. On the other hand, setting lambda = \infty allows for no personalization at all. Presumably, something in between is ideal, or else there is no reason for your formulation/method! According to Theorem 1, to the extent that I want to minimize F(w) - F*, I should just set lambda = 0 (since the rhs decreases with lambda). However, that is pretty boring since it means you should just do ERM locally on each client. I think the theory could be greatly improved by: (a) directly analyzing sum_i f_i(theta_i) and perhaps also sum_i f_i(w), i.e. the actual objective of interest (b) showing that increasing lambda can decrease f_i(theta_i) (perhaps this would require additional assumptions about the objectives) In the experiments, I am not completely clear with what is being plotted for the personalized/global models. Is the personalized model plotting average across the local objectives of the loss on the personalized parameter theta_i, and the global model is the average across the local objectives of the loss of the global parameter w? Perhaps I missed something, but I don't think this is precisely explained. Also, for the experiments, assuming I was correct in my interpretation above, I think it would also be good to show the performance of the personalized models for Per-FedAvg (i.e. the local weights after taking an SGD step from the global params). I realize this might not make a huge difference, but it seems like the right thing to compare with pFedMe (PM).

Correctness: To my knowledge, the results are correct.

Clarity: The paper was clearly presented, see above.

Relation to Prior Work: This paper discusses its relationship to prior work. Since the formulation of the objective is changed, it is somewhat difficult to compare convergence rates between this and prior methods (e.g. PerFedAvg), however a more explicit comparison might be helpful here. In particular, I am not sure how to understand the sentence lines 201-203 which says that prior methods can't get quadratic speedup--presumably those methods analyzed a different objective (maybe like eq (1))? More explanation here would help.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper tackles the task of federated learning on non-iid data distributions over the agents. The approach focuses on local models, with a additional regularization term enforcing proximity to a consensus model. The chosen regularization is a quadratic term, leading to a convenient reformulation as a distributed proximal problem. Detailed theoretical bounds are provided as well as several empirical evidences.

Strengths: The paper is very well written: the problem is clearly stated, the proposed solution is well motivated and analyzed. Provided results seem encouraging regarding the relevance of the algorithm.

Weaknesses: This problem has already been investigated in decentralized frameworks. Although different from federated learning, the proximity of the methods in both domains should be discussed (see, e.g., Decentralized collaborative learning of personalized models over networks, P Vanhaesebrouck, A Bellet, M Tommasi - 2017)

Correctness: The claims are correct as well as the overall analysis. At line 167, the author chooses a parameter to ensure strong convexity of h. However, the assumption on f_i is done in expectation: in some cases where the subset is poorly drawn, would not \lambda be very large to guarantee this property? What would be the dependence of \lambda in the expectation/variance term to ensure such property with high probability?

Clarity: The paper is very well written. Some bounds contain a lot of details, but simpler corollary and extensive discussions are provided.

Relation to Prior Work: Related work is well discussed, both in the dedicated section and when discussing results. As mentioned earlier, maybe some additional discussions comparing federated and decentralized results would be enlightening.

Reproducibility: Yes

Additional Feedback: I took some time to browse the supplementary material but I did not have time to carefully read it entirely, it is possible I missed something answering my concerns. I would be happy to modify my review if such event occurred.

Review 3

Summary and Contributions: This paper introduces a novel personalised federated learning algorithm, pFedMe, which relies on Moreau envelopes. The authors prove convergence bounds in the strongly convex and smooth non-convex settings. Experiments on real and synthetic data in the convex (logistic regression) and non-convex (shallow neural network) settings demonstrate improved performance with respect to state-of-the-art.

Strengths: - This paper reformulates previous personalisation ideas stemming from meta-learning approaches (notably Per-FedAvg) in the well-studied proximal framework. The new algorithm is more generic (agnostic to the local optimiser) and at the same time simpler to understand and analyse. This is a significant and novel contribution. It is very relevant to the NeurIPS community, as reformulations help to generate new ideas. - The convergence analysis relies on standard assumptions (avoiding the idealistic gradient boundedness) and yields state-of-the-art results. - Experiments on real and synthetic data show better results than previous personalised algorithms and FedAvg, and are quite compelling.

Weaknesses: The convergence analysis seems to rely strongly on a very particular server aggregation with a lag term controlled by a beta term. In particular, beta needs to be much larger than 1 to reach the SOTA speedup (Corollaries 1 and 2). Intuitively, the update rule with beta larger than 1 could yield some instabilities, and indeed. one observes in Fig 5 from the appendix that a larger beta yields a more instable optimisation in the strongly convex case. The authors state that the learning rate must be tuned very carefully in this case. The drawback is that this brings yet another hyper parameter to tune, which is costly in the FL setting.

Correctness: - The theoretical claims seem correct. - Regarding the empirical methodology, the proposed algorithm used K*R local steps instead of R steps for the other algorithms. I would have appreciated a comparison of their algorithm with fedAvg and per-FedAvg run with K*R local steps, to better understand if the improved performance comes from the additional local computations or from the inherent better formulation. - It would also have been better to use a more realistic FL dataset than MNIST, for instance the LEAF benchmark. - If I understand things clearly, all experimental performances are means of local test performances. How are these averages computed? Is there a weighting with respect to the number of samples? How do the local test performance distributions look like with the different algorithms? - It is unclear whether the hyper parameters were optimised on the local test sets or on local validation sets.

Clarity: The paper is clear and well written. I have spotted very few typos (see additional comments).

Relation to Prior Work: The authors discuss well previous work overall.

Reproducibility: Yes

Additional Feedback: - The global model of pFedMe is lagging in the synthetic experiments (Fig 2), whereas it’s comparable to FedAvg in MNIST (Fig 1). Any idea why it is the case? Is it due to the varying number of clients? Or to a lower sampling rate (S/N = 10% vs 50% ? - Line 175, typo: let theta « is » -> « be » - Line 179, « let assumption 1(a) hold » (no s