__ Summary and Contributions__: This work addresses distributed learning among agents: even though each agent's training data are different, the agents can share their learnt model parameters over a neighborhood graph. The authors propose a weight training method that is resilient to any number of Byzantine agents in the network, who may adversarially send model updates that hamper the normal agents. The method ensures that even in the worst case, the regret of a normal agent is no worse than the baseline of non-cooperative training. The approach is validated in experiments, where it outperforms model averaging that does not filter for adversaries.
After reading author feedback: I thank the authors for the clarification. I maintain my positive score.

__ Strengths__: The problem studied could be useful to the community. The core idea is intuitive: model updates from other agents are only considered if they evaluate to a lower regret on the agent's local data.

__ Weaknesses__: Some aspects of the problem setup do not seem to be critical to the solution method. For example, the entire connectivity graph is presumably known, but no particular property of the graph (e.g., its Laplacian matrix) is used. In the experiments, it seems all the graphs considered are complete graphs.

__ Correctness__: I briefly checked some of the proofs in the appendix and they seemed correct to me.

__ Clarity__: The paper is well written and easy to follow. The digit recognition experiment was interesting, and it might be better to include more description, and possibly Fig. 8, in the main text.

__ Relation to Prior Work__: There is adequate coverage of related work. The difference from previous work is clear. While the problem the authors are solving is different from "consensus", it would be good to include some references from the literature and highlight differences / make connections if possible. One example reference from that literature is Olfati-Saber, Reza, J. Alex Fax, and Richard M. Murray. "Consensus and cooperation in networked multi-agent systems." Proceedings of the IEEE 95.1 (2007): 215-233.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper proposes a method for performing multi-task learning (MTL) in the distributed setting when Byzantine agents are present. The common approach to the distributed MTL problem can fail if there is even a single Byzantine agent, and the authors propose an alternative online weight adjustment rule that circumvents this issue. The authors prove regret bounds for their approach. In three different sets of experiments, their algorithm converges faster than other approaches.

__ Strengths__: The paper is well-written and reasonably easy to follow, albeit somewhat dense in terms of notation. The main result itself is interesting. The experiments presented are quite convincing.

__ Weaknesses__: I think the authors could do a better job comparing their work to the methods proposed in [5] and [16] (as described at the top of page 2). While it is clear that their approach is resilient in a greater range of settings, I would have liked to see how their method compares to those in the settings that the methods from [5, 16] were designed to handle. A discussion of the respective theoretical results or an empirical comparison would be useful.

__ Correctness__: The content in the main paper looks fine.
I did not go through the proofs in the appendix.

__ Clarity__: Yes. It is quite dense, but this is probably unavoidable. The trade-off is that the authors were able to provide a significant amount of intuition within the paper.

__ Relation to Prior Work__: Mostly, but see the comment under 'Weaknesses'.

__ Reproducibility__: Yes

__ Additional Feedback__: See comments under 'Weaknesses'.

__ Summary and Contributions__: This paper introduces a method for Byzantine resilient distributed multi-task learning. In particular, an online weight assignment rule is presented by exploiting the variation of the accumulated loss. The property of the approach is analyzed under the setup of convex models. The effectiveness of the approach is demonstrated with regression and classification tasks.

__ Strengths__: A distributed MTL method is presented and some theoretical analysis is provided for the optimization.

__ Weaknesses__: First, the scope of this paper is limited. The existing deep multi-task learning literature is entirely ignored, e.g., "An Overview of Multi-Task Learning in Deep Neural Networks" by Sebastian Ruder. On the other hand, deep distributed learning is also not considered in this work, such as "Large Scale Distributed Deep Networks" by Jeffrey Dean et al. and "FedNAS: Federated Deep Learning via Neural Architecture Search" by Chaoyang He et al. Both aspects should be incorporated into the paper to provide a complete picture of the development of distributed multi-task learning.
Second, the proposed method relies on a strong convexity assumption. Thus, when using deep multi-task representation learning, the framework is not applicable because of non-convexity. On the one hand, the paper should discuss alternatives to the loss-based weight optimization in such cases; a related work is "Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics" by Alex Kendall et al. On the other hand, more analysis with deep structures should be added to improve the applicability to real-world multi-task big-data learning, such as "Cross-Stitch Networks for Multi-Task Learning".
Third, more experiments related to deep MTL should be provided to show the method's ability in real-world applications, especially for the digit classification. Comparisons with deep distributed learning should also be added to show the superiority of the proposed method.

__ Correctness__: Correct under some assumptions.

__ Clarity__: Good

__ Relation to Prior Work__: Some important related work is not provided.

__ Reproducibility__: Yes

__ Additional Feedback__: After reading the feedback, I still think it is necessary to discuss the applicability to non-convex cases thoroughly, rather than only providing an MNIST experiment with a simple CNN setup.

__ Summary and Contributions__: The paper proposes a method for distributed multi-task learning that is resilient to an arbitrary number of Byzantine agents, including the extreme case where all neighbors are Byzantine. The method is simple, theoretically sound, and has excellent computational complexity. Empirically, it appears to work well in a variety of settings, though I have some concerns about the experimental setup.
Update: I have read the author response, and thank the authors for the clarification of what it means to be Byzantine.

__ Strengths__: Despite a fairly lengthy derivation and a few approximations along the way, the idea underlying the proposed method is simple and intuitive: if a node j wants to determine if a neighbor k is trustworthy, it should evaluate the loss function using k's parameters but on j's data, and filter k out if the loss turns out to be too high. I consider the simplicity of the core idea to be a major strength of the method.
Furthermore, the method works in the presence of an arbitrary number of Byzantine agents, and achieves this without requiring the user to input a tuning parameter that depends on the number of Byzantine agents (which seems to be a weakness of previous methods).
Another major strength of the proposed method is that it automatically reduces to the non-cooperative (i.e., isolated) case when all neighbors are Byzantine. Overall, the proposed method has nice properties and is quite elegant.
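As I understand it, the filtering idea can be sketched roughly as follows. This is only an illustrative toy (the function names, the additive `threshold` rule, and the quadratic loss are my own assumptions, not the authors' actual update rule):

```python
import numpy as np

def filter_neighbors(local_loss_fn, own_params, neighbor_params, threshold):
    """Keep only neighbors whose parameters, evaluated on the agent's
    *local* data, achieve a loss not much worse than the agent's own.
    (Hypothetical sketch of the loss-based filtering idea.)"""
    own_loss = local_loss_fn(own_params)
    return {
        k: params_k
        for k, params_k in neighbor_params.items()
        if local_loss_fn(params_k) <= own_loss + threshold
    }

# Toy example: a quadratic local loss around agent j's private optimum.
target = np.array([1.0, -2.0])
loss = lambda w: float(np.sum((np.asarray(w) - target) ** 2))

own = np.array([1.1, -1.9])
neighbors = {
    "honest":    np.array([0.9, -2.1]),   # close to the local optimum
    "byzantine": np.array([50.0, 50.0]),  # adversarial update
}
kept = filter_neighbors(loss, own, neighbors, threshold=1.0)
# kept contains "honest" but not "byzantine"
```

The appeal is that the check uses only the agent's own data, so it degrades gracefully: if every neighbor fails the check, the agent simply trains non-cooperatively.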

__ Weaknesses__: The experiments section did not define what it means for an agent to be Byzantine (for example, what model is used to perturb messages sent to neighbors?). Will the method still perform well if the perturbation of messages is small?

__ Correctness__: To the best of my understanding, the derivation appears correct, though I did not check it thoroughly.
The empirical methodology has a flaw, in that it does not define what it means for an agent to be Byzantine.

__ Clarity__: The paper is easy to understand, though it is mathematically quite dense. To improve readability, it might be helpful to push some of the derivation details to the appendix.

__ Relation to Prior Work__: I am not familiar with the literature on distributed MTL, but the related work section does lay out a decent number of baselines. My understanding so far is that the Byzantine resilience feature of the proposed method is a highly original contribution.

__ Reproducibility__: Yes

__ Additional Feedback__: