This work is a solid contribution to the literature on distributed optimization/training. Three reviewers suggested acceptance, and one suggested weak reject. I believe the concerns raised by the latter reviewer are not major and can (and should) be addressed in the camera-ready version. I suggest acceptance, provided the authors are ready to address all reasonable issues raised by the reviewers. Some further comments:

Pluses:
- There is a consensus that the paper's contributions are quite clear despite the dense content. I recommend adding more details in the extra page available for the camera-ready version.
- The simplicity of the norm-based thresholding is a plus. Either state that, to the best of your knowledge, this is the first time such thresholding is used, or cite works where it was done before. (A minimal illustrative sketch of such a rule is included at the end of this meta-review.)

Issues:
- The choice of the step size in the numerical experiments is not described in sufficient detail. The competitor algorithm in Figure 1 can probably do better with step-size tuning, which appears to have been omitted; one would expect the full-gradient method to converge faster with the right step size.
- Please also provide plots of loss vs. iterations in the revised version; from the optimization perspective, these seem much more informative than accuracy vs. iterations.
- There is a consensus among the reviewers that the convergence guarantee in Theorem 1 is rather weak. Besides the error floor increasing with the number of workers, the bound depends on the inverse of the minimum singular value and on the gradient norm, both of which can be quite large. Moreover, the result holds under a restrictive incoherence condition on the training data.
- In their rebuttal, the authors acknowledge direct connections to the randomized linear algebra literature. I suggest elaborating on these in the camera-ready.
- There are earlier model-averaging algorithms proposed in the non-Byzantine setting that are similar but not discussed. I suggest including some coverage of these.
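
For concreteness, the sketch below shows what a generic norm-based thresholding rule for gradient aggregation typically looks like. It is only an illustration under my own assumptions (the function name, the threshold tau, and the zero-update fallback are hypothetical), not the authors' exact rule.

```python
import numpy as np

def norm_threshold_aggregate(worker_grads, tau):
    """Average only the worker gradients whose Euclidean norm is at most tau.

    worker_grads: list of 1-D numpy arrays, one gradient vector per worker.
    tau: scalar threshold on the gradient norm (a tuning parameter here).
    """
    kept = [g for g in worker_grads if np.linalg.norm(g) <= tau]
    if not kept:  # all workers screened out: fall back to a zero update (an assumption)
        return np.zeros_like(worker_grads[0])
    return np.mean(kept, axis=0)

# Toy usage: two honest workers and one Byzantine worker sending a huge gradient.
honest = [np.array([0.1, -0.2]), np.array([0.12, -0.18])]
byzantine = [np.array([50.0, 50.0])]
update = norm_threshold_aggregate(honest + byzantine, tau=1.0)
print(update)  # close to the mean of the honest gradients
```

The appeal noted above is exactly this simplicity: the aggregator needs only per-worker gradient norms, with the threshold as the single hyperparameter.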