Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
My main comment is on the applicability of such robust algorithms in real world scenarios. Can or should the adversarial cases listed in the paper, e.g., power outages and communication failures be modeled as *worst* case attacks? In the seminal work by Lamport the motivation for such worst case attacks are clear. Can the authors show simulations practical cases failures in distributed nodes can be modeled as Byzantine attacks or elaborate more to this point in the paper?
This paper presents DETOX, a new framework for improving resistance to Byzantine compute nodes during distributed optimization. The framework utilizes existing robust aggregation techniques, as well as redundancy, within the proposed framework to make better use of resources toward mitigating the effects of Byzantine compute nodes. The paper is very well written. Presentation of existing methods and description of their framework is very clear. The algorithm is presented clearly with minimal assumptions on readers knowledge of distributed training. The theory is sound and shows improvements in performance over multiple relevant metrics. The framework is, from a software design point of view, substantially more complex (cyclomatic complexity may capture this observation) than existing methods and may make adoption of the method more difficult for certain system architectures. The method appears to do quite well when q=5 number of compute nodes are Byzantine, but how does performance suffer when using DETOX if epsilon is small (epsilon=q=0, q=1)? How do other methods (including no robustness and just redundancy) do outside the DETOX framework when epsilon is small? It may make sense to adopt simpler robust methods when epsilon is small. This is minimally addressed with experiments that employ the weaker reverse gradient attack, but having some experiments with varying epsilon may help system developers/architects identify whether DETOX is the right framework for their infrastructure/existing system. Additional thoughts after authors response: I'm a bit confused by their response regarding variance. For p=45, for vanilla aggregation, the PS will average over 45 batches of size b, giving an "effective" batchsize of 45*b, whereas with DETOX, due to redundancy (r=3), we'll see an "effective" batchsize of 15*b. Additional thoughts having gone over the paper again: 1. I think the paper could use more clarity around how their method compares to other methods in terms of resource requirements. It's mentioned in passing (section 5.1 and 5.3) but I think it could be made a bit more explicit. In particular, in section 5.1, do they consider an r=3 group of nodes a compute node? Is this why DETOX computes 3 times as many gradients as the compute nodes in the vanilla aggregation case? 2. Greater clarity around how DETOX doesn't increase variance (for fixed p). My current understanding: For p=45, for vanilla aggregation, the PS will average over 45 batches of size b, giving an "effective" batchsize of 45*b, whereas with DETOX, due to redundancy (r=3), we'll see an "effective" batchsize of 15*b. 3.The paper could benefit from illustrating the problem (Byzantine node failures) with concrete examples (in particular the adversarial setting, unless this is purely hypothetical). I also think this would help make the work more interesting to a wider audience.
Originality/significance: The idea of hierarchy in workers (majority vote, then aggregation, then robust aggregation) as detailed in figure 1 is attractive as it only adds a linera (in the number of base nodes) overhead. However, the use of a majority vote in the base layer might lead to a big loss in terms of variance reduction, compared to performing robust aggregation straight from the base layer. Quality/clarity: the paper is fairly well-written and is well suited for the larger audience of NeurIPS