Review for NeurIPS paper: Optimal and Practical Algorithms for Smooth and Strongly Convex Decentralized Optimization

NeurIPS 2020

Optimal and Practical Algorithms for Smooth and Strongly Convex Decentralized Optimization

Review 1

Summary and Contributions: The paper proposes a new algorithm, APAPC, for decentralized gossip-based optimization of strongly convex objectives that is optimal both in gradient complexity and in communication-rounds complexity. That is, it both takes advantage of acceleration in terms of the condition number of the objective and depends optimally on the spectral gap of the gossip matrix used by the decentralized optimization method.

Strengths: The theory seems solid, and the problem that the paper sets out to address is well motivated by the desire to avoid dual gradients of the local loss functions. The proposed algorithm appears to compare well to the competitors.

Weaknesses: The proposed algorithm seems like a straightforward combination of the prior approaches that were optimal for gradient complexity and communication-round complexity via the method of operator splitting. So one could perhaps criticize this work as being incremental.

Correctness: The theoretical results appear to be correct. While the experiments are valid methodologically, what is missing here is an actual wall-clock-time comparison between the different algorithms for training on a real decentralized network. As it stands, we don't know how this approach would compare with MSDA and the other competitors in a practical setting, because it is not clear how the "# of communication rounds" and "# of gradients" computed translate to actual time taken by the decentralized solver. The networks chosen for evaluating the decentralized method seem a bit unnatural. Were these used in some previous work? If so, it should be cited; if not, why wasn't a network topology from a standard decentralized learning paper (such as a hypercube or cycle) tried?

Clarity: The paper is written well, but I worry that readers who are unfamiliar with the monotone operator framework will have difficulty understanding it. A lot of definitions are dropped on the reader at the top of page 5, without really giving much intuition about what these definitions are doing. But perhaps this is unavoidable.

Relation to Prior Work: The paper does a good job of situating itself in relation to prior work.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper considered the problem of decentralized optimization of smooth and strongly convex objectives, and proposed the first algorithm that is optimal in both communication complexity and gradient evaluation complexity, and does not require evaluation on dual gradients. Empirical comparison with state-of-the-art methods are also presented, demonstrating the effectiveness of the proposed approach.

Strengths: Clear theoretical justification and empirical demonstration of the effectiveness of the proposed approach.

Weaknesses: NA

Correctness: Theoretical results are clearly stated and seem correct. Empirical results look reasonable.

Clarity: This paper is well written.

Relation to Prior Work: Related work are extensively discussed.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper considers smooth and strongly-convex decentralized optimization problems. The authors propose a novel primal gossip-based optimization algorithm (OPAPC) which is provably optimal w.r.t. relevant oracle complexity lower bounds. Existing optimal algorithms for this setting rely on access to dual gradient which, in absence of any structure, can be very expensive. The suggested method uses primal (1st order) access exclusively---forming the main contribution of the paper. The main algorithmic idea is to use a generalized forward backward-like algorithm with the corresponding positive-definite operator being chosen according to the gossip matrix (so as to comply with the underlying decentralized structure). Acceleration w.r.t. the condition number of the individual functions and of the network graph is then obtained through Nesterov extrapolation and Chebyshev polynomials, respectively. --- update after the authors' response --- I have read the authors' response and other reviewers' comments carefully and taken them into account in my final evaluation. Good luck!

Strengths: The decentralized setting has gain a lot of popularity in ML in recent year for natural reasons. The suggested method OPAPC closes the theoretical gap for the class of primal algorithms for this setting and w.r.t. the oracle complexity bounds derived in Scaman et al. (2017). Although the actual optimality is obtained through integration of existing known algorithmic building blocks, the main ingredient (PAPC) is simple and elegant. Lastly, preliminary empirical results seem promising.

Weaknesses: As mentioned earlier, all the ingredients of the proposed algorithm are well-known and have been studied in previous works--in this sense, the technical contribution of the work is somewhat incremental. The choice of the definite-positive operator according to the gossip matrix is the valuable bit. My other two reservations are general to the context of the work: -- The only *physical* constraint imposed on the gossip-matrix, say W, is to respect the underlying network graph. That being the case, I find optimality w.r.t. W somewhat artificial. A given network graph can accommodate any choice of valid W. - As far as I'm aware, the applicability of such algorithms for real-life decentralized applications is yet-to-be-proven on scale.

Correctness: I haven't examined the theoretical convergence analysis in detail, but the claims seem sound. The empirical comparison seems fair and matches in breadth the venue.

Clarity: The paper is well-written and relatively easy to follow. My only suggestions: -- The rate table provided in the appendix (a standard by now, for such papers) is very convenient and is better placed in the main body to provide a quick context for the paper. -- Personally, I think some parts of the introductory sections are very repetitive and can removed and used instead to promote more interesting technical details of the analysis.

Relation to Prior Work: I find the the related work section sufficiently inclusive and useful.

Reproducibility: Yes

Additional Feedback: -- in the abstract, "The our" -- Section 4.1, perhaps further elaboration on (6) would make this section clearer. -- L231, redundant parenthesis