Paper ID: 5049

Title: DINGO: Distributed Newton-Type Method for Gradient-Norm Optimization

In this paper, the authors propose a distributed Newton method for gradient-norm optimization. The method does not impose any specific form on the underlying objective function. The authors present convergence analysis for the method and illustrate its performance on a convex problem (in the main paper).

Originality: The topic of the paper, in my opinion, is very interesting. The paper presents an efficient Newton method that is motivated via the optimization of the norm of the gradient. As a result, no assumptions are made on the objective function. This is a key differentiating feature of the method as compared to other such methods.

Quality: The overall quality of the paper is very good. The motivation is clear, the theory is well thought out and presented, and the numerical results show the superiority of the method (on a convex problem). I state my minor comments/questions/concerns below.

Clarity: The paper is well-written and well-motivated. Below I state a few minor issues and concerns, as well as a few suggestions to improve the clarity and presentation of the work.

Significance: In my opinion, such distributed Newton methods are of interest to both the optimization and machine learning communities. This paper attempts to alleviate some of the issues of existing works (communication vs. computation balance, number of hyper-parameters, and sensitivity to hyper-parameters). I believe that this paper, after minor corrections and modifications, should be accepted for publication at NeurIPS.

Issues/Questions/Comments/Suggestions:
- "increasingly more frequently": is this phrase not redundant?
- "regardless of the selected hyper-parameters": The theory shows that this is indeed the case for the method. Of course, the method was designed, in a certain sense, so that this claim would be true. Nevertheless, it could be a bit misleading: although strict reduction is guaranteed with any hyper-parameter setting, this reduction could be very small if the hyper-parameters are chosen poorly. The authors should comment on this in the manuscript.
- Related Work and Contribution, paragraph 1: the authors should state more clearly which disadvantages apply to each method.
- Line 66: "derived by optimization of the": missing word?
- Deriving DINGO from the optimization of the gradient norm is interesting and has many advantages, as stated by the authors (e.g., no restrictions on the functions). However, does this approach have any limitations/drawbacks? For example, convergence to a local maximum or saddle point? The authors should add a comment about this in the main paper.
- The discussion about the hyper-parameters is interesting and shows the strength of the proposed approach. I suggest that the authors present this information in a table: for each method, they could clearly show the associated hyper-parameters. Moreover, in the table the authors could clearly state the per-iteration cost (in terms of communications) of each method.
- The drawback of the current analysis of DINGO is that it requires exact solutions to the sub-problems. The authors clearly state this. What is the limitation? How would the analysis change to account for inexact solves? The authors should comment on this in the manuscript.
- Per-iteration cost of DINGO: the authors should discuss the per-iteration cost of DINGO in the manuscript, both in terms of communication and computation, and compare it with existing methods. If I am not mistaken, in the worst case DINGO requires 4 rounds of communication per iteration, plus the number of communications associated with satisfying the line-search condition.
- Line search: usually, the Armijo condition does not have the factor of 2. The authors should comment on this in the paper.
- The DINGO algorithm is complicated: Algorithm 1 is complicated (with the three cases). The authors may want to give a high-level description of the method before they present the actual algorithm.
- Effect of the theta parameter: the theta parameter controls the search direction chosen by the algorithm. Essentially, it ensures that the search direction is not orthogonal to the gradient of (4), and that it is a descent direction. The authors should comment on this in the paper and discuss the role of theta.
- Assumptions 3-6: the authors should expand the discussion of these assumptions. Why are they realistic? Why are they necessary?
- In the experiment presented in the main paper, the function is strongly convex, and thus all iterates fall into Case 1. The authors should discuss the effect of Case 2 and Case 3 iterates on the performance of the method.
- Future work: what do the authors mean by "more efficient methods"?
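To make the line-search comment concrete, here is a minimal sketch of an Armijo backtracking search applied to the squared gradient norm (toy objective and constants of my own choosing, not the authors' exact condition). The factor of 2 the paper uses plausibly arises because the directional derivative of $\|g(x)\|^2$ along $p$ is $2\,p^T H g$:

```python
import numpy as np

def armijo_backtrack(grad, x, p, slope, rho=1e-4, shrink=0.5, max_iter=50):
    """Backtracking line search on phi(x) = ||grad(x)||^2.

    `slope` is the directional derivative of phi along p, i.e.
    2 p^T H g for Hessian H -- this is where the factor of 2 enters.
    """
    g = grad(x)
    phi0 = g @ g
    alpha = 1.0
    for _ in range(max_iter):
        g_new = grad(x + alpha * p)
        if g_new @ g_new <= phi0 + rho * alpha * slope:
            return alpha
        alpha *= shrink
    return alpha

# Toy quadratic f(x) = 0.5 x^T A x, so grad(x) = A x and H = A.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x
x = np.array([1.0, 1.0])
g = grad(x)
p = -g                      # a descent direction for phi in this example
slope = 2.0 * p @ (A @ g)   # 2 p^T H g
alpha = armijo_backtrack(grad, x, p, slope)
assert np.linalg.norm(grad(x + alpha * p)) < np.linalg.norm(g)
```

With the conventional Armijo condition on $f$ itself, no factor of 2 appears; it enters only once the merit function is the squared gradient norm.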

Originality: As far as I understood, the paper proposes an extension of [15] to the distributed setup. This requires careful analysis, but not so many new ideas. I think a comparison with https://ieeexplore.ieee.org/abstract/document/8675539 should be added.

Quality: The numerical experiments seem to be convincing. I didn't check the proofs thoroughly, but the high-level intuition is not clear to me. As far as I can see, the method uses the inverses of the individual Hessians $H_{i,t}$, but not the inverse of the full Hessian $H_t$. I don't understand why this works, in light of the fact that $H_t^{-1} \neq \sum_i H_{i,t}^{-1}$. Also, it is not clear to me why the three cases considered are mutually exclusive.

Clarity: Here I have the same questions as in the quality part. It would be nice to provide more explanations. At the same time, I very much appreciate that the authors give some intuition and illustration for their quite complicated assumptions and for their connection to the standard assumptions.

Significance: I think the method can be useful in practice, especially since it has a smaller number of hyper-parameters and a wider range of applications, including non-convex functions.

=========== After rebuttal ============
I can't say that the authors addressed all my questions from the review, especially the ones in the quality part. But I found some answers in the GIANT paper, so I leave my score unchanged.
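The point that the inverse of an averaged Hessian is not the average of the local inverses is easy to verify numerically. A minimal illustration with two toy symmetric positive-definite "local Hessians" (matrices of my own choosing, unrelated to the paper's experiments):

```python
import numpy as np

# Two symmetric positive-definite "local Hessians"
H1 = np.array([[2.0, 0.0], [0.0, 1.0]])
H2 = np.array([[1.0, 0.0], [0.0, 4.0]])

H = 0.5 * (H1 + H2)                              # full (averaged) Hessian
avg_inv = 0.5 * (np.linalg.inv(H1) + np.linalg.inv(H2))
full_inv = np.linalg.inv(H)

# The two differ, so averaging local Newton directions H_i^{-1} g
# is not the same as the exact Newton direction H^{-1} g.
assert not np.allclose(avg_inv, full_inv)
```

Here `avg_inv` is `diag(0.75, 0.625)` while `full_inv` is `diag(2/3, 0.4)`, so any intuition for why the averaged local directions still make progress has to come from the analysis rather than from an algebraic identity.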

As stated in question 1, I like the idea of not using the strong convexity assumption. My main concern is the computational cost of computing $H_{t,i}^\dagger g$ and/or $\widetilde H_{t,i} g$. It seems (Section 4) that iterative methods are used to compute these matrix-vector products; however, the convergence analysis seems to require exact computation. The authors are also encouraged to elaborate more on the communication cost associated with the line-search step.

Other comments:
1. The DINGO algorithm involves a few hyper-parameters. It would be good if the authors could discuss how these hyper-parameters are tuned so that the algorithm achieves better performance.
2. I am not sure whether the step length $\alpha_t$ can eventually be chosen as 1.
3. For the convergence of the algorithm, many assumptions are needed (e.g., Assumptions 1, 2, 3, 5). I am not sure whether the example considered in Section 4 satisfies these assumptions.
4. In the implementation, the authors use iterative solvers without preconditioners. If the sub-problems have bad condition numbers, I do not know whether these iterative methods can obtain solutions with sufficient accuracy to guarantee the progress of the algorithm.
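On the solver concern in point 4: such sub-problem solves are typically done matrix-free, touching the Hessian only through Hessian-vector products. A generic unpreconditioned conjugate-gradient sketch (my own illustration, not the authors' implementation) on a deliberately ill-conditioned toy system:

```python
import numpy as np

def cg_solve(hvp, g, tol=1e-8, max_iter=200):
    """Conjugate gradient for H x = g, using only Hessian-vector
    products hvp(v); no explicit (pseudo-)inverse is formed."""
    x = np.zeros_like(g)
    r = g - hvp(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        denom = p @ Hp
        if denom <= 0:          # safeguard: H not positive definite along p
            break
        alpha = rs / denom
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Ill-conditioned toy Hessian (condition number 1e6): the regime the
# comment about missing preconditioners worries about.
H = np.diag([1.0, 1e6])
g = np.array([1.0, 1.0])
x = cg_solve(lambda v: H @ v, g)
assert np.allclose(H @ x, g, atol=1e-5)
```

Without preconditioning, the iteration count of such a solver generally grows with the square root of the condition number, which is exactly why the accuracy-versus-cost trade-off in point 4 matters for the convergence guarantees.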

Strengths/weaknesses/questions/suggestions:

1. The paper is well-written and properly structured.
2. In Algorithm 1, one needs to calculate the product of the pseudo-inverse of $\hat{H}$ with the vectors $g_t$ and $\tilde{g}_t$. This can be costly; it would be clearer if the authors elaborated on it.
3. In equation (6), it is expensive to calculate $P_{t,i}$: there is an inverse calculation (the required quantity can also be obtained by solving a system of equations, which is still expensive), and each iteration of the algorithm contains the expensive parts mentioned in this and the previous point, plus the "Line Search".
4. In Algorithm 1, there is no explanation of how the "Line Search" is performed. Is it done in the distributed environment, or only on the master machine? Also, the "Line Search" in this paper is very expensive (a full gradient calculation is needed at each step), and if it is done on the master node, then in some cases the master node may become very busy; equivalently, the algorithm would not scale well (by Amdahl's law).
5. Assumption 3 seems to be a strong assumption. The assumption in Lemma 1 that "the Hessian matrix $H_{t,i}$ is invertible" is strong too. Is the latter assumption based on strong convexity? The reviewer does not see why this assumption should hold.
6. In several parts of the paper it is claimed that the method is "a novel communication efficient distributed second-order optimization method"; however, there is no analysis of the number of communication rounds required to reach the solution that would demonstrate this efficiency (similar to the DiSCO algorithm).
7. The reviewer could not find any information about how DINGO is distributed in practice. Do the authors use GPUs or CPUs for their calculations? The reviewer was eager to see the code for the proposed algorithm; however, no code was available to check how DINGO is distributed.
8. In Figure 1, it is suggested to compare the results based on wall-clock time, not just communication rounds. In some cases the expensive calculations may be done on the master node, so fewer communication rounds would be needed while the wall-clock time would be very high. Moreover, the number of communication rounds is roughly equivalent to the number of iterations (not a good measure), so it is suggested to compare the distributed algorithms using the true measure, i.e., wall-clock time.
9. In Figure 1, rows 1, 2, and 4, did the authors include the cost of the line search when the x-axis is "Communication Rounds" for DINGO?
10. In Figure 1, what do the authors mean by "Worker"? Details should be provided.
11. It is clear from row 3 of Figure 1 that the "Line Search" in Algorithm 1 is expensive (many gradient evaluations are needed) and is not efficient if the authors want a well-scaled algorithm.

============= After Rebuttal =============
I read the rebuttal carefully, and the authors answered my main concerns. I am generally satisfied with the rebuttal, and thus increased my score.
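To illustrate points 4 and 9: in a typical master/worker implementation (a schematic simulation of my own, not the authors' code), every trial step of a backtracking line search requires one fresh round of gradient aggregation, so the line-search cost shows up directly in the communication-round count:

```python
import numpy as np

# Simulated workers: each holds a local quadratic f_i(x) = 0.5 x^T A_i x.
A_locals = [np.diag([1.0, 16.0]), np.diag([1.0, 24.0])]
local_grads = [lambda x, A=A: A @ x for A in A_locals]

rounds = 0

def full_gradient(x):
    """Master aggregates worker gradients: one communication round."""
    global rounds
    rounds += 1
    return np.mean([g(x) for g in local_grads], axis=0)

x = np.array([1.0, 1.0])
g = full_gradient(x)
p = -g
alpha, phi0 = 1.0, g @ g
# Simplified acceptance test (plain decrease of the squared gradient
# norm, not a full Armijo condition): each trial needs a fresh
# aggregated gradient, i.e. one more round of communication.
while (gn := full_gradient(x + alpha * p)) @ gn > phi0:
    alpha *= 0.5
# Total rounds = 1 (initial gradient) + number of trial steps.
```

On this toy problem five trial steps are needed, so a single outer iteration already spends six aggregation rounds, which supports the suggestion to report the line-search communication explicitly.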