NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6079
Title: Distributed estimation of the inverse Hessian by determinantal averaging

Reviewer 1

Quality & Clarity: This paper is well-written. The exposition throughout the text is clear and appropriate. Section 2 is particularly neat!

Originality: The paper tackles a central problem in distributed optimization and numerical linear algebra: estimating the inverse Hessian matrix from local machines. This paper presents an elegant solution (a weighted average) to the bias problem faced when estimating the inverse Hessian using a simple average of the local estimators.

Significance: The two discussed applications, Newton's method and uncertainty quantification, are very relevant to ML. In fact, the contribution to distributed Newton's method alone would probably have significant impact on the community, and the method could also generalize to other settings/applications.

---- I acknowledge that I have read the authors' response, as well as the other reviews. Thanks for the response: I have read the paper again and updated my original review and score.
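The bias problem this review describes can be sketched numerically: the inverse of an unbiased local Hessian estimate is a biased estimate of the inverse Hessian, while weighting each local inverse by the determinant of the local Hessian (the paper's determinantal weights) corrects this as the number of machines grows. The following is a minimal sketch of our own (not the authors' code), assuming i.i.d. Bernoulli subsampling of a fixed dataset and a plain quadratic loss:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, p, m = 3, 100, 0.3, 5000        # dimension, data size, sampling prob, machines

A = rng.standard_normal((n, d))       # fixed dataset
H = A.T @ A / n                       # full Hessian of the quadratic loss
H_inv = np.linalg.inv(H)

plain = np.zeros((d, d))              # simple average of local inverses
num, den = np.zeros((d, d)), 0.0      # determinantal-average accumulators
for _ in range(m):
    mask = rng.random(n) < p          # Bernoulli subsample for this "machine"
    Hs = A[mask].T @ A[mask] / (p * n)    # local estimate, E[Hs] = H
    Hs_inv = np.linalg.inv(Hs)
    plain += Hs_inv / m
    w = np.linalg.det(Hs)             # determinantal weight
    num += w * Hs_inv
    den += w

detavg = num / den
err_plain = np.linalg.norm(plain - H_inv) / np.linalg.norm(H_inv)
err_det = np.linalg.norm(detavg - H_inv) / np.linalg.norm(H_inv)
print(err_plain, err_det)             # the determinantal average is markedly closer
```

The simple average converges to E[Hs^{-1}], which overestimates H^{-1}; the determinantal average converges to H^{-1} itself as m grows, which is the point of the weighting.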

Reviewer 2

Additional Comments:
• Overall, the article is well-written and structured. It has a clear contribution and also significant theoretical justification. There are only a few typographical and grammatical errors: Line 17: "tranformation" → "transformation"; Line 133: "its entries is of …" → "its entries are of …"; Line 146: "accross" → "across"; Line 150: "emprical" → "empirical".
• In line 54, it is better to define in Theorem 2 along with .
• In line 64, the upper bound of the summation operator seems incorrect.
• Although plotting the estimation error in Figure 1 is a useful way to justify the superiority of the proposed method, using different markers or colors would make the graph more readable.
• Lines 92 and 104: sections are referenced incorrectly.
• In line 93, I think should be since it is assumed that there are estimators, unless the purpose is to let it approach infinity to reduce the estimation error. If there is no constraint on the number of estimators, this means the subsamples are chosen with replacement. However, in line 158, the authors claim that their method is without replacement.
• In line 110, Sylvester's theorem deserves a reference.
• To be more illustrative, state your assumptions on the variables clearly in each part of the paper. For example, in line 183, specifying as an "independent random variable" is more illustrative than "independently random".
• The proof of Lemma 7 does not seem entirely correct. In Lemma 7, the authors prove two statements, "a" and "b", by induction. However, in the line corresponding to the inductive hypothesis, they use statement "b" to prove statement "a". It would be more acceptable to prove statement "b" beforehand and then use it as an axiom when proving "a", or to use assumptions and axioms that hold for both statements and prove them simultaneously.
• In line 206, defining is required to determine the bounds for the variable "x" in Lemma 8.
• The proof of Lemma 9 is somewhat confusing. For example, it is difficult to find the connection between the interval of "z" in line 224 and the bound of "x" in Lemma 8. Besides, it seems that by considering "" as a Bernoulli random variable with mean "" and "" as a constant, the equation in line 223 is no longer correct.
• There are similar works in the literature that attempt to improve the approximation of the inverse Hessian, such as [1], [2], and [3]. I would recommend that the authors cite related works with mutual concerns to further motivate their paper. Moreover, I suggest that the authors provide evidence of the advantages of their proposed approach over these related works in terms of performance and complexity.

References:
1. Gower, R., Hanzely, F., Richtárik, P. and Stich, S.U., 2018. Accelerated stochastic matrix inversion: general theory and speeding up BFGS rules for faster second-order optimization. In Advances in Neural Information Processing Systems (pp. 1619-1629).
2. Mokhtari, A., Ling, Q. and Ribeiro, A., 2016. Network Newton distributed optimization methods. IEEE Transactions on Signal Processing, 65(1), pp.146-161.
3. Bajovic, D., Jakovetic, D., Krejic, N. and Jerinkic, N.K., 2017. Newton-like method with diagonal correction for distributed optimization. SIAM Journal on Optimization, 27(2), pp.1171-1203.

Reviewer 3

Originality: This work builds upon previous work but derives several results that are interesting in their own right, including moment bounds for determinants and adjugates of random matrices.

Quality & Clarity: This paper is extremely clear and easy to read, which contributes significantly to its quality. The results are carefully proven and explained, and the proofs are easy to follow.

Significance: The ability to recover the inverse of a large matrix through distributed averaging and without bias has many important applications to machine learning problems, as the authors point out.

----- Questions -----
- Would it be possible, in some of the ML scenarios mentioned, to take advantage of the determinantal weights by sampling subsets of size k that yield high determinantal weights?
- Do you know if the identities derived in Section 2 can be generalized to other elementary symmetric polynomials?
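To make the question about the Section 2 identities concrete: one identity of this kind, E[det(Σ_i b_i a_i a_iᵀ)] = det(Σ_i p·a_i a_iᵀ) for independent Bernoulli(p) variables b_i, follows from the Cauchy-Binet expansion (the determinant is multilinear across distinct rank-one terms, so each b_i can be replaced by its mean). It is easy to verify by Monte Carlo; the check below is our own sketch, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, p, trials = 2, 6, 0.5, 200_000
A = rng.standard_normal((n, d))          # fixed set of n points in R^d

# Draw Bernoulli(p) inclusion masks for all trials at once.
masks = rng.random((trials, n)) < p
Ab = A[None, :, :] * masks[:, :, None]   # zero out the unsampled rows
S = np.einsum('tni,nj->tij', Ab, A)      # S_t = sum_i b_i a_i a_i^T

mc = np.linalg.det(S).mean()             # Monte-Carlo estimate of E[det(S)]
exact = np.linalg.det(p * (A.T @ A))     # multilinearity: replace each b_i by p
print(mc, exact)                         # the two agree up to Monte-Carlo noise
```

The same multilinearity argument applies minor-by-minor to the adjugate, which is what makes the determinantal weights average out correctly; whether it extends to other elementary symmetric polynomials of the eigenvalues is exactly the reviewer's question.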