NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Reviewer 1
The paper introduces a new family of randomized Newton methods, based on a prototypical Hessian sketching scheme that reduces the memory and arithmetic costs (a minimal illustrative sketch of such a step is given after this review). Clearly, the idea of using a randomized sketch of the Hessian is not new. However, the paper extends the known results in several ways: the proposed method achieves a linear convergence rate 1) under the relative smoothness and relative convexity assumptions (while remaining scale-invariant), 2) with practically any sketching technique, and 3) for any sketch size. These results also include the known results for Newton's method as a special case. The related work is adequately cited, and similar approaches from the existing literature, together with their weaknesses, are discussed in a short but concise way.

The results are technically sound and are explained simply and clearly. The proposed method outperforms the state of the art. The experiments section is well presented, and its results support the theoretical findings. The empirical evidence is limited to a logistic regression setup (with a variety of real data); it could have been expanded to other settings, either in the main text or in the supplement.

Overall, I enjoyed reading this submission. The paper is very clear about its focus and contributions, and all technical results and figures are accompanied by discussions and take-home messages. The proposed method can be attractive to practitioners since it is scale invariant (hence no tuning or preconditioning) and it outperforms the baseline. I recommend the acceptance of this paper.

Some minor comments:
- There is a typo in footnote 4 on page 5.
- "n" is defined as the Newton direction in (5), but it is also used to denote the number of data points on pages 6-7. Consider changing one of these to avoid confusion.
- line 259: "out run" or "outrun"?
- [12] and [13] are the same paper in the bibliography.

========== After author feedback ==========
I have read the author feedback and taken it into account in my final scores.
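To make the scheme being reviewed concrete, the following is a minimal sketch of a generic sketched Newton step of the kind described above; it is not the paper's exact method, the function and variable names (sketched_newton_step, grad, hess, sketch_size) are hypothetical, and a Gaussian sketch is assumed purely for illustration.

    import numpy as np

    def sketched_newton_step(x, grad, hess, sketch_size, rng=None):
        """One generic sketched Newton step (illustrative only): solve the Newton
        system restricted to the random subspace spanned by the columns of a
        Gaussian sketch S, then move along S @ lam."""
        rng = np.random.default_rng() if rng is None else rng
        d = x.shape[0]
        g = grad(x)                                  # gradient at x, shape (d,)
        H = hess(x)                                  # Hessian at x, shape (d, d)
        S = rng.standard_normal((d, sketch_size))    # random sketching matrix
        # Sketched Newton system: (S^T H S) lam = -S^T g; the pseudo-inverse
        # guards against a rank-deficient sketched Hessian.
        lam = np.linalg.pinv(S.T @ H @ S) @ (-S.T @ g)
        return x + S @ lam

With sketch_size equal to the full dimension (and an invertible Hessian) this recovers the exact Newton direction, while a small sketch only requires solving a sketch_size-by-sketch_size linear system per iteration, which is the memory and arithmetic saving referred to above.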
Reviewer 2
The main idea of this paper is to develop a sketching Newton algorithm for a class of functions that are strongly convex and have Lipschitz gradients in local norms, a class that has been studied in [20], but for proximal-Newton-type methods. The proposed algorithm converges to the solution of the problem at a global linear or sublinear rate, and this rate depends on the choice of sketching matrix. The following points are my concerns:

-- The efficiency of the method depends on the choice of the sketching matrix S. While Theorem 2 shows a linear convergence rate, it depends on \rho(x), a strong condition, and the authors did not show when it is strictly positive. It would also be interesting to see a theoretical trade-off between the dimension of the sketching matrix and the contraction factor in Theorem 2 (at least for a particular class of functions).

-- Clearly, the linear convergence stated in Theorem 2 is not surprising under Assumptions 1 and 2, and it is similar to that of gradient methods. This has been stated in [20] for an inexact/quasi-Newton method, though not thoroughly discussed there. This makes the contribution of the paper appear minor. The authors should carefully discuss this point.

-- The main technical proofs of the paper appear to be correct. I have checked the main proofs in the supplementary material.

-- The paper is generally well written and clearly structured. There are some small possible mistakes. Firstly, in equation (52) all norms should be replaced by squared norms. Secondly, the second equation of (76) does not look right to me, because \hat{P}(x) = H^{1/2} G(x) H^{1/2} by definition.

-- The paper provides only one class of experiments, on a regularized logistic regression problem. This problem satisfies the assumptions studied in other subsampled and sketching methods such as [6, 28]. Why do the authors not compare the proposed method with these existing ones?

-- The proof of Lemma 3 can be simplified as: f(x_{k+1}) = \min_\lambda T(x_k + S_k \lambda; x_k) \le T(x_k; x_k) = f(x_k) (see the short derivation written out after this review).

-- The result of Theorem 2 is not fully satisfactory. There is no theory to guarantee that \rho > 0, and it is not clear why \rho \le 1. The authors should also indicate how the sketch size influences \rho.

-- The assumption (17) of Theorem 3 seems to be too strong. A clear explanation is needed.
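For completeness, the one-line argument suggested above for Lemma 3 can be written out as follows. This uses only the review's notation and its stated assumptions: T(.; x_k) is the paper's local model with T(x_k; x_k) = f(x_k), x_{k+1} is obtained by minimizing the model over the sketched subspace, and f(x_{k+1}) coincides with the minimal model value, as the suggested one-liner presumes.

    f(x_{k+1}) = \min_{\lambda} T(x_k + S_k \lambda;\, x_k)
               \le T(x_k + S_k \cdot 0;\, x_k)
               = T(x_k;\, x_k)
               = f(x_k),

where the inequality only uses that \lambda = 0 is feasible, so the function values are monotonically non-increasing along the iterates.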
Reviewer 3
The biggest hole I can find here is the lack of a detailed, clear comparison with the Pilanci/Wainwright 2015 work on subsampled/sketched Newton, where convergence rates are given for both the sketch size > rank and sketch size < rank regimes. As far as I can see, this work is not particularly novel when compared with the Pilanci/Wainwright work, although I can believe that it arose independently (it does not seem exactly the same). Therefore, a clear discussion of the comparison would help in rating originality and significance.

Quality: The convergence rate looks more similar to that of a first-order method than a second-order method, and I believe that if the sketch size goes to 1, the method should reduce to gradient descent (Newton's method should be q-quadratic). Therefore the rates themselves are not super impressive. However, the math is clean, and for the few lemmas I checked, the proofs look sound.

Clarity: I think the mathematical contributions are very clearly explained and intuitive, which is the paper's main strength. In particular, the interpretation of the sketched step as a randomized projection, followed by simple linear-algebra consequences, is a very clean way of analyzing the convergence behavior (a small numerical illustration of this projection viewpoint is given after this review).

Comments after rebuttal:
- Upon reading the rebuttal and rereading the main paper, I do think I misunderstood some key parts of their algorithm, and the comment they made that gradient descent may not even descend in this framework is true. Hopefully the authors can find a way to emphasize this more when proposing the method (maybe with an example with all dimensions labeled)?
- The point about the square-root Newton step in the Pilanci/Wainwright work makes sense, and it does seem that they are mostly focused on local convergence rates, whereas the authors here are focused on global convergence rates. With the promised discussion included, I think the novelty is clearer.
- I do feel that the exact line search (which is not terribly computationally burdensome but is somewhat unconventional) should be discussed more in the main text rather than the appendix.

I have increased the overall score.
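To illustrate the randomized-projection viewpoint highlighted above, here is a small, self-contained numerical check. It is my own construction on a synthetic quadratic, not taken from the paper: for a quadratic with Hessian H, the matrix Z = H^{1/2} S (S^T H S)^+ S^T H^{1/2} associated with a sketched Newton step is an orthogonal projection, which is what makes the linear-algebra analysis of the error so clean.

    import numpy as np

    rng = np.random.default_rng(0)
    d, sketch_size = 8, 3

    # Synthetic positive-definite Hessian of a quadratic objective.
    A = rng.standard_normal((d, d))
    H = A.T @ A + np.eye(d)

    # Matrix square root of H via its eigendecomposition.
    evals, evecs = np.linalg.eigh(H)
    H_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

    # Random sketch and the associated matrix Z = H^{1/2} S (S^T H S)^+ S^T H^{1/2}.
    S = rng.standard_normal((d, sketch_size))
    Z = H_half @ S @ np.linalg.pinv(S.T @ H @ S) @ S.T @ H_half

    # Z is a (random) orthogonal projection: symmetric and idempotent.
    assert np.allclose(Z, Z.T)
    assert np.allclose(Z @ Z, Z)

    # Consequence for a quadratic: in the H^{1/2}-transformed coordinates, a sketched
    # Newton step maps the error e = H^{1/2} (x - x_star) to (I - Z) e, so the error
    # norm cannot increase at any iteration.
    e = rng.standard_normal(d)
    assert np.linalg.norm((np.eye(d) - Z) @ e) <= np.linalg.norm(e) + 1e-12

The expected value of Z over the random sketch is what governs the contraction factor in results of this type, which also connects to Reviewer 2's question about how the sketch size influences \rho.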