NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4908 Trivializations for Gradient-Based Optimization on Manifolds

### Reviewer 1

Lines 11-12 (and in other places): "the gradient of the exponential of matrices" -- the word "gradient" seems to be inadequate in this context, since gradients apply to functions that map to real numbers. From the appendix, it seems the authors here mean the gradient of a (real) function of the exponential of a matrix. Either way, in the end, the such gradients really only require the Fréchet derivative (or directional derivative) of the exponential map, and the adjoint of these differentials. This is indeed what the appendix derives, hence, perhaps the authors could reformulate in this sense? On a related note, Propostion 6.1 about the gradient of functions involving the matrix exponential: it is said after this proposition that the reference (Al-Mohy and Higham) gives an approximation of dexp. Perhaps it was a different paper, but my recollection is that this paper gives an exact formula, based on computing the exponential of the block matrix [X H ; 0 X] then extracting the top-right block, where X is the base point and H is the direction? This is mathematically exact, and gives machine precision in practice. Line 42, "... consequently not altering the convergence of the optimization method" -- This sentence suggests that changing the Riemannian metric on the manifold has no bearing on convergence, which is not true (think for example of Newton's method, which can be though of as running gradient descent with the Riemannian metric dictated by the Hessian of the cost function). The statement in Theorem 4.3 is mathematically imprecise, as the main claim is an "equivalence" that is not defined. Please rephrase. It seems that the actual claim is rather that critical points are in one-to-one correspondence through phi. I imagine the same is true for second-order critical points? Def. 3.1: the map r should be smooth for this to be a retraction. Line 138, "almost all the manifold": do you mean on almost the whole manifold? Or do you mean on almost any manifold? Either way, what is the meaning of "almost all" here? Line 161: "When working on a manifold, we do not have a notion of the Lebesgue measure of a set." -- Since we are here working on a Riemannian manifold, can't we have a notion of measure from the Riemannian metric? Question to the authors, out of curiosity: do you envision that it would be easy (and beneficial) to extend these techniques to second order optimization? It seems this would be impeded by the necessity of computing the second derivative of the matrix exponential, which may not be trivial computationally?

### Reviewer 2

This paper studies a parametrization technique, which is called trivialization, for transforming a Riemannian manifold optimization problem to an equivalent problem in the Euclidean space. Algorithms used for Euclidean setting can then be used to solve this problem. Two useful trivializations are proposed: Riemannian trivialization and Lie trivialization. My main concern is the practicality of the proposed approach. It is not clear how to use the proposed trivializations to some common and simple manifolds: sphere and Stiefel manifold. Moreover, the computational cost of the trivialization seems to be still high, given that matrix exponential needs to be computed/approximated. Furthermore, what is the benefit of converting the manifold optimization problem back to Euclidean space? This ruins the possibly nice structure of the manifold, which may leads to gains on dimensionality reduction etc. ======== After reading the authors' rebuttal and other reviewers' reports, I am willing to increase my score, as some of my concerns are addressed appropriately.

### Reviewer 3

This work provides a new way of optimization on manifold through combination of parameterization and Riemannian gradient descent. The paper is very well written and the organization is clear. The experimental section is convincing and the references are adequate. The methods and proofs are technically sound. The paper also made sufficient connections and contrasts with the existing methods. The code is included in the submission for reproduciblity check. Significance: The paper proposed a more general framework for optimization through parameterizations than the state-of-the-art method and showed improvement in several numerical experiments.