NeurIPS 2020

Relative gradient optimization of the Jacobian term in unsupervised deep learning

Review 1

Summary and Contributions: The paper proposes an improvement in the time complexity of computing maximum likelihood estimates for an invertible parametrized function mapping a latent variable to an observable variable. In the main setting, the inverse function is assumed to be computable by a fully connected deep neural net, and the distribution on the latent variable is "nice" (a coordinate-wise product distribution). The computation naturally requires computing the gradient of the loss function using backpropagation, which decomposes into a layer-by-layer computation. The bottleneck is the Jacobian term (the partial derivatives of the mapping with respect to the observables), which a straightforward analysis shows requires inverting the matrix of edge weights. For a D × D matrix, this is trivially O(D^3), and even with non-trivial methods still much worse than O(D^2). The main contribution of the paper is to propose a backpropagation update using relative gradients, based on a multiplicative perturbation of the matrix of weights. This leads to O(D^2) performance per layer, and the improvement is demonstrated by experiments.
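The bottleneck the review describes comes from the identity d/dW log|det W| = W^{-T}, so every ordinary gradient step on the log-determinant term requires an O(D^3) matrix inversion. A minimal NumPy sketch (not from the paper's code) verifying this identity against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
W = rng.standard_normal((D, D))  # a generic random matrix is invertible a.s.

# d/dW log|det W| = W^{-T}: this inverse is the O(D^3) step the paper avoids.
grad = np.linalg.inv(W).T

# Check each entry against a central finite difference of log|det W|.
eps = 1e-6
fd = np.zeros((D, D))
for i in range(D):
    for j in range(D):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        fd[i, j] = (np.log(abs(np.linalg.det(Wp)))
                    - np.log(abs(np.linalg.det(Wm)))) / (2 * eps)

assert np.allclose(grad, fd, atol=1e-5)
```

The paper's relative-gradient update sidesteps this inverse entirely, since multiplying W^{-T} on the right by W^T W yields W.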

Strengths: The paper clearly touches on very important questions: learning latent variable models, modeling complex data with deep networks, and addressing questions in unsupervised learning. The contribution is simple and elegant, and hence potentially applicable in practice. The improvement is mathematically provable.

Weaknesses: It would have been nice to see examples of concrete and rigorous high dimensional latent variable models that can be solved in this manner.

Correctness: The paper seems to be correct.

Clarity: The paper is clearly written.

Relation to Prior Work: As far as I can tell, they relate correctly to past work.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper proposes to use the relative gradient, which simplifies the gradient of the Jacobian term, in the optimization of fully connected layers in unsupervised learning. It claims the approach is less restrictive than normalizing flows while still offering similar performance improvements over naive gradient approaches.

Strengths: The paper points to an interesting perspective: simplifying the gradient computation of the Jacobian term by using alternative optimization schemes. The proposed approach is simple and seems effective on the small set of evaluations. If the approach is really effective, it will have significant impact.

Weaknesses: The main limitations of the paper are twofold: 1) The formulation of the relative gradient is somewhat hand-wavy and lacks theoretical justification of its convergence properties. Even analysis in a simple problem setting would suffice to demonstrate at least a) that it converges, b) that it converges to the correct solution, and c) some characterization of its convergence rate. 2) The evaluation is weak. To back up the claims, the approach should be evaluated on more challenging tasks with comparison to more alternatives, such as normalizing flow variants. In addition, one major advantage claimed by the paper is that the relative gradient is less restrictive than normalizing flow approaches. To further support this claim, the paper should identify a few example problems where normalizing flows struggle in practice and compare their performance with that of relative gradients.

Correctness: The derivation and empirical methodology look correct.

Clarity: The paper is clear. However, I think it spends too much effort introducing existing concepts and too little on the proposed approach, for example convergence guarantees/properties and more thorough empirical evaluation.

Relation to Prior Work: The discussion on prior work seems sufficient.

Reproducibility: Yes

Additional Feedback: In Table 1, what is the reported number? Please specify it explicitly in the table. Also, what are the other models being compared to? Either explain them in the caption or in the main text and, if applicable, add citations near the model names.

After reading the other reviews and the authors' rebuttal, I maintain the same rating.

Review 3

Summary and Contributions: Quite a bit of recent research on deep density estimation under the normalizing flows umbrella has focused on efficiently computing (a restricted form of) the Jacobian term that appears in the objective. Such models operate with a set of transformations for which the computation of this term is easy. While arbitrary distributions can be learned by such methods, the features that are learned are quite skewed, which can prevent learning a properly disentangled representation. This paper presents a conceptually simple method to optimize for exact maximum likelihood in such models. In particular, the authors consider a transform from the observed to the latent space that is parameterized by fully connected networks, with the only constraint that the weight matrices are invertible. Since the parameters of the transformation are matrices, the authors use properties of the Riemannian geometry of matrix spaces to derive updates in terms of the relative gradient. This gives an efficient way to update the Jacobian term, not just in the context of normalizing flows, but in more general contexts where it might appear. The main model considered is given by equation (2), its factorized variant by equation (3), and a conditional variant by equation (4). Evaluating the gradient of the first term is easy as long as the model has a simple form for the latent density. The second term, the Jacobian term, is the main bottleneck outlined above. It is known that forward/backward-mode autodiff scales as O(LD^3), where L is the number of layers and D is the data dimension, because computing the gradient of the log-determinant involves a matrix inversion. The contribution of this paper is to use the relative gradient so that the matrix inversion is unnecessary.
Moreover, by applying the relative gradient carefully (equation 15), an expensive matrix-matrix multiplication, otherwise needed to compute the relative gradient from the ordinary gradient, is also avoided. Thus we get a cheap method to compute the gradient of the log-determinant term.
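For a single linear layer z = Wx with a standard-normal latent prior, the two-step saving the review summarizes can be checked numerically. The sketch below (my own illustration, not the paper's code) computes the ordinary gradient, which needs W^{-T}, and the relative gradient as the cheap per-sample form of equation (15), where the log-determinant contribution collapses to W and the data term needs only matrix-vector products:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5
W = rng.standard_normal((D, D))  # assumed invertible (generic random matrix)
x = rng.standard_normal(D)

# Per-sample objective: L(W) = log N(W x; 0, I) + log |det W|.
z = W @ x
delta = -z  # backpropagated error: d/dz log N(z; 0, I) = -z

# Ordinary gradient: outer-product data term plus the log-det term W^{-T}.
# The inverse costs O(D^3) per step.
ordinary_grad = np.outer(delta, x) + np.linalg.inv(W).T

# Relative gradient = (ordinary gradient) W^T W. The log-det term becomes
# W^{-T} W^T W = W (no inversion), and the data term
# (delta x^T) W^T W = delta ((W x)^T W) uses only matrix-vector
# products, i.e. O(D^2) per sample.
relative_grad = np.outer(delta, (W @ x) @ W) + W

# Sanity check: the cheap form agrees with the expensive one.
assert np.allclose(relative_grad, ordinary_grad @ W.T @ W)
```

The equivalence check at the end is exactly the point of the trick: the same update direction is obtained without ever forming an inverse or a D × D matrix product with the raw gradient.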

Strengths: The paper gives a very simple method to train linear flows using the relative gradient instead of the normal gradient.

Weaknesses: Since the main selling point of the method is computational speedup, it would be useful to report numbers that indicate so in the experiments.

Correctness: Yes.

Clarity: The paper is well written. There are occasional typos and incorrect capitalizations (like M for maximum likelihood), which the authors should fix.

Relation to Prior Work: Discussion of prior work is thorough.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: The main contribution is to propose to perform gradient descent on a change of variables that simplifies computations, avoiding an expensive (cubic cost) matrix inverse, in a deep latent variable model. They derive the corresponding expressions for a fully connected network and empirically compare their technique to gradient descent (using Adam, see comments below), achieving a substantial speedup.

After the authors' rebuttal: I find the new experiment with SGD interesting; it seems to show that the proposed technique works as described in Sec. 4 (and not only in conjunction with Adam). However, I don't understand the authors' statement that SGD is slower than Adam, when their plot shows the opposite during the first 50 iterations. Regarding autodiff vs. standard, I still don't understand the difference, since autodiff is supposed to implement the same formulas as in Sec. 3 (but automatically). This should be clarified in the main body. I still agree (like other reviewers) that the empirical part is weak. For these reasons I will keep my score unchanged.

Strengths: The technical part is sound and introduces a technique that, to the best of my knowledge, has not been used in previous work. Since it drastically reduces the computational cost of training a fully connected deep latent variable model, it is certainly a significant contribution. To fully appreciate the significance of the technique, I wonder whether a similar change of variables could be applied to other deep learning setups.

Weaknesses: On the empirical side, I find it striking that no experiment uses regular SGD without Adam. A confounding effect of Adam could be to somehow compensate for the missing matrix inverse in the update step. Did you try it without success? This is an important drawback: the theory is elegant and sound, but it is not clear that the experiments support the claim. In Appendix F, about including the bias term in the relative gradient expression, you mention that you use a projected gradient algorithm. Can you elaborate a bit on that? Does this simply correspond to setting the elements of the augmented matrix to zero?

Correctness: While the technical part is correct as far as I can tell, I find the empirical evaluation rather weak since, as already mentioned, it lacks a simpler experiment without Adam that would empirically evaluate the proposed theory.

Clarity: The paper is well written, with everything clearly explained. Since a large amount of material is in the appendix, I would have appreciated direct references to the corresponding sections in the main body to improve readability. Some things that should be clarified:
- In the main body, what is the difference between "standard" and "autodiff" in Fig. 1?
- In Table 1, I assume you report log-likelihoods, but this is missing from the caption.

Relation to Prior Work: This is a rather uninformed opinion since I do not know the recent literature in the field, but it seems like the literature review is complete. N.B.: part of the related work section is in the appendix.

Reproducibility: Yes

Additional Feedback: typos: l114-115: with