Review for NeurIPS paper: Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond

NeurIPS 2020

Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond

Review 1

Summary and Contributions: This paper introduces an efficient and fast approach to marginalising out the latent variables in Gaussian latent variable models by using an embedded Laplace approximation. The paper then provides an adjoint method to enable derivatives of the approximate log marginal density to be used in HMC for inferring the approximate posterior over the hyperparameters. ############### Response to Author Feedback ############### Thanks very much for the author response and directly responding to my questions. It was very helpful to hear your comments on the Hessian. I agree with the authors that actually implementing an efficient (fair) version of work such as semi-separable HMC would be a challenge and also I agree that it has not seen much wide use. I also understand that including a way of representing tuning time is challenging, but if you can think of one, I think it would help (even mentioning the number of times to fit the model would definitely be useful). Just a comment: to overcome the issue of a GP having few hyperparameters, it might be possible to work with kernels that have a much larger number of hyperparameters, such as spectral kernels e.g "Gaussian Process Kernels for Pattern Discovery and Extrapolation" by Wilson and Adams (2013).

Strengths: The paper is written well and its relation to previous work is clear. The introduction is well-motivated and the combination of the Laplace approximation from Rasmussen and Williams with the use of the adjoint method is described in a way that makes the story of the paper easy to follow.

Weaknesses: My main questions regarding the paper: 1) When computing the Laplace approximation, this still requires calculation of the Hessian, which I believe is with respect to the latent (theta). This is referred to as W in Algorithm 1. Would it be possible to comment further on the kind of trade-off between implementing full-HMC, versus the overhead of calculating the Hessian. I think this is the issue you are referring to in the second paragraph of the discussion section, whereby you mention higher-order automatic differentiation. (I.e. I assume you stick to analytical Hessians (e.g. Table on Page 43 Rasmussen and Williams)). 2) Are there any alternative methods available that you have explored that marginalise out the latent variables? For example “Semi-Separable Hamiltonian Monte Carlo for Inference in Bayesian Hierarchical Models” by Zhang and Sutton jointly sample over hyperparameters and parameters to overcome similar funnel-like behaviours to that of the Gaussian latent variable models that you explore. Although not strictly necessary in this work, it might be interesting to compare. 3) While Figure 1 and the GP with a Poisson likelihood both successfully show the speed of the new approach, the results of Sections 5 and 6 are slightly harder to interpret in terms of the performance benefits. Both Figures show that full HMC’s error often drops in less time than the embedded Laplace method. I think it would be helpful to put in the overhead time in tuning full HMC somewhere in the plots to really highlight how much of a headache tuning is! 4) When the Laplace approximation is introduced (approx. Line 62), is there any implicit approximation that the joint posterior can be written as p(phi | y) p(theta | phi, y)? Is this an assumption being made and does this result in any limitations? (I may be wrong and it may be possible to decompose the posterior in this way for a Gaussian latent variable model but I would be interested to hear from the authors.)

Correctness: The paper seems correct.

Clarity: The paper is very well written and clear.

Relation to Prior Work: The relation to prior work is clear.

Reproducibility: Yes

Additional Feedback: The authors provide an extensive supplementary material and they also provide the code.

Review 2

Summary and Contributions: Performing Bayesian inference on Gaussian latent variable models can be challenging since MCMC algorithms struggle with the geometry of the posterior and can be prohibitively slow. One potential solution is to use a Laplace approximation to marginalize out the latent Gaussian variables and then integrate out the remaining hyperparameters using HMC. To implement this method efficiently, the authors derive a novel adjoint method that propagates the minimal information needed to construct the gradient of the approximate marginal likelihood

Strengths: 1. They discuss and analysis the disadvantages of the existing HMC methods on Gaussian latent variable models, which provide a clear motivation of their work. 2. Via simple and easy-to-understand theorical derivation, they propose a new method to marginalize using a Laplace approximation.

Weaknesses: 1. What does K means in the beginning of introduction? It makes me confusion. 2. The compared method in experiments is just the standard HMC. Is there any existing methods as a baseline to illustrate the advantages of their proposed method?

Correctness: I am not very familiar with this field. I try my best to understand their method and proofs. I think it is correct.

Clarity: The paper is well written.

Relation to Prior Work: Clearly.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The submission considers approximate Bayesian inference on latent Gaussian models by marginalizing out the latent Gaussian variables using a Laplace approximation and performing HMC for the hyper-parameters. The new contribution is that the gradient of the approximate log marginal posterior can be computed efficiently with automatic differentiation via vector-Jacobian products instead of previous work that computes the full Jacobian of the covariance function. ################################## POST AUTHOR RESPONSE UPDATE Having read the response from the authors as well as the other reviews, I will keep my weak accept score. A concern raised in my review was a lack of comparison to alternative approximate inference methods such as variational inference approaches, as also raised in review 4. The authors refer in their response to work comparing the embedded Laplace approximation to variational inference from 10 years ago. Arguably, there was some work improving variational inference approaches in the meantime that also have different trade-offs of complexity/performance vs computing time – and an empirical assessment as to where the proposed approach lies in this respect would be very helpful in a revised version. The authors also responded to include more details on the comparison with the adaptive HMC benchmark. ################################################

Strengths: The proposed method allows for approximate Bayesian inference over very high dimensional hyper-parameters in latent Gaussian models. The idea of differentiating through a Newton-solver and using the fact that this solver is linear in the gradient of the covariance function, is new as far as I am aware. Gaussian latent variables (with many hyper-parameters) are used frequently, so this method - as implemented currently in the Stan framework - can be quite relevant to the NeurIPS community.

Weaknesses: The paper evaluates the performance of the proposed method on three models with log-concave targets only, comparing it to HMC/NUTS only. It would be useful to add some experimental evaluation on more general likelihood function, particularly when the Laplace approximation becomes less accurate (targets that are multi-modal, non-constant curvature) and compare it with alternative approximate methods such as variational ones in this case. In the case where marginalizing the latent variable might lead to "nicer" targets, it would be interesting to compare the method to some additional adaptive MCMC methods - it is not so clear to me what adaptation is used on the bench-marked full HMC sampler (I guess the number of leap-frog steps and the step-size, but also some (non-diagonal) mass matrix/pre-conditioning?)

Correctness: The claims and empirical methodology seem correct.

Clarity: The paper is written relatively well.

Relation to Prior Work: Related prior work seems to be discussed properly.

Reproducibility: Yes

Additional Feedback: I was wondering if this gradient calculation is related to work that consider implicit gradients of hyper-parameters where the parameters follow SGD dynamics (Lorraine et al., Optimizing Millions of Hyperparameters by Implicit Differentiation, 2020; Shaban et al., Truncated Back-propagation for Bilevel Optimization, 2019)? Minor remark: What is Z following line 31 (Appendix) exactly, I assume it is different from Z defined in line 42, which seems to be not a linear function (but has a quadratic term)? The notations \pi(y|\phi) and \pi_{\mathcal{G}}(y|\phi) seem to be used quite interchangeably in the text/algorithm - is this intended?

Review 4

Summary and Contributions: The paper proposes a novel inference algorithms for Gaussian Process models with a very large number of hyperparameters. The approach involves using a Laplace approximation of the latent variables and a novel computation for the gradient of the target density w.r.t. the hyperparameters by computing the gradient's inner product with an appropriately chosen adjoint. The paper demonstrates that their approach works faster that pure MCMC even in cases when the number of hyperparameters are small. ====== POST REBUTTAL ============ The lack of comparison to *recent* VI work in GPs and in particular gpytorch remains a major drawback of the paper. Irrespective of the misnomer around ARD the unaddressed concern is that this work is not well motivated. Recasting an ARD problem to a GP with a diagonal covariance just so we can get an example with a large number of hyperparameters is not a great motivation for work on GPs IMHO.

Strengths: The claims in the paper appear sound. The calculation of the gradient of the target density w.r.t. the hyperparameters is essentially a rearrangement of terms in the classical literature to allow for a single pass of reverse-mode autodifferentiation. This seems like a very simple contribution, but it adds to our knowledge of GPs. The experiments demonstrate clearly that the described optimization helps in better runtime performance and for large number of hyperparameters the improvements are considerably better than pure MCMC. The work is somewhat significant and novel since it fills a missing hole in Laplace approximated Gaussian Process research when the number of hyperparameters are very large. The class of regression models where we are trying to determine the relevance of covariates is relevant to the NeurIPS community. Also a lot of work in the NeurIPS community has recently focused on Gaussian Processes.

Weaknesses: Since this paper falls in the class of approximate inference for Gaussian Processes it would have been appropriate to compare it to Variational Inference. In particular the gpytorch package which is frequently cited in GP research would have been appropriate to compare against. So it is unclear whether the improvements here as compared to pure MCMC would stack up better when compared to other approximate inference techniques. As far as signficance, it is very unusual for Gaussian Processes to have such a large number of hyperparameters. It is not well motivated as to why this is an interesting problem to solve. For example the problem in section 5 where a local scale term is introduced for each of the 6000 covariates seems like a typical problem in the Automatic Relevance Determination (ARD) class of models. ARD was also not referenced or compared against. It is not clear why we should recast this problem as a GP with a diagonal covariance?

Correctness: The claims and methods appear correct.

Clarity: The paper is very clearly written.

Relation to Prior Work: Comparison to VI or ARD was not provided, so the significance of the work is unclear.

Reproducibility: Yes

Additional Feedback: Line 18 describes the model structure as "latent Gaussian model:. But there is no agreement in the literature to justify this name. For example Deep Gaussian models assume that a sample from a standard normal is the first sampled variable followed by a deep net. While the model in line 18 uses a multivariate normal for the latent variable which is the second sampled variable. To be honest, the structure of this model looks exactly like a Gaussian Process. For example gpytorch uses exactly this structure. Alternatively, please provide a citation to defend the name Latent Gaussian Model. On Line 50, \pi is used for both the posterior and the prior. My suggestion would be to use `p` for the specifying the prior and `\pi` for the unnormalized posterior which is the more familiar convention to this reviewer. On Line 67 it is probably worth clarifying that \theta* is the mode "conditional on \phi". Line 99, although K could be arbitrary, in practice the possible choices of K for GPs come from a very small set, so its not clear how impactful this work is.