Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
EDIT: After reading the feedback of the other reviewers I realized that I was too harsh in the assessment of the paper. Consequently, I updated my assessment of the paper to a good submission that should be accepted at the conference. The paper in itself is well written, and the explanations are clear and concise. However, I don't feel NeurIPS is the appropriate venue to publish it. My main concern about the paper is with respect to its appeal for the ML community at large. My impression is that its scope is rather limited. The presented examples are of reduced dimension and it is not clear for me how the method could scale to large datasets. There are some leads given in the conclusion section, however I would have preferred to actually see them implemented instead of just being mentioned.
Originality: First-order Gradient-based MCMC methods have to deal with determining an appropriate length scale for each variable. NUTS is one approach and this paper gives another approach whereby a parameter theta of a proposal distribution is adaptively improved to account for the covariance structure. At the same time theta is adapted to consider the entropy of the proposal distribution. This trade off for theta is rolled into a new speed measure which is the central point of this paper. The paper includes a lower bound of the speed measure that can be directly differentiated resulting in a practical algorithm. The paper also includes a heuristic that makes this adaptive MCMC algorithm applicable to MALA as well. All of these ideas are very original. Clarity: The paper is quite clear, but maybe the authors could include more results from the supplement into the main text. Quality: The technical quality is good and the results confirm the claim that the new algorithm is definitely more sample efficient than NUTS. This makes this work a very important contribution to the field. There are only a few results in the main paper, but the supplement has a lot more results. Significance: The results are very important because they have provided an adaptive approach that can be applied to even the trivial random walk proposers as well as the more sophisticated Langevin proposers. Given the overall simplicity of the approach this would be a very strong competitor for the popularly used NUTS algorithm, specially in light of the presented results, and would trigger further research. Some minor comments: It was not clear how the Adaptive MCMC (AM) described in the results works i.e. what is the objective function of the adaptation. A reference is made to the supplement, but I didn't quite find it there (lines 226-227). Line 266 and Figure 2 top panel. This should really be the auto-correlation plot. It's hard to interpret auto-correlation from a trace plot. The choice of rho_t on lines 136, 137 is not well motivated. In similar situations a Robbins-Monro sequence is typically used. ** POST REBUTTAL ** Thanks for answering the questions and for promising to update the plots. However, I would strongly advise you to provide results comparing with Stan. It will be very hard to take this as an improvement over the state of the art without a comparison to the state of the art!
I think this is a very nice paper, which provides a thoughtful approach to an important problem. I found section 2 impressively clear and easy to follow; this is quite well-written. The algorithms for adapting random walk metropolis and MALA to fit full covariance matrices seem quite nice, and a great alternative to fitting covariance matrices by e.g. moment matching to accepted samples, or estimating from independent pre-runs. I think this has potential to see broad use. I also think the generalization of "speed" has potential to be used to derive adaptive variants of other common proposal types. Some minor comments / questions / complaints: * the one thing I find unsatisfying is the need to fall back on existing results for "optimal" acceptance rates when tuning \beta. For RW and MALA proposals, maybe this is even reasonable (there's some theory about optimal preconditioners, proposal covariances, mass matrices anyway…), but it would be nice to address this more generally, or at least discuss the appropriateness of falling back on these recommendations. * line 140, L is described as a "positive definite lower triangular matrix" … presumably a typo (it is correct later for MALA), but LL' is positive definite, while L is lower triangular * the second row of figure 2 is a bit hard to interpret; maybe something else would be better to show. While this should clearly be a monotonic function, I don't have a sense of the optimal choice of scale when used in the different algorithms, or whether it should be linear with the target density variance, or what, particularly for MALA. * It would be good to release code. The "stop gradient" aspect may induce difficulties in implementation, at least as a reference. * For MALA and NUTS, what sort of preconditioning is done as a baseline? For NUTS as implemented in STAN and PyMC3, I know a standard practice is to run a short pre-run which is used to estimate the mass matrix; this is then used fixed with only the step size adapted online. Is something like this done here? What about for the non-adaptive MALA? ====== I read the rebuttal — thanks for your comments.