Reviews: Meta-Learning MCMC Proposals

An approach for learning proposal distributions in MCMC is proposed. The problem of learning high quality proposal distributions is challenging and therefore general purpose methods for obtaining such proposals is significant for probabilistic inference. The main idea is to learn the proposal distributions for block sampling in Gibbs using mixture density neural nets. Specifically, the neural net is trained assuming a specific structure for the block which is referred to as motifs. The training is performed by finding local graph structures in the input model that matches this motif, using the local parameters, and then instantiating the Markov blanket (or a subset of variables not in the block) randomly to generate training data. The neural net then learns the proposal by minimizing the KL divergence between the proposal and the true distribution for all parameterizations of the block. Overall the paper is well written. The idea seems reasonable, maybe not so novel since learning proposal distributions has been explored before (maybe not specific to neural nets). One possible drawback is that maybe we can use very simple motifs due to the difficulty in training more complex ones. Does that really help in complex graphical models. Also, the structure of the motifs are hand coded, is that a problem? Also it looks like we need to retrain the neural net for every graphical model even if they have same structure, is this correct? The experiments seem reasonable, except there is no other state-of-the-art benchmark (not sure if some of the other techniques in related work could have been used) used in the comparison. The technique shows promise in UAI benchmarks, GMM sampling and also in a real-world problem of Named entity recognition.

Reviewer 2

This paper describes a method to learn smart proposals for speeding up MCMC inference in probabilistic graphical models (PGMs). The main idea is to decompose the graph into possibly overlapping "structural motifs" such as grids or chains with similar relationships between the variables that are either proposed variables, conditioned variables, or local parameters. Using an appropriate distribution, samples are generated for training a probability distribution on the proposed variables given the conditioned variables and local parameters for each instantiation of the motif in the PGM. These learned proposal distributions are then used to generate proposals which are then accepted/rejected using the MH rule. (Variables not covered by any motif are sampled using single-site Gibbs). This work is very similar to inference compilation cited as [19] in the paper. However, the big difference is that inference compilation works on training a proposer for the entire probabilistic program whereas this paper builds up the proposer from multiple parts where each part is trained on a smaller sub-structure in the probabilistic model. The hope being that training a proposer on smaller sub-structures would allow those to be reused across models. Overall the work is quite sound and the results are very promising. The main reason for my low overall rating is to do with the disparity between the claims made in the abstract/introduction with the actual contents of the text. The authors claim, for example, that the "The learned neural proposals generalize to occurrences of common structural motifs across different models, allowing for the construction of a library of learned inference primitives that can accelerate inference on unseen models with no model-specific training required." However, there is no evidence given in the text of a proposer being trained on one model and then used for inference on another one. The other drawback of the paper is that the authors seem to suggest that they have a magical blackbox method that works in all models without any hand tuning as opposed to other techniques (line 51). But in every experiment described their appears to be a fair amount of hand tuning in terms of figuring out how to generate the data for training the proposers. In one case they decide to collapse out a large number of the variables all together! This is more of a collection of recipes rather than a systematic procedure for training the proposers. Finally, it appears that the work is not applicable beyond the realm of fixed size models for two reasons. 1) Even though the authors mention probabilistic programming languages (PPLs) early in the introduction there is no follow up in the text. For example, one would expect some discussion about extracting motifs in the context of a concrete PPL. Also, the paper assumes that the motifs can be extracted at the beginning of inference (Algorithm 1). However, in PPLs the graph structure changes with different values of the latent variables, so it doesn't appear applicable. 2) I don't buy the open universe probabilistic model (OUPM) example in the text. Instead of sampling the number of objects in the universe either directly or lazily (through a Dirichlet process) this example has an upper limit on the number of objects and assigns an "activity" indicator to each possible object with the total number of objects defined as the sum of the active objects. This is an oversimplification. OUPMs are meant to capture not just an unknown number of objects, but also potentially unbounded number of objects. There is nothing in the text of this paper that describes how a proposer would intelligently determine adding or subtracting objects from the model. Despite these drawbacks I believe that the work describes a significant improvement over single-site Gibbs for fixed size models with discrete-valued variables, and is worth publishing. More detailed comments: - The toy example in Fig 1 is misleading. It seems to suggest that the Neural Network is trained only on the local model parameters alpha and beta. While in reality it appears that the neural network is trained on the local model parameters as well as the conditioning variables. - It is truly baffling that the conditioning variables for a motif might not include the Markov Blanket of the proposed variables. This appears to run contrary to the theme of the paper that the proposers learned from these motifs are reusable. For example this work is attempting to approximate the probability distribution p_\psi_i(B_i|C_i) at the beginning of inference, but this distribution depends on the values of the remaining variables in MB(B_i) - C_i which can change as inference progresses. In other words, it appears that the proposal that is learned may not even be relevant from sample to sample let alone be reused across models. The GMM is the only example in the experiments where the Markov Blanket is not a subset of the conditioning set. In the GMM case I suspect that the n=60 randomly selected samples that are part of the motif are a good enough approximation of the Markov Blanket. It would have been helpful to demonstrate more clearly with an example why the authors decided that C_i doesn't equal MB(B_i). - Line 91 talks about "training with random model parameterization", but this is not a well developed concept in this paper. There are a series of examples in the experiments, but no crisp procedure. - Line 208, 209. It is a stretch to conclude from one observed low KL-divergence in this small grid example that this approach works in general to approximate true conditionals. Also, it is not clear from the text if the inference used a single proposal for all occurrences of the grid motif in the graph or were separate proposers trained for each grid motif. The other unanswered question is the degree of overlap between grids, and whether that affects inference. Equation (4) is a bit confusing. This is describing a distribution for a binary valued variable, right? So why do we have "[0,1]" on the first line and "[1,0]" on the second line instead of say "0" on the first line and "1" on the second line? - Line 252. It is not entirely clear how the samples for training the proposer were obtained. Is the model being simulated to generate samples for the training data (as in [19])? Minor: Line 162. I am assuming that a probability table for a bounded discrete table is simply a categorical distribution over the possible values. So, then why do we need a mixture of these. Can't a mixture of categoricals be represented as a single categorical? 1 (a) -> om -> on Post-Response: I am willing to accept the authors response that collapsing out the GMM is not necessary to get better performance, and that they will provide results without collapsing in the next version. However, the claim that OUPMs can be handled with RNNs requires more scrutiny, and such claims should not be added without review. For the current NIPS'18 submission, the discussion on OUPMs and PPLs properly belongs in the "future work" section of the paper given the sketchy details. The paper still has a number of important under-developed concepts such as not using the Markov Blanket for the conditioning set. The authors haven't explained why this choice of the conditioning set doesn't affect reusability of the trained proposers.

Reviewer 3

This paper describes a method for training mixture density networks to act as re-usable block proposal distributions for a Metropolis-in-Gibbs sampling scheme. The proposals are trained to approximate the actual full conditional distribution as closely as possible. When these learned proposals can accurately represent conditional distributions for large blocks of latent variables, this can be much more efficient than e.g. random walk Metropolis-in-Gibbs approaches. The paper is clearly written, and outlines an approach based around identifying structural "motifs" in probabilistic graphical models which frequently occur, and training proposals which will function well in these settings. In the toy example shown in figure 1, the obvious criticism is that this process only generalizes well to alpha and beta values similar to those seen in training. The re-usability of the trained proposal for different instances of a particular motif depends on the values of the parameters having high probability under the training distribution that randomly generates alphas and betas. This overall corresponds to choosing prior distributions in eqns 2 and 3 which will be appropriate across multiple models. I think these issues are discussed fairly well throughout the examples, which do a good and honest job of showing how this approach can be applied to new problems — both the aspects which are straightforward, and those which are somewhat fiddly (e.g. targeting the mixture model with the discrete variables marginalized out). I believe the core innovation of this paper is the observation that many motifs are re-usable across different models. This is a generalization of the re-use of of proposals in both Gu et al "Neural adaptive sequential Monte Carlo" and Paige & Wood "Inference networks for sequential Monte Carlo in graphical models" — in both of these, when considering sequential models, the same proposal model was re-used at each timestep, despite the parameters (and one-step SMC / particle filtering target) changing. Obviously, both those papers consider SMC as the base inference algorithm, rather than Gibbs sampling, and as such aim to learn approximations to the filtering distribution rather than the full conditional. While I am not aware explicitly of work which uses trained proposals like these as block Gibbs updates, the fact that proposals designed for use in a sequential Monte Carlo setting (perhaps with blocks of latent variables per step, as in Paige & Wood) can be adapted for use as Metropolis-in-Gibbs proposals, is not a huge conceptual leap, and in fact the objective function for training the network is essentially the same in all three of these papers, as well as in Le et al. However, to the best of my knowledge no one else has made an effort to re-use trained proposals even across entirely different models. I also think this paper overall presents the approach clearly, and in a way which could encourage others to go ahead and train proposals in this manner. One small annoyance: given the focus on probabilistic programming languages in the introduction, I was hoping to see this implemented in a re-usable way in some probabilistic programming language which is at least as expressive as needed for the open-universe GMM example. Such an implementation would certainly help adoption — by my understanding, although much of the cited related work attempts to target very general classes of models, in the end the implementation code ends up being quite model-specific; unfortunately this seems to be the case here, as well.

Paper ID:	2052
Title:	Meta-Learning MCMC Proposals

Reviewer 1

Reviewer 2

Reviewer 3