__ Summary and Contributions__: This work tackles the important topic of efficient generalization over inference tasks by presenting an inference-agnostic training procedure for undirected graphical models. The authors propose AGM, which adversarially trains undirected graphical models by representing potentials as the output of a generator. Inference is done by sampling different sets of potentials, running belief propagation on each, and aggregating the resulting marginal distributions.

__ Strengths__: The topic of this paper is important. Efficient and generalizable inference procedures that do not require retraining undirected graphical models are certainly of interest to a significant portion of the community. One important aspect of this work is that the generation of graphical models happens at the level of potentials, therefore retaining the explainability of undirected graphical models. Moreover, the experiments are well thought out, as they assess the generalization capability of AGM's inference procedure.

__ Weaknesses__: The experimental protocol is unfortunately lacking. Moreover, this work fails to acknowledge other methods that address the problem of generalizable test-time inference for undirected graphical models [1, 2, 3]. The paper would benefit from differentiating itself from the aforementioned work and, even better, from comparing against some of it. The scale of the experiments is too small. The paper only uses multilayer perceptrons; using convolutional neural networks and larger datasets (CIFAR-10 or SVHN, for instance) would make the experiments on image data stronger. There are no experiments showing the advantage of generating potentials instead of generating data directly; this is important because [1, 2, 3] operate at the level of the data. Finally, a clear discussion of the computational costs would be helpful.
[1] Ivanov, et al. "Variational Autoencoder with Arbitrary Conditioning." International Conference on Learning Representations. 2018.
[2] Douglas, et al. "A universal marginalizer for amortized inference in generative models." arXiv preprint arXiv:1711.00695 (2017).
[3] Belghazi, et al. "Learning about an exponential amount of conditional distributions." Advances in Neural Information Processing Systems. 2019.

__ Correctness__: The experimental results do not show error bars, which makes it difficult to form a definite opinion about the results.

__ Clarity__: The exposition is clear and the flow of the different sections is logical.

__ Relation to Prior Work__: The paper does a good job of surveying prior literature but misses a few relatively recent, highly relevant works. It would greatly benefit this work to differentiate itself from the aforementioned literature.

__ Reproducibility__: Yes

__ Additional Feedback__: *** score increased to 6 ***

__ Summary and Contributions__: In some inference scenarios we may not have access to the true graphical model, but we may have access to a distribution of plausible graphical models.
This paper presents a new approach to perform inference on a distribution of graphical models. Parameters of an ensemble of graphical models are generated with Generative Adversarial Networks. Inference is then performed on this ensemble of graphical models. This allows for better generalization across a distribution of inference tasks where the graphical model is not fully defined for each particular task.

__ Strengths__: The paper is very well written. The contribution is novel and well defined. The method is benchmarked over a wide variety of datasets.

__ Weaknesses__: The main weakness I find is that the presented method (AGM) does not outperform (EMG) by a very significant margin, while it adds some extra complexity. Regardless of that, I still think the work is valuable to the community, and even if the accuracy gap is “incremental”, the novelty of the algorithm is not.

__ Correctness__: The claims and methodology seem correct to me.

__ Clarity__: The paper is very well written and easy to read.

__ Relation to Prior Work__: The work is properly contextualized. I think it may also intersect with meta-learning, where an algorithm learns to be flexible across a variety of tasks, i.e. learning to learn. It would be great if the authors added a small paragraph in the related work or conclusions relating their algorithm to this line of research.

__ Reproducibility__: Yes

__ Additional Feedback__: --- Post rebuttal update ---
Most of the points raised by the other reviewers were addressed in the rebuttal; therefore, my mark will remain an Accept (7).

__ Summary and Contributions__: This paper is focused on learning models that are able to perform MAP inference on discrete variable problems where the set of conditioning variables may vary at test time. To model this, the authors propose a generative model over pairwise undirected probabilistic graphical model parameters, called AGM, which is trained in an adversarial manner. A noise vector is transformed into a set of potentials using a neural network, from which marginals are obtained by using belief propagation; these marginals are then compared against the marginals corresponding to the training data using a WGAN loss. Experiments are run to evaluate this method on "inpainting tasks" (given some subset of variables, predict the states of the rest), where the distributions of which variables are presented and which must be predicted may vary between training and evaluation. There is also an evaluation of the generative modeling capabilities of this approach.
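
For concreteness, the inference procedure summarized above can be sketched as follows. This is only a minimal illustration under my own assumptions, not the authors' implementation: the graph is a small chain (so belief propagation is exact), `toy_generator` is a hypothetical, untrained stand-in for the learned generator, and all names, sizes, and the random linear maps are made up for the example. Training with the WGAN loss is not shown.

```python
import numpy as np

def chain_marginals(unary, pairwise):
    """Sum-product belief propagation on a chain MRF (exact on a chain).

    unary:    (n, k) positive node potentials
    pairwise: (n-1, k, k) positive edge potentials; pairwise[i][a, b]
              couples x_i = a with x_{i+1} = b
    returns:  (n, k) node marginals
    """
    n, k = unary.shape
    fwd = np.ones((n, k))  # message arriving at node i from its left neighbor
    bwd = np.ones((n, k))  # message arriving at node i from its right neighbor
    for i in range(1, n):
        m = (fwd[i - 1] * unary[i - 1]) @ pairwise[i - 1]
        fwd[i] = m / m.sum()  # normalize for numerical stability
    for i in range(n - 2, -1, -1):
        m = pairwise[i] @ (bwd[i + 1] * unary[i + 1])
        bwd[i] = m / m.sum()
    beliefs = fwd * unary * bwd
    return beliefs / beliefs.sum(axis=1, keepdims=True)

def toy_generator(z, W_unary, W_pair, n, k):
    """Hypothetical stand-in for a trained generator: noise -> positive potentials."""
    unary = np.exp(np.tanh(W_unary @ z)).reshape(n, k)
    pairwise = np.exp(np.tanh(W_pair @ z)).reshape(n - 1, k, k)
    return unary, pairwise

def ensemble_marginals(num_samples, n=4, k=2, z_dim=8, seed=0):
    """Sample several sets of potentials, run BP on each, and average the marginals."""
    rng = np.random.default_rng(seed)
    # Random weights play the role of generator parameters (assumption: untrained).
    W_unary = rng.normal(size=(n * k, z_dim))
    W_pair = rng.normal(size=((n - 1) * k * k, z_dim))
    per_model = []
    for _ in range(num_samples):
        z = rng.normal(size=z_dim)
        unary, pairwise = toy_generator(z, W_unary, W_pair, n, k)
        per_model.append(chain_marginals(unary, pairwise))
    per_model = np.stack(per_model)  # (num_samples, n, k)
    return per_model.mean(axis=0), per_model

avg_marginals, per_model_marginals = ensemble_marginals(num_samples=16)
```

Conditioning on observed variables could be added by clamping the corresponding rows of `unary` to one-hot vectors before running belief propagation; I omit this to keep the sketch short.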

__ Strengths__: This work is presented very clearly and is easy to follow. The idea of learning an ensemble of graphical models generatively in an adversarial manner is novel as far as I'm aware. The experiments are thorough in comparing the performance of AGM and GibbsNet in scenarios where the distribution of query variables changes between training and evaluation.

__ Weaknesses__: I'm a bit confused about the evaluation of the approach. What is learned is a generative model over probabilistic graphical models; however, the focus in experiments I and II is on conditional MAP inference. In this setting, the model is being used as a structured output prediction model, and so comparisons are missing against other structured prediction models, examples being [1], [2], and [3] (see "related work" section for refs). [3] is of particular note, as it is also trained in an adversarial manner. If the primary use of this model is for conditional MAP inference, then it is important to understand how well AGM compares against other similar models. That being said, since the samples themselves are unconditional, this approach is at a disadvantage compared to these other approaches, which condition their "samples" on the input. I think the motivation is unclear here: why use an unconditional model to obtain graphical model parameters when we could use a conditional model to do so and train it on a variety of "inference tasks" so that it is robust to these?
If the goal is primarily to be used as a generative model over discrete variables, then much more emphasis needs to be placed on this in the experiments section, and comparisons against more generative models need to be made.
The image datasets used are rather simple; it would have been nice to see experiments on a more complicated inpainting task, e.g. using semantic segmentations.
Additionally, since inference is stochastic in nature, it is important to understand the variance in predictions made by using this method, and how this changes as you aggregate more samples. However, this is not presented or discussed in the experiments anywhere.
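
To make the request about variance concrete, one possible analysis is sketched below. It is an assumption on my part, not something taken from the paper: given a pool of per-model marginals, it resamples ensembles of size m and reports how much the aggregated marginals fluctuate across resamples. The pool here is synthetic (drawn from a Dirichlet) purely so the snippet runs; in practice it would hold the marginals produced by the sampled graphical models.

```python
import numpy as np

def aggregate_spread(per_model_marginals, sizes, repeats=200, seed=0):
    """For each ensemble size m, resample m models with replacement `repeats`
    times, average their marginals, and return the standard deviation of that
    average across repeats (averaged over all variables and states)."""
    rng = np.random.default_rng(seed)
    pool = np.asarray(per_model_marginals)
    spread = {}
    for m in sizes:
        idx = rng.integers(0, len(pool), size=(repeats, m))
        means = pool[idx].mean(axis=1)  # (repeats, n_vars, n_states)
        spread[m] = float(means.std(axis=0).mean())
    return spread

# Synthetic stand-in pool: 500 "models", 4 binary variables (hypothetical data).
pool = np.random.default_rng(1).dirichlet(np.ones(2), size=(500, 4))
spread = aggregate_spread(pool, sizes=[1, 8, 64])
```

A table or plot of `spread` against the ensemble size would show how many samples are needed before predictions stabilize.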

__ Correctness__: The methods look correct, and the experimental evaluation seems sound. The title of the paper is highly misleading though - the inference isn't really learned, since it is using standard belief propagation techniques. There exist papers which are focused on implementing "inference networks" which are trained to approximate structured inference, and so reading the title may lead the reader to believe that a similar model is covered here.

__ Clarity__: The paper is very easy to read and understand - you did a great job with this!

__ Relation to Prior Work__: The relation to previous models is presented clearly; however, as discussed above, proper comparisons are not made against models trained for MAP inference tasks. Some additional references mentioned above:
[1] Chen, Liang-Chieh, Alexander Schwing, Alan Yuille, and Raquel Urtasun. "Learning deep structured models." ICML 2015.
[2] Belanger, David, and Andrew McCallum. "Structured prediction energy networks." ICML 2016. (there are a few followups to this paper as well)
[3] Tu, Lifu, and Kevin Gimpel. "Learning Approximate Inference Networks for Structured Prediction." ICLR 2018.

__ Reproducibility__: Yes

__ Additional Feedback__: One additional comment: since the tables report results averaged over a few trials, it would be great to also report, e.g., standard deviations to get a sense of how consistently the methods perform.
Overall, I think the work is interesting, but I think the focus of the experiments needs to be clearer. Whether the final focus ends up being on MAP inference, generative modeling of discrete variable problems, or some combination of the two, additional comparisons need to be made to appropriate models.
----------------------------------------------------------------------------------------------------------
POST AUTHOR RESPONSE UPDATE: Thanks for the clarifications you provided in your response. Given these, and since you addressed my concerns regarding the datasets and the variance of results, I am increasing my score. I still think the paper would be stronger if it included a comparison against some structured prediction approaches - even though they don't solve the broader problem described in the introduction, they can be applied to several of the tasks used during experimentation, and it would be interesting to see how everything compares.