__ Summary and Contributions__: The paper proposed a bi-level variational framework to enable score matching methods to train general energy-based latent variable models. The original formulation is intractable, so the paper further propose a practical optimization algorithm based on gradient unrolling. Theoretical study (bias and convergence) on the practical algorithm and empirical study on MNIST and CIFAR-10 datasets are provided to validate the proposed method.

__ Strengths__: The paper is very well-written and I enjoy reading the paper.
The proposed variational extension of score matching and the gradient unrolling optimization seems reasonable and novel to me.
It also conducted rigorous theoretical study to analyze the practical approximation of the gradient optimization. The experiments on CIFAR-10 and MNIST also demonstrate the effectiveness of BiSM.

__ Weaknesses__: Here are some parts that the paper could potentially improve:
- Some typos: e.g. in line 41-43, MLE should come first and SM should come second?
- For theorem 2, it would be more interesting to explore the setting where G(theta, phi) is not strongly convex (i.e. a weaker assumption), although the assumption is acceptable if it is necessary for making things feasible. Also it seems there is a missing dependence of the bound on the batch size in theorem 2 and corollary 3, are you assuming infinite batch size here? Usually, SGD with biased gradient also depends on the batch size in a non-negligible way.
- Furthermore, in line 173, I noticed that the paper update phi for K times on the same minibatch. Is this a special design? Why not use different batches (which seems to be less biased)?
- Also in the paragraph following theorem 2, the paper claims the theorem provides insights into implementation. According to the theorem, the gradient estimation becomes less biased when N is larger. Is this consistent with your empirical observation? I didn't find ablation study on the hyper-parameter K.
- Practical usefulness: I understand that the aim of the paper is not to establish a new SOTA. But still I wonder if the proposed method provides any additional practical benefits. It would be cool if the paper can demonstrate this. For example, is there any interesting results if we do Langevin sampling on both image space and latent space? Is it possible to do controllable image generation by manipulating or interpolating the latent variables? These make it different from a standard EBM. Also is it scalable to higher dimension such as CelebA 128x128?
- Usually to make score matching work for images, we need to apply noise annealing on the images [1]. Is it necessary for the proposed method?
[1] Generative Modeling by Estimating Gradients of the Data Distribution

__ Correctness__: Yes.

__ Clarity__: Yes, the paper is well written.

__ Relation to Prior Work__: Yes, the paper provides preliminaries and discussions on previous works. Also some recent works on energy-based latent variable models should be discussed such as:
- UNBIASED CONTRASTIVE DIVERGENCE ALGORITHM FOR TRAINING ENERGY-BASED LATENT VARIABLE MODELS

__ Reproducibility__: Yes

__ Additional Feedback__: Post-rebuttal feedback:
I have read the author response and other reviews. I appreciate the authors' efforts on addressing our concerns. I think my concerns were mostly addressed and I hope the authors will work them into the next version.

__ Summary and Contributions__: This paper generalizes score matching method to learn energy-based model with latent variables. It adopts a variational approximation of the marginal log-likelihood of the visible units, and approximate the posterior of hidden units by another DNN. The learning is accomplished by alternative gradient descent with gradient unrolling. Experiments show that the proposed method is comparable to CD and SM based methods when the ground truth posterior is known, and it can model complicated EBLVM for natural images.

__ Strengths__: - This paper provides sound theoretical analysis of the proposed method, including the relation to SM, the effectiveness of the learning and asymptotic behavior.
- Experiments compare with various baseline methods (CD or SM based), validating the effectiveness of the proposed methods.
- The proposed method is promising towards learning of EBLVM.

__ Weaknesses__: - The proposed gradient unrolling method requires N steps to unrolling the gradient, which is slow and perhaps difficult to scale up to learning large and complicated EBLVMs. Although corollary 3 indicates that the estimation accuracy can be asymptoticly arbitrarily small, that requires N to be sufficiently large that may exceed the computing limit.
- For comparison, at least one NCE-based method should be included. [1] shows that with a strong noise distribution, this line of work is possible to learn EBM on natural images.
- In Table 2, it seems that higher dimension of h leads to worse result. Possible reason needs to be discussed.
- Figure 4 shows that the learning can be unstable.
[1] Flow Contrastive Estimation of Energy-Based Models

__ Correctness__: The claims, method and empirical methodology look good to me.

__ Clarity__: The paper is well written and easy to understand.

__ Relation to Prior Work__: Yes it details prior work and relation to them.

__ Reproducibility__: Yes

__ Additional Feedback__: ===== after rebuttal ======
I have read the authors' reply, and they address my concerns about comparison to prior work. I decide to keep my original rate.

__ Summary and Contributions__: This paper presents a novel method for training energy-based latent variable models (EBLVM). This method is based on minimizing the fisher divergence between the model and the data distribution and is accomplished with an extension to score matching or denoising score matching. The fisher divergence is in general intractable to compute for latent variable models and the proposed method approximates it using a variational approximation to the distribution of hidden variables given the visible variables p(h|v).
The authors prove that their proposed training method is consistent given some standard assumptions.
The proposed approximation cannot be naively optimized given the dependence of the model objective on the parameters of the variational approximation. To address this the authors develop a further approximation which is based on a bi-level optimization. Within a given mini-batch, the variational posterior is optimized by a few steps of SGD to reach supposed optimality. Then this updated posterior (and all steps of the optimization) is used in computing gradients for the model objective.
The authors train Gaussian-Bernoulli RBMs using their method on a toy dataset and small image dataset. They further train a deep EBLVM on MNIST and CIFAR10 and demonstrate that their models can draw samples of comparable quality to recent EBM variants with no latent variables.

__ Strengths__: EBLVMs are a very interesting model class and the authors correctly note that they their study has been greatly limited to structured models due to a lack of scalable training methods. This approach constitutes the most scalable method to date (as far as I am aware) in training this class of models without highly structured energy functions.
The bi-level optimization is fairly unintuitive but the author's detailed analysis of the method should convince skeptical readers that this is a sensible objective to optimize and should arrive at reasonable solutions.
The results on image data are compelling and convincing, generating high quality samples. I would be very interested in a more detailed analysis of the benefits of the latent variable. Perhaps some conditional generation? Infer h given a test image and sample new v given h? This class of model opens up a wealth of interesting possibilities and applications that I am excited about.

__ Weaknesses__: The authors neglect to compare to probably the 2 most related works I am aware of. The authors briefly mention variational noise contrastive estimation which can also be used to train models like those presented in this work. While this method has not yet been shown to scale to high dimensional image data it should be used as a comparison for the toy data at the very least.
This work: "Variational Inference for Sparse and Undirected Models" Ingraham & Marks provides a method for parameter inference in EBLVMs. This method could also be used for comparison but at the very least should be included in the related work.
The proposed method requires 2 inner loop optimizations (N x K) for each model gradient update. This will result in a slow down compared to comparable methods such as regular score matching. In the appendix there is a time comparison of various configurations of N and K but there is no comparison provided between standard score matching, CD, PCD, or DSM. This should be provided to ensure the reader that the method is not prohibitively costly.
The section on deep EBLVMs lacks sufficient motivation. The ability to learn useful low-dimensional representations for high dimensional data is a unique feature of this class of model and should be discussed. I was surprised to find results on this were included in the supplement. I highly recommend placing these results into the main text as they are very interesting!
The main quantitative measure for performance presented here is the fisher divergence. I find this to be somewhat weak. FID and IS are acceptable for MNIST and CIFAR but AIS/RAISE could be used to produce likelihood estimates. Especially on the GRBM models this is quite tractable.
It seems the greatest weakness of the approach is the requirement of the bi-level inner-loop optimization. While I would not fault the authors for this, I feel like the trade-offs of the approach and some potential future directions should be discussed.
The model architectures used should be more carefully described. The energy function is described but the inference model is not at all. This makes reproducibility not possible. Please add this.

__ Correctness__: I feel the author's theorems are sufficient to demonstrate that the algorithm should properly train the models. The bias bound in theorem 2 is very nice but clearly depends on ||\phi^n - \phi^0||. Some empirical results demonstrating that this quantity is small for the models trained would be nice as I have no idea on the range of these values and how they will impact the bias. A small experiment demonstrating that this value is small and giving a numerical bound on the bias in these experiments would greatly strengthen the claims made here.

__ Clarity__: The paper is relatively clearly written but some details are missing. Nowhere in the main text of the paper is their variational approximation explained. Was this a parameterized gaussian as in VAEs? I also found the notation some what confusing as the name q was used to represent the data distribution as well as the approximate posterior.
I also found the notation around \phi confusing. When does \phi become \phi^0? Does the state of \phi persist across mini-batches? I believe the proposed algorithm is as follows:
for K steps do sgd on \phi
set \phi^0 = \phi
for N steps do sgd on \phi^0 to result in \phi^N
update \theta using \phi^0, \phi^1,..., \phi^N
so this implies that K steps of sgd are done to \phi in each mini-batch.
If this is incorrect I suggest clarifying further in your paper and algorithm boxes.

__ Relation to Prior Work__: The authors carefully go over prior work and fit their contribution within this prior work well.

__ Reproducibility__: No

__ Additional Feedback__: Given that the EBLVM models do not outperform the EBMs without latent variables it would much more clearly motivate the work to have some experiments which demonstrate some of the nice properties of EBLVMs which utilize the latent variables.
Please see my suggestions for edits int the Weaknesses and Claims sections for advice on how to improve this paper.
------------------------
After rebuttal and discussion
------------------------
While I still feel this method is quite complicated, I do believe the authors have addressed many of my concerns and demonstrated that my understanding of their work was indeed correct. I have decided to raise my score but keep my confidence level the same. I strongly suggest the authors try to improve the clarity of the algorithm for a latter version of this paper.

__ Summary and Contributions__: This paper proposes the bi-level score matching (BiSM) to learn energy-based latent variable models. Specifically, it reformulate a score matching objective as a bi-level optimization problem, which introduces a variational inference to compute posterior distribution. Experiments on Gaussian restricted Boltzmann
machines and EBLVM parameterized by deep neural networks shows promising results of the proposed framework. The contributions are on those theoretical understanding of the algorithm, and the bi-level score matching concept.

__ Strengths__: The paper provides multiple theoretical understanding of the proposed algorithm. The problem this paper tries to address is very relevant to the NeurIPS community. The bi-level score matching is somewhat novel but not that much.

__ Weaknesses__: Two major weakness: (1) The method don't have sufficient quantitative or qualitative experimental results to verify the model. (2) The important competing baselines are missing. Only two methods are used for comparison in Table 2. Some relevant method, such as [a] that also learn latent EBM witj VAE, is missing. (3) A lot of important prior works about energy-based models with modern deep net as the energy functions are missing in the related works. For example, Xie 2016 [b] is the first paper on MLE of modern ConvNet-parametrized energy-based model. Also see [c] [d] [e]. The current references for EBM in the paper are not complete.
[a] Han et al. CVPR 2020. Joint Training of Variational Auto-Encoder and Latent Energy-Based Model.
[b] J Xie et al. ICML 2016. A theory of generative ConvNet. International Conference on Machine Learning.
[c] J Xie et al. CVPR 2017. Synthesizing Dynamic Pattern by Spatial-Temporal Generative ConvNet.
[d] J Xie et al. CVPR 2018. Learning Descriptor Networks for 3D Shape Synthesis and Analysis.
[e] Wu. Sparse and Deep Generalizations of the FRAME Model
--Annals of Mathematical Sciences and Applications (AMSA) 2018

__ Correctness__: The empirical methodology seems correct to me.

__ Clarity__: The paper is relatively clear and easy to follow. It is also well organized.

__ Relation to Prior Work__: The paper misses a lot important related works from the computer vision community. For example, [a] latent EBM with VAE training which is related to the current paper. Pioneering works about EBM with energy functions parameterized by modern deep net are missing. See [b] [c] [d] [e] [f] . For example,
[b] is the first paper on maximum likelihood learning of modern ConvNet-parametrized energy-based model.
[c] is the first paper to give adversarial interpretation of maximum likelihood learning of ConvNet-EBM.
[f] is the paper to jointly train EBM and generator.
Due to the lack of a body of related works about EBM and latent EBM, the current paper is not in a good condition.
---------------
[a] Han et al. CVPR 2020. Joint Training of Variational Auto-Encoder and Latent Energy-Based Model.
[b] J Xie et al. ICML 2016. A theory of generative ConvNet. International Conference on Machine Learning.
[c] J Xie et al. CVPR 2017. Synthesizing Dynamic Pattern by Spatial-Temporal Generative ConvNet.
[d] J Xie et al. CVPR 2018. Learning Descriptor Networks for 3D Shape Synthesis and Analysis.
[e] Wu. Sparse and Deep Generalizations of the FRAME Model
--Annals of Mathematical Sciences and Applications (AMSA) 2018
[f] Cooperative Learning of Energy-Based Model and Latent Variable Model via MCMC Teaching (AAAI) 2018

__ Reproducibility__: Yes

__ Additional Feedback__: I suggest authors to revise their paper by (1) completing the related works of EBMs. (2) improve their performance of the method (3) compare with other latent EBM models.
After rebuttal: After reading the reply from the authors, I decide to increase my score. Please follow the comments in your revision to complete the related works about the energy-based modeling, as well as adding relevant baseline comparison in the experiment. For example, I strongly suggest you to include a comparison with [b] i.e., EBM with vanilla ConvNet, (your current comparison with [10] in your paper is a EBM with fanny residual net energy), [a] VAE+latent EBM, and [f,g] CoopNets, which is cooperative learning of EBM with a generator. For your convenience, the FID score for Cifar 10 reported in [g] for CoopNets is 33.61.
=============
[a] Han et al. CVPR 2020. Joint Training of Variational Auto-Encoder and Latent Energy-Based Model.
[b] J Xie et al. ICML 2016. A theory of generative ConvNet. International Conference on Machine Learning.
[f] Cooperative Learning of Energy-Based Model and Latent Variable Model via MCMC Teaching (AAAI) 2018
[g] Cooperative Training of Descriptor and Generator Networks. (TPAMI) 2018