__ Summary and Contributions__: The paper discusses algorithmic guarantees for the learning of RBM models according to sample size and some measure of the complexity of the RBM models. The paper discusses both the unsupervised learning and the supervised learning case, for the latter an algorithm for learning an RBM is also proposed.

__ Strengths__: The paper adresses the problem of learning a powerful class of generative models improving on previous works which required stronger assumptions on the parameters of the RBM being learned. The paper explains clearly its reasoning: e.g. drawing parallels with two layer neural networks and conditional expectations in RBMs, or explaining the need for a complexity measure to characterize the hardness of the learning from a RBM.
An algorithm derived from the theory to learn a RBM encoding a classification task is furthermore tested on a real dataset and shown to achieve comparable performance to algorithms with related levels of complexity.

__ Weaknesses__: The main text is very dense given the page limit and some informal statements are not entirely clear (see questions below).

__ Correctness__: Given the allowed time, I could not check the proofs.
The method for the proposed experiment seem correct.

__ Clarity__: Overall the paper is well written.
The main text is quite dense but it is unclear that this could be improved given the page constraint.
There are a few places where some variables lack definitions (see details below).

__ Relation to Prior Work__: The paper clearly explains the context of the contribution in the introduction and further details relation to prior works in a dedicated section.

__ Reproducibility__: Yes

__ Additional Feedback__: Definitions:
- Line 130-131 theorem 2, the loss function \ell (logistic?) is not defined in the main text.
- In Example 1 and 2 page 5 and line 218, what is the definition of \eta? The maximum \eta such that the RBM has all of its pairs \eta-degenerate? This is not entirely clear.
- Page 5 line 211 - Is the parameter d defined ?
References:
Page 2 line 57 - “a natural generalization of fully-observed Ising models, for which the learning problem is well understood.” Could the authors provide corresponding references?
Typos:
- Abstract line 4 - the most well studied - the best studied
- Page 2 line 38 - “statistical” <- statistically
- Page 4 line 159 - “the each”
- Page 4 line 164, “(e.”
*****
I have read other reviews and authors feedback, which confirms my positive assessment of the paper.
The authors answered to the clarification I was requesting. I am hoping they will take advantage of the additional page to make sure they implement these clarifications in the text.

__ Summary and Contributions__: This paper proves several learning guarantees for two-layer feedforward neural networks and for Restricted Bolzmann Machines, gives nearly matching lower bounds (under hardness assumptions), and explores connections between the two settings in a rigorous manner. Specifically, the authors give an algorithm for structure learning of RBMs under a reasonable-seeming assumption generalizing ferromagneticity. This is proved via a new guarantee for learning single-hidden-layer neural networks under L^infty bounded inputs. Finally, the authors give guarantees for a "supervised RBM" classification algorithm, where the label is viewed as one of the visible units in an RBM, under assumptions about the weights in this RBM.

__ Strengths__: This paper gives new algorithms with strong guarantees for basic neural network models. By exploring connections between single hidden layer feedforward networks and RBMs, the authors introduce new analytic techniques in this setting. This is a solid contribution.

__ Weaknesses__: Theorem 6 doesn't seem as well motivated. Its assumptions seem strong and very specialized, and the experimental implementation is (unsurprisingly) lackluster in its performance. And we already have lots of guarantees for training single-hidden layer NNs, although this paper gives better bounds under some assumptions.

__ Correctness__: I have no concerns about correctness

__ Clarity__: The writing is excellent. I would suggest dropping the poorly motivated experiments section to make room for the introduction section of the appendices, which gives a clearer roadmap of the proofs.

__ Relation to Prior Work__: The authors do a good job of putting Theorem 2 in context. I am less familiar with the RBM literature, and the authors don't seem to have done as much to give context to Theorem 4. More discussion there would be appreciated.

__ Reproducibility__: Yes

__ Additional Feedback__: I am grateful to the authors for their response. I hope they include some additional context for Theorem 4 in the final version.

__ Summary and Contributions__:
The paper considers learning Restricted Boltzmann Machines (RBMs). The assumption is that the RBM has bounded normed weights, and non-degenerated correlations between any two-hop neighboring visible units. This is different from previous works that either assume ferromagneticity or bounded max hidden degree. The goal is structure learning (that is, recovering the Markov Blanket of the visible units, which can then imply say distributional learning). The analysis is done via a connection to feedforward networks: predicting one visible unit from the others can be done via a two-layer feedforward network. It then gives a learning method to learn those networks (with nearly optimal sample and runtime complexity under the conjectured hardness of sparse parity with noise). Using the connection, the paper also provides the theoretical study of supervised RBMs, giving an algorithm for learning a natural class of supervised RBMs, and gives empirical supports.

__ Strengths__: + The topic is interesting. RBMs are probably the most well-known neural/graphical models for unsupervised learning. Theoretical analysis will provide better insights into this learning architecture.
+ The connection from RBMs to feedforward neural networks is interesting, though it is not very complicated. This will allow borrowing tools from the latter area to the former, as did in this paper.

__ Weaknesses__: - The learning method for the feedforward networks is by l1-regularized regression over monomials. This is not connected with popular practical algorithms for RBMs. Can the authors comment on any insights gained for the practical algorithms from the connection between RBMs and the feedforward networks?
============================after response====================
The authors has answered my quedstion on relation/implications to practical algorithms for training RBMs. It doesn't completely address my concerns, but it gives good reasons for the study in the paper, so I increased the scores to 7.

__ Correctness__: I think the theoretical part is correct, though I only verify the key lemmas.

__ Clarity__: Yes. But there seem to be some typos, eg,
-- Line 100: what are n_V and n_H?
-- Theorem 2: d should be D?
-- X_{~i} should be X_{~i,j} in the math display in Definition 1

__ Relation to Prior Work__: Clear.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: Post Rebuttal: Thanks for clarifications. I am changing the score to 7.
The paper gives a new algorithm for learning the structure Restricted Boltzmann Machines (formalized using Markov blankets), which is claimed to work for larger parameter regimes than the previous work. This is done by considering the problem of predicting the spin of a node given the spins of all other nodes. This dependence is shown to be given by a one-hidden layer neural net (with somewhat non-standard activations). An algorithm for learning this network is given based on polynomial approximation of the neural net and using regression on degree-D monomial feature map (with \ell_1 constraint). The algorithm works under L_\inf constraint on the input vector which is different from the past work.
Given the above algorithm for learning the dependence of one node on the rest, under suitable non-degeneracy conditions, an algorithm is given for learning the structure (Markov blanket) of the RBM.
Nearly matching lower bounds are provided (under hardness assumptions or in the SQ model).
The reduction to neural networks is also used for learning supervised RBMs, which can be thought of as a neural network under distributional assumptions on the data (in terms of "sparsity and nonnegative correlations among the input features 307 conditional on the output label"). This distributional assumptions seems to be new.

__ Strengths__: The algorithm is conceptually simple, and in terms of the chosen complexity parameters (in terms of \lambda_1, \lambda_2) the performance guarantees are nearly tight.

__ Weaknesses__: The main result is claimed to "often" expand the parameter regimes for which the present algorithm works compared to the previous work. There does not seem to be a clean way to state how the present work improves over the previous work. There's a discussion in the paper (paragraph starting at Line 209, and more in the full paper) mentioning situations where the algorithm of the present paper provides much better guarantees than the past work (these are formulated in terms of Dobrushin condition). I am unable to independently evaluate how much of an advance this constitutes over the past work because I don't know how significant this regime is for the present context of RBMs.
Is the choice of complexity parameters in Definition 2 dictated by the proof techniques? Do they have any other significance?
The results on supervised RBMs are not particularly good in experiments, presumably because the distributional assumption are not satisfied by the data.

__ Correctness__: The proofs appear to be correct though I didn't check all details.

__ Clarity__: Generally well-written.

__ Relation to Prior Work__: Well done.

__ Reproducibility__: Yes

__ Additional Feedback__: Line 191: (since general RBMs can represent arbitrary distributions)
I don't see why this is true. The RBM as defined below line 41 is clearly restricted in what distributions it can represent.
"We also note that the parameters of the RBM are not identifiable even given 58 an infinite number of samples, so our goal for learning the RBM is generally speaking to learn the 59 distribution or related structural properties (e.g. the Markov blankets of the nodes in X)."
Please give a reference or prove it.