__ Summary and Contributions__: Update after author rebuttal phase: I stand by my initial evaluation of the paper, and recommend acceptance.
--------------
This paper studies the problem of learning discrete factor models from data. The authors propose a condition that allows them to provide guarantees on the model selection and parameter estimation. They also propose an algorithm for learning the factor graph based on interaction screening. Finally, the authors propose a specialization of this algorithm that can perform structure and parameter learning for graphical models.

__ Strengths__: - The theoretical results are sound and interesting. This seems like the natural generalization of a line of work based on interaction screening for understanding model selection in discrete graphical models.
- this is a significant problem of interest to the NeurIPS community
- The conditions that the authors propose -- local learnability -- seems general and specializes in a way that seems "right"

__ Weaknesses__: nothing significant

__ Correctness__: The claims seem correct to the best of my reading

__ Clarity__: The paper is generally well written. I have some minor comments:
- The authors should include a more detailed discussion on Condition C1 in the main text of the paper given its centrality to the paper.
- the authors claim that the proposed framework includes all models previously considered in the literature as special cases, and that their analysis shows a systematic improvement in sample complexity. These comparisons shouldn't be relegated to the appendix, and should be included in the main text -- preferably as a table.

__ Relation to Prior Work__: The relation to prior work is clearly discussed -- it will help if some of this material is moved from the appendix to the main text.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper presents a general method for learning undirected discrete probabilistic graphical models with provable and somewhat tight upper bounds on the required number of independent samples and computation time. Specifically, the work generalizes the so-called interaction screening method, originally introduced for pairwise binary models.
ADDED: Thanks for the rebuttal. I found the explanation in Supplement D.1 for the efficient verifiability of the identifiability condition hard to follow. I presume the needed values are given by Eq. (82). It is not at all clear to me why the formula in Eq. (82) is efficient to evaluate. (Eq. (84) only gives a lower bound and cannot, of course, not replace the exact formula.)

__ Strengths__: + Extends and improves previous results.
+ A challenging research problem studied by several research groups.

__ Weaknesses__: - The work is purely theoretical. It is not clear whether the presented algorithms could be feasible in practice.
- The presentation is sometimes hard to follow. E.g., Lines 60-61 seem to promise that the indentifiability condition can be verified in an efficient way, but it was difficult to figure out what this way actually is.

__ Correctness__: I could not spot any errors.

__ Clarity__: There are minor problems in writing style. Otherwise the authors seem to have done good job in coordinating the reader through the material.

__ Relation to Prior Work__: Prior research is well discussed, as far as I can tell.

__ Reproducibility__: Yes

__ Additional Feedback__: Minor comments:
- There are some bad notational choices (in my opinion), e.g., K_i in bold and underlining for vectors.
- "[p]" is not defined.
- In Def 1, what is alpha and who specifies it? Why absolute error, not relative error?
- I'd recommend the lazy citation style, e.g., "analyzed in [17]", by proper phrases, e.g., "analyzed by "Vuffray et al. [17]"; the sentence should make sense even after removing the object "[17]".

__ Summary and Contributions__: This paper considers learning the general discrete graphical model. It is an extension of [17] to general basis function.

__ Strengths__: Theoretical results look good and convincing

__ Weaknesses__: It seems like the main difference compared to [17] is to use general basic function in equation (11), instead of focusing on Ising model as in [17]. Other than this, the structure of the paper is similar to [17], e.g. the Local learnability condition corresponds to the restricted strong convexity, e.g. the gradient concentration.
The value \hat\gamma in (5) is claimed as the prior information on the parameter (line 271). Based on my understanding it plays a similar role as \lambda in [17]. Is there any particular reason to switch from L1 regularization in [17] to L1 constrained optimization here? Is that possible to provide an order for \hat\gamma? (Theorem 1 in [17], lambda can be chosen as \sqrt(log p/n), a standard order for high dimensional problem).
In all, personally, I feel like given [17], the novelty of this work seems a bit weak. The overall model is not changed that much, and the tools used for analysis also looks similar.

__ Correctness__: Seems so

__ Clarity__: The notation is a bit heavy. For example, the underline is used throughout the whole paper. Is that really necessary?

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper presents a structure and parameter learning framework (and algorithm) for graphical models with discrete domains. The work builds on and generalizes an existing framework called Interaction Screening for Ising Models [Vuffray et al. NeurIPS 2016]

__ Strengths__: One main difficulty with learning undirected graphical models, for the general case, is the identifiability issue (potentials can be parameterized in infinitely many ways) coupled with combinatorial structure identification. My understanding is that the theoretical elements presented in the paper aim to constrain this issue (correct?). For example, the parameterization and structures are heavily regularized. In addition, the aim is to learn the so-called canonical form as alluded in equation (21).

__ Weaknesses__: For the general case, this approach limits the expressibility and utility of the learned model under the canonical form.
Although the goal of this paper is to advance our theoretical understanding of learning for (undirected) graphical models (which is fine) my wish (or preference) would be to see some empirical results.

__ Correctness__: They appear correct to me.

__ Clarity__: Paper is well written, but very dense.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: In Algorithm 1, is L an input parameter? If so would it not be difficult to tune?
There are many constraints and parameters to identify to make the learning algorithm practical. Also, the parameters in the Theorems are not self contained.
----- update after authors' response -----
I have read through the authors' feedback and reviews. My review remains unchanged.