__ Summary and Contributions__: Update after author rebuttal phase: I stand by my initial evaluation of the paper. I am comfortable with its acceptance at NeurIPS 20.
----------------
This paper considers the problem of learning nonparametric causal DAGs from data. The authors propose an algorithm that works in the nonparametric setting without assumptions on linearity, additivity, independent noise, and faithfulness, and establish computational and sample complexity results for this algorithm under an assumption on residual variances. They also compare the proposed algorithm to prior work via a simulation study

__ Strengths__: - The theoretical claims of the paper are sound and well justified. The empirical evaluation is extensive and well documented.
- The problem of learning causal DAGs is very relevant to the NeurIPS community.
- the results are novel to the best of my knowledge.

__ Weaknesses__: The entire paper relies on the observation in Theorem 3.1 -- which is a simple generalization of the equal variance property of linear models. This claim is reasonably straightforward to establish, and not particularly surprising. It is this fact that allows one to consider nonparametric DAGs as opposed to linear or additive models. After this is established, the rest of the paper is quite a standard combination of machinery from the DAG and nonparametric statistics literature. Given this, I think the contribution of this work is somewhat incremental.

__ Correctness__: The empirical and theoretical results are correct.

__ Clarity__: The paper is well written. There are hyperlinks that point between the supplementary material and the main document that should probably be deactivated.

__ Relation to Prior Work__: Relation to prior work is sufficiently well addressed.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__:
The paper proposes a method for learning Bayes nets with continuous variables, for the nonparametric case.

__ Strengths__:
The work is novel and relevant.
The theoretical results seem sound.
The experimental results are encouraging.

__ Weaknesses__:
Sparsity was not assumed and thus, the sample complexity is exponential with respect to number of nodes d. (Please see my additional feedback below.)

__ Correctness__:
The main claims seem correct.

__ Clarity__:
The paper is clear.

__ Relation to Prior Work__:
Relation to prior work is properly discussed.

__ Reproducibility__: Yes

__ Additional Feedback__:
Sparsity was not assumed and thus, the sample complexity is exponential with respect to number of nodes d. Please discuss how the current proofs can be modified to accomodate for sparsity, and what would be the sample complexity in this case.
Minor comment: Some of the Arxiv references are actually papers in conference proceedings.
=== AFTER REBUTTAL
I am satisfied with the authors response regarding several ways to improve the bounds with assumptions such as sparsity.
Given the above, I keep my initial evaluation of 7.

__ Summary and Contributions__: The paper gives an algorithm that runs in time polynomial in the number of variables d and that learns the data generating DAG with high probability from about O(d^d) samples. The basic idea is to sort the variables based on their estimated residual variances. Empirical results are favorable in comparison to NOTEARS and some other benchmark methods.

__ Strengths__: + Good theoretical results.
+ Appears to also have practical significance.
+ Clearly written, with a good number of examples.

__ Weaknesses__: - The large sample size needed for theoretical guarantees makes the result uninformative regarding practical sample sizes.
- The theoretical results rely on a condition on residual variances that may not hold in typical practical instances.

__ Correctness__: I could not spot any errors.

__ Clarity__: The clarity of the presentation is satisfactory.

__ Relation to Prior Work__: Relevant prior work is adequately discussed, as far as I can tell.

__ Reproducibility__: Yes

__ Additional Feedback__: ADDENDUM:
Thank you for the rebuttal. You may want to emphasize why polynomial time complexity should be of interest when the sample complexity is exponential (reading the data takes exponential time already). I agree that you have managed to remove a lot of typical assumptions in an interesting manner (even if extending previously presented ideas) -- with the cost of adding a somewhat questionable new assumption about the residual variances.

__ Summary and Contributions__: This paper presents a polynomial-time algorithm for learning directed acyclic graphs (DAGs). This paper extends work for learning graphs from data with equal variances to data where the conditional variance of X_i does not depend on i.

__ Strengths__: - The examples given in 3.1 are very helpful.
- The experiment section is strong.

__ Weaknesses__: - The title says "causal" but the main results are about DAGs. As the authors state, DAGs may be interpreted causally under additional assumptions. Is "causal" really appropriate in the title?
- I do not understand what is meant by "simultaneous statistical and computational guarantees" as stated in the introduction. Is it just that computational complexity in 3.2 and sample complexity in section 4 are both given?
- The algorithm makes just a simple change to the that of citation [5].

__ Correctness__: Yes.

__ Clarity__: Mostly yes. Some parts are confusing or potentially misleading, as described above.

__ Relation to Prior Work__: This is discussed, but it is not clear what are the assumptions needed by previous work on equal variances? Does that work assume lineariry, additivity, independent noise, or faithfulness?

__ Reproducibility__: Yes

__ Additional Feedback__: UPDATE: I have changed my review to 6 based on discussion period.