__ Summary and Contributions__: The paper shows that the NOTEARS approach to learning linear SEMs rarely meets KKT conditions. Also alternative approaches to ruling out cycles (absolute value) are considered. Informed by theoretical connections between KKT conditions and edge absences, a local search method is presented (which can use a NOTEARS solution as a starting point). This method improves on plain NOTEARS.

__ Strengths__: Useful negative results are presented which inform a new and empirically successful algorithm.
The NOTEARS acyclicity constraints are generalized (and empirically assessed).
The authors are forthright about what still remains unclear.

__ Weaknesses__: It is difficult to work out what leads to better or worse SHD results. We have choices for algorithm but also F (always squared error here), thresholding, acyclicity constraint and even centering (which makes a surprisingly big difference). Also the results for plain NOTEARS are different from an earlier paper, which is odd.
Given that this is learning by constrained optimization it would be good to more clearly separate out (i) the effect of choice of objective (and constraints) and (ii) the effect of the failure to solve the constrained optimization problem to optimality (i.e. how good is our local optima?)

__ Correctness__: Yes. At least I did not find any errors.
Comparison is only to NOTEARS variants and GES (FGS implementation) and CAM, but we are assured that nothing else is worth considering. I will just have to trust the authors here.

__ Clarity__: The paper stresses the "fully continuous" nature of the NOTEARS approach but with thresholding I think this is an overstatement. The -KKTS methods presented here are not "fully continuous". It doesn't matter whether or not we have a fully continuous method, we just care whether it works.

__ Relation to Prior Work__: The whole paper is about a development of the basic NOTEARS approach.

__ Reproducibility__: Yes

__ Additional Feedback__: It is written: "This section analyzes the continuous optimization
problem of minimizing a score function F (W ) subject to the
acyclicity constraint h(A) = 0 for any h defined by (1)". We see in
the supplementary material that in the experiments presented here
"least-squares loss" is always used as the "F". Is there not a danger
of overfitting here? Perhaps because both NOTEARS and NOTEARS-KKTS
only find local optima, this optimization failure is actually helpful
and is providing a kind of overfitting-avoidance. Or perhaps somehow
working out a good threshold value (here 0.3) to convert weights into
edges is what matters.
AFTER DISCUSSION/FEEDBACK
Thanks to the authors for their feedback. This paper indeed raises a lot of as-yet-unanswered questions. This is perhaps not such a bad thing.

__ Summary and Contributions__: The paper builds on an existing method for learning Bayesian networks from data (NOTEARS) using continuous optimisation. The paper argues that there are theoretical problems with convergence to optimality in the existing method and derives a related method that does not have this problem. Experimental evaluation is presented. This shows that the novel method is actually rather poor in practice - however, as part of a hybrid algorithm, it outperforms the existing state-of-the-art.

__ Strengths__: The paper provides a clear explanation of the theoretical problems with the existing NOTEARS algorithm, and explains how the method should be altered in order to remove this problem.
The presented technique (when used as a post-processing step) gives a significant improvement in accuracy over the current techniques of this type. Results appear to be about a factor of 2 better in most cases. This is achieved without a huge increase in runtime.
Some of the theoretical insight into the problem domain is a potentially useful side effect for others looking to develop alternative techniques in this area.

__ Weaknesses__: The paper essentially fixes a problem with an existing approach. While this is definitely worthwhile and results in an algorithm that is both theoretically and practically superior, this is possibly less interesting than a completely novel technique.

__ Correctness__: The theoretical results appear to be correct to the best of my ability to understand them. The experimental method followed is also technically sound - the fact that it largely replicates experimental settings already presented in the literature is a definite plus.

__ Clarity__: The paper is written very clearly throughout. It is surprisingly readable for a paper with lots of theoretical results and the fact that the theorems involved are linked back to the graphs where possible greatly helps the understanding of the material.

__ Relation to Prior Work__: The entire paper is based around issues with existing approaches and so the previous work in the area is properly explained and the novel contribution of this paper is clearly given. Other less directly related work is briefly mentioned but not really explored.

__ Reproducibility__: Yes

__ Additional Feedback__: The final proposed method for post-processing the results from the NOTEARS algorithm is reminiscent of hybrid Bayesian network learning techniques, where one algorithm is used to learn a network which is then used as the starting point for a local search by another algorithm (e.g. Max-Min Hill Climbing). It might be worth drawing this comparison.
===================================
UPDATED - I read the other reviews and the authors' response. This has not substantially changed my opinion of the paper.

__ Summary and Contributions__: The paper advances the "NOTEARS method" for finding high-scoring Bayesian networks. Specifically, a reformulation is given so as to satisfy the KKT conditions for local optima. The development is shown to yield local search algorithms that can improve upon the performance of NOTEARS in terms of structural error (SHD) and computation time.

__ Strengths__: + Structure learning in Bayesian networks is a challenging and very well motivated machine learning problem.
+ The paper improves the understanding of the strengths and weaknesses of the NOTEARS approach.

__ Weaknesses__: - The paper is fairly incremental, developing a single heuristic local search method (namely NOTEARS that enjoys no non-trivial performance guarantees).
- I find the claim that NOTEARS and FGS outperform earlier methods (Line 276) questionable. E.g., for moderate numbers of nodes d, as studied in the present paper, Ref. 2 (Table 1, p. 2306) shows that MMHC and PC perform much better than GES (here FGS). This is critical for the overall positioning of the empirical results in the context of the state of the art. (The same weakness concern also the original NOTEARS work and many follow-up works.)

__ Correctness__: As said above, viewing NOTEARS as a representative of the state of the art is questionable.

__ Clarity__: The paper is generally well written, given the limited number of pages.

__ Relation to Prior Work__: The relation to NOTEARS (which is the main point of the work) is well discussed.

__ Reproducibility__: Yes

__ Additional Feedback__: -- Added after the discussion phase --
Thanks for the rebuttal. Let me clarify that, to me, achieving a local optimum does not count, as such, as a "non-trivial performance guarantee". That said, I realize that the paper does contribute an efficient means for significantly improving the empirical accuracy wrt SHD, not just within the somewhat narrow context of NOTEARS (local search heuristic for linear models with ad-hoc post-pruning instead of optimizing one principled scocing function), but also for other methods, such as GES, FGS, MMHC.
It was most likely my mistake to think that FGS is just much faster (and thus less accurate) than the old GES; now I believe FGS is indeed also much more accurate than GES wrt SHD. Thanks for the additional results.

__ Summary and Contributions__: The paper introduces novel insights on the topic of learning BNs through a fully continuous optimization formulation.

__ Strengths__: The introduced novel formulation is indeed interesting, and the experimental results are promising.

__ Weaknesses__: The problem in the paper is that it fails in showing the actual scope of the new results, especially in the global context of BNs learning. In fact their methods apparently can only applied to the continuous case: no mention is ever made if the same method can work with categorical variables. This is reflected to the selected set of "state-of-the-art" methods against which they compare their methods, that is a narrow subset of the whole literature on BNs learning.
Saying something like "As mentioned, this paper is most closely related to the fully
continuous framework of ... " is definitely not enough: a more precise and thorough description of the limitations of this work, and its position in the whole BNs learning literature, is needed.
The title and the abstract should modified as well with same reasoning.

__ Correctness__: The methodology is sound.

__ Clarity__: The paper is clear and succint. The language is correct.

__ Relation to Prior Work__: The relation to previous work is clearly exposed.

__ Reproducibility__: Yes

__ Additional Feedback__: