__ Summary and Contributions__: The paper is devoted to the problem of Bayesian learning of Gaussian Bayesian networks. The authors propose several improvements within the MCMC framework for learning the graph, in order to apply the algorithm to larger systems. They reduce the cost of the algorithm by removing unnecessary steps, by computing some quantities more cleverly, and by replacing full maximization with heuristics.

__ Strengths__: The strongest part is that the authors significantly reduce the cost of full Bayesian inference for DAGs.

__ Weaknesses__: The novelty of the paper is very limited. The authors concentrate on computational tricks, trying to improve the scalability of the algorithm, and they achieve some success. However, for a NIPS paper I would expect not only an improved implementation of the algorithm but also some new concepts, and I did not find any new ideas in that sense. In more detail:
The novelties in the GADGET algorithm are:
- in the preprocessing step, the authors reduce the search space by bounding the size of possible parent sets, and introduce a more sophisticated way to store and precompute the partial scores required later.
- in the MCMC step, the authors modify the algorithm of Kuipers et al. by dividing it into two stages: first run MCMC only to obtain ordered partitions, and postpone the DAG-sampling step to post-processing.
- in post-processing, a DAG is sampled in the classical way: first by adding compelled edges, and then by sampling the rest according to the posterior probabilities.
Algorithm 2 is a straightforward consequence of Algorithm 1, except for a new formula for causal effects.
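The two-stage scheme described above (MCMC over orderings first, DAG sampling deferred to post-processing) can be sketched roughly as follows. This is my own toy illustration: it uses full node orderings rather than GADGET's ordered partitions, and `local_score`, `MAX_PARENTS`, and the chain length are placeholder assumptions, not the paper's actual BGe scoring or implementation.

```python
import itertools
import math
import random

random.seed(0)
NODES = [0, 1, 2, 3]
MAX_PARENTS = 2  # assumed bound on parent-set size, as in the preprocessing step

def local_score(node, parents):
    # Placeholder local score; a real implementation would use e.g. BGe scores.
    return -0.5 * len(parents) + 0.3 * sum(1 for p in parents if p == node - 1)

def order_score(order):
    # Log marginal score of an ordering: for each node, marginalize over all
    # parent sets drawn from its predecessors, up to size MAX_PARENTS.
    total = 0.0
    for i, v in enumerate(order):
        preds = order[:i]
        sets = [s for k in range(MAX_PARENTS + 1)
                for s in itertools.combinations(preds, k)]
        total += math.log(sum(math.exp(local_score(v, s)) for s in sets))
    return total

# Stage 1: Metropolis-Hastings over orderings via adjacent transpositions
# (a symmetric proposal, so the acceptance ratio is just the score ratio).
order, samples = list(NODES), []
for step in range(2000):
    i = random.randrange(len(order) - 1)
    prop = order[:]
    prop[i], prop[i + 1] = prop[i + 1], prop[i]
    if math.log(random.random()) < order_score(prop) - order_score(order):
        order = prop
    if step % 100 == 0:
        samples.append(order[:])

# Stage 2 (post-processing): sample a concrete DAG for a stored ordering,
# choosing each node's parent set with probability proportional to its score.
def sample_dag(order):
    dag = {}
    for i, v in enumerate(order):
        preds = order[:i]
        sets = [s for k in range(MAX_PARENTS + 1)
                for s in itertools.combinations(preds, k)]
        weights = [math.exp(local_score(v, s)) for s in sets]
        dag[v] = random.choices(sets, weights=weights)[0]
    return dag

dag = sample_dag(samples[-1])
print(dag)  # maps each node to its sampled parent set
```

The point of the split is that Stage 1 only ever evaluates ordering scores, so the expensive per-DAG bookkeeping is paid once per stored sample in Stage 2 rather than at every MCMC step.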

__ Correctness__: All results seem to be correct.

__ Clarity__: I do not have major objection about the clarity of the paper.

__ Relation to Prior Work__: The existing results and methods are well described and the authors clearly explain their contribution.

__ Reproducibility__: Yes

__ Additional Feedback__: #UPDATE#
After reading the other reviews and the rebuttal, I can agree that improving these algorithms in terms of computational cost is important. For that reason, I decided to increase my score.

__ Summary and Contributions__: The authors give optimizations for a certain kind of MEC learning of discrete models--specifically, methods that score DAGs, looking for the optimal score, over a sequence of DAGs generated by a Markov chain, where the Markov chain is constructed in various ways. The main way to optimize this procedure, as I read the article, is to improve the Markov chain procedure itself--for example, by limiting the potential parents of a node to a set of K candidates. There are other optimizations as well--for instance, tricks to save memory. The contribution of the work is to increase the number of nodes of a problem that can be analyzed this way.
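The kind of candidate-parent restriction mentioned above can be sketched as below: for each node, keep only its K most strongly associated peers as potential parents. Ranking by absolute sample correlation is my own stand-in for illustration; the paper's actual selection heuristics (e.g. GREEDY, BACK & FORTH) work differently.

```python
import math
import random

random.seed(1)
n, d, K = 200, 6, 3  # samples, variables, candidate-set size (illustrative)

# Toy data with chain-like dependence so that correlations are informative.
data = [[random.gauss(0, 1)] for _ in range(n)]
for row in data:
    for j in range(1, d):
        row.append(0.7 * row[j - 1] + random.gauss(0, 1))

def corr(xs, ys):
    # Plain Pearson sample correlation.
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

cols = [[row[j] for row in data] for j in range(d)]
candidates = {}
for i in range(d):
    scored = sorted((abs(corr(cols[i], cols[j])), j)
                    for j in range(d) if j != i)
    candidates[i] = [j for _, j in scored[-K:]]  # keep the K strongest
print(candidates)  # node -> list of K candidate parents
```

Restricting each node to K candidates shrinks the per-node parent-set space from 2^(d-1) subsets to at most 2^K, which is what makes the Markov chain tractable at larger d.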

__ Strengths__: The main strength of the work is in making explicit some of the tricks that one might use to optimize the Markov chains of DAGs considered by this kind of procedure. This will be relevant to other people trying to use these methods who are facing similar problems.
Two particular heuristics are identified for larger numbers of nodes, GREEDY and BACK & FORTH, which do particularly well.

__ Weaknesses__: Although it is hinted at in the experimental section, it is not explicitly stated that methods exist (possibly not performing as well) to infer an MEC without using the above Markov chain procedure for stepping through DAGs. Two are mentioned, PC and GES, which use independence tests or scores over discrete variables. Whatever their accuracy, both of these methods are highly scalable, far more scalable than the TABU-style searches discussed here. To be fair, this should be stated, and the explicit reason for discussing TABU-style searches indicated, since these are in fact, at least theoretically, ways of achieving scalable MEC-style analyses for discrete datasets.

__ Correctness__: The empirical methodology is excellent. This of course is mostly in the supplement, which is very clear.
The claims all appear to be correct.

__ Clarity__: Other than the framing of it, it's fine.

__ Relation to Prior Work__: See above.

__ Reproducibility__: Yes

__ Additional Feedback__: I thought the author feedback was good. I'm very supportive of the idea that we need to improve the performance of this particular algorithm, or class of algorithms, since I find that it also has advantages in accuracy.

__ Summary and Contributions__: This paper considers Bayesian inference for causal graphs with observational data. The first contribution is on speeding up a DAG posterior MCMC sampler. It builds upon a recent work (Kuipers et al. 2018) and proposes algorithmic improvements to reduce the space and time complexity of the algorithm. Then it discusses the problem of selecting the best candidate parent sets, which helps posterior coverage under a limited budget. An exact but expensive algorithm is provided, and so a number of heuristic approximations are listed and compared. Finally, an alternative to IDA is proposed for Bayesian estimation of causal effects in the linear Gaussian model. Overall I feel the paper is a solid collection of results.
======
Thanks for providing additional explanations. I think the computational issue is important in Bayesian methods for BNSL, so I'll keep my current score. I'm looking forward to the promised updates in the final version.

__ Strengths__: Algorithmic improvement in multiple components of the Bayesian inference pipeline.

__ Weaknesses__: Contribution in each individual part is relatively insignificant.

__ Correctness__: Appears to be but did not check thoroughly.

__ Clarity__: This paper is well written.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: - (Kuipers et al. 2018) allows one to have an extra parent outside the candidate parent set. It would be nice to have a discussion on this in the paper.
- Since this is an MCMC paper, a mixing plot would be nice to show how fast the chain converges.
- What value of K did different methods use?
- How much benefit does large K bring in estimating causal effect?
- The paper claims "can handle hundreds of variables" but the largest experiment is with 104 variables. Maybe this can be technically counted as "hundreds" but I would be more conservative in the description here.
- Line 84: B is not defined, but from the context b_{ij} is the edge weight for x_j -> x_i.
- Line 91: Node 1 should be the ancestor of node 6, not the other way around.
- Line 274: Incomplete sentence.
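To make the Line 84 comment concrete: under the reading that b_{ij} weights the edge x_j -> x_i, the model is the linear Gaussian SEM x = Bx + eps, and for an acyclic B the total causal effect of x_j on x_i is the (i, j) entry of (I - B)^{-1}, since that inverse expands to I + B + B^2 + ... and sums the coefficient products over all directed paths. A minimal sketch under that assumed convention:

```python
import numpy as np

# Assumed convention from the comment above: B[i, j] = b_ij weights x_j -> x_i.
B = np.zeros((3, 3))
B[1, 0] = 2.0  # edge x_0 -> x_1 with weight 2
B[2, 1] = 3.0  # edge x_1 -> x_2 with weight 3

# (I - B)^{-1} = I + B + B^2 + ... accumulates products over directed paths,
# so off-diagonal entry (i, j) is the total causal effect of x_j on x_i.
T = np.linalg.inv(np.eye(3) - B)
effect = T[2, 0]
print(effect)  # 6.0: the single path x_0 -> x_1 -> x_2 contributes 2 * 3
```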

__ Summary and Contributions__: Like it says in the abstract there are 3 main contributions:
1. Algorithmic tricks for faster MCMC for DAGs
2. Methods for choosing a good set of potential parents for given DAG vertices
3. Application of sampling for a Bayesian measure of causal effect estimation.
Experimental results supporting the effectiveness of the presented methods are given.

__ Strengths__: Causal DAG learning remains of interest to the community. This work is more ambitious than much in this area since a Bayesian approach is taken where we must consider an entire posterior distribution over DAGs. The paper is actually *useful* in that it details key algorithmic innovations to get a method working. Useful empirical work is presented to help decide between competing reasonable choices e.g. for choosing 'good' parents.

__ Weaknesses__: The candidate-parent selection methods were heuristic rather than based on any solid theory. But they were explicitly labelled as heuristic, and experiments were used to choose between them, so this is a small weakness.

__ Correctness__: I found no technical errors and appropriate experiments were conducted.

__ Clarity__: The writing is good and I only found one typo:
p142: precompue -> precompute

__ Relation to Prior Work__: The paper explicitly builds on recent prior work and makes appropriate connections to related work. The authors also make use of existing software.

__ Reproducibility__: Yes

__ Additional Feedback__: As should be clear from the comments given above, I think this is a useful paper that merits acceptance. A worthwhile and difficult problem is attempted and a new and successful method is created to deal with it.
I would have liked more about computing the "inverse proposal probability" in Metropolis-Hastings. Perhaps this is detailed in the associated arXiv paper, but I think it needs to be mentioned here. Also, why choose randomly (I assume this means with equal probability) between moves? Is this for ease of computation, or because there is no particular reason to bias towards, say, splitting?
Robert Cowell had some software called BAIES, see Cowell 1992 in "Bayesian statistics 4", so we have a name clash unfortunately.
AFTER DISCUSSION/AUTHOR FEEDBACK
Thanks to the authors for their feedback. I remain in favour of acceptance.