Paper ID: 1150
Title: Learning Bayesian Networks with Thousands of Variables
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents an algorithm for Bayesian network learning.

In contrast to exact methods, this method can handle settings with thousands of variables, without restricting the number of parents of each variable. The method is composed of two parts, a novel parent selection algorithm and an improvement on an existing structure optimization approach.

One of the weak points of the paper is the parent selection heuristic, because there are no guarantees that the optimal parent sets are a subset of the parent sets that are chosen.

It's shown in practice that their approach outcompetes other methods of parent selection when comparing the score functions. However, the experiments were done on real-world data sets, and thus there is no ground-truth structure to compare with.

It's not clear whether their method achieves the correct structure or not.

It would be more convincing to show that the method recovers the correct parent sets on simulated data where the true structure is known. In addition, the description of the independence selection algorithm is difficult to follow. It would have been helpful if there were some analysis of how many parent sets are explored, and how many of them end up in the closed list.

Regarding the improvement on order-based search for structure optimization, the paper presents a clever way of considering the extra acyclic structures which order-based search would have ruled out, while maintaining the same overall time complexity.

For this section, it was unclear how the initial ordering over the variables was determined. The experiments demonstrate that their method outperforms Gobnilp (an exact approach) and Order-based search for datasets with number of variables greater than 500 for structure optimization based on the BIC score.

Again, the results could be different if evaluating whether the correct structure is recovered.

Q2: Please summarize your review in 1-2 sentences
This paper proposes a novel heuristic method for learning the structure of a Bayesian network, specifically in cases where the number of nodes prevents the use of exact approaches. However, the paper is not clearly written and the experiments are not quite comprehensive.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Update: I have read the authors' rebuttal. I recommend that the authors try to enhance the presentation, should the paper be accepted.

The paper proposes a novel approximation to the BIC score that can guide the search for finding a good parent set. The benefit of the proposed approximation is discussed both theoretically and practically. The second contribution is an enhancement of an existing structure optimization approach.

In my opinion, the paper is very simple and could have been described much more succinctly and more clearly. The first contribution is nice in terms of pruning some parts of the search space. For the second contribution, do we theoretically only know that it will be at least as good as the current structure optimization?

A good paper overall. My main criticism is the unnecessarily verbose explanation for a very simple, yet useful contribution.
Q2: Please summarize your review in 1-2 sentences
The paper proposes an approximation for BIC that can guide the search for a BN structure. The paper could have been written much more succinctly. The modification of the structure optimization is incremental.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper describes tricks to scale Bayesian network structure learning to thousands of variables. This is achieved by developing new heuristics for candidate parent set identification and the subsequent order based structure optimization.

In general, the paper is clearly written and easy to read. There are issues in editing and style, but the problems do not affect readability (much). The suggested heuristics feel a bit ad hoc, so the value of the work is ultimately judged by the empirical evaluation. In this regard, I feel that the work could be better. Notably, comparisons to simple local greedy searches (which also scale well to 10000 variables) should be included or referred to.

In the following, I will give more detailed comments:

- Intro, second paragraph: "Usually". I do not know where this "usually" comes from, but I find it annoying. It also occurs in many other places in the text, and more often than not it provokes a counter-reaction. Yes, such a two-phase strategy has been used before, but to say it is "usually" done this way is a statement that has to be proven right.

- Page 2, second paragraph. It is inconsiderate to use a citation as a part of speech, as in [17]. Even for an expert reader, that requires stopping to look up the reference.

- Page 2, the BIC formula looks a bit ugly. It would have been better to write $BIC(G)=\sum_{i=1}^n BIC(X_i,\Pi_i)$ and then give $BIC(X_i,\Pi_i)$ as a displayed equation (one possible full form is sketched after these comments).

- Page 3, 3.1. The explanation makes the reader suspicious: it is exactly the non-monotonic nature of the effect of combining parent sets that makes structure learning hard. How would BIC* manage with Z = X xor Y (or, more generally, a parity function)? But, well, the experiments will tell.

- Page 5. Which data was used to produce Fig. 1?

- Page 5. While the space of orderings is indeed smaller than the space of DAGs, 1000! is still quite a big number, so calling it "convenient" is a bit of a stretch.

- Page 6, 5.1. Sample sizes would have been nice to have too; maybe report n/N in Table 1.

- Page 7, last paragraph. Gobnilp is said to be better because it is an exact solver. But it is an exact solver also when n > 500, so how can it suddenly be worse? The explanations as they stand are not very convincing.

- Page 7. How did you set the structure and parameters for random networks?
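
For reference, one possible full form of the formula mentioned above (my reading of the usual decomposable BIC for discrete variables with N samples; the paper's exact notation may differ):

$$
BIC(G)=\sum_{i=1}^{n} BIC(X_i,\Pi_i), \qquad
BIC(X_i,\Pi_i)=\sum_{\pi}\sum_{x} N_{x,\pi}\,\log\hat{\theta}_{x\mid\pi}\;-\;\frac{\log N}{2}\,(|X_i|-1)\,|\Pi_i|,
$$

where $x$ ranges over the states of $X_i$, $\pi$ over the joint configurations of $\Pi_i$, $N_{x,\pi}$ are the observed counts, $\hat{\theta}_{x\mid\pi}$ the maximum-likelihood estimates, and $|X_i|$, $|\Pi_i|$ the numbers of states and of parent configurations.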

All in all, it is a rather simple paper, but I find it interesting to scale BN structure learning to 10000 variables. I really would have liked to see a better empirical comparison with heuristic methods such as greedy local search, with sample sizes large enough to allow a large number of parents in the optimal network.
Q2: Please summarize your review in 1-2 sentences
In general, the paper is clearly written and easy to read. There are issues in editing and style, but the problems do not affect readability (much). The suggested heuristics feel a bit ad hoc, so the value of the work is ultimately judged by the empirical evaluation. In this regard, I feel that the work could be better. Notably, comparisons to simple local greedy searches (which also scale well to 10000 variables) should be included or referred to.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents a heuristic for learning Bayesian networks with a very large number of variables. The contributions are an adaptive heuristic algorithm for choosing the parent sets whose scores are computed, and an improved version of ordering-based search.

While it is good to compare the method to exact methods, the main competitors of this method are other heuristics such as greedy search, GES or MMHC and it should be compared to those, too.
Q2: Please summarize your review in 1-2 sentences
This paper introduces a new heuristic for learning Bayesian networks from very large data sets. The method seems to perform well in practice. However, it was not compared to some of the obvious competitors.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Assigned_Reviewer_2

"One of the weak points of the paper is the parent selection heuristic presented because there are no guarantees that the optimal parents sets are a subset of the parent sets that are chosen. [...] It's not clear whether their method achieves the correct structure or not."

=>Sect 5.2 presents experiments on data simulated from known networks. Although we do not report the distance between the actual network and the networks learned by the different methods, we did measure the BIC score of the learned networks, as done in the literature. Assuming all networks to be equally probable a priori, the BIC score is proportional to the posterior probability that such a network is the actual generative model. The networks learned by our novel approach have consistently higher BIC scores (thus higher posterior probability of being the correct model) than the networks learned by the competitors. Moreover, the graph of a Bayesian net (BN) simply encodes conditional independencies. How close a learned structure is to another (or to the true one) is not an obvious question; simply checking for matching arcs does not reflect the distance between the joint distributions.
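
To spell this reasoning out (a sketch of standard BIC asymptotics, not a claim specific to our method): with a uniform prior over graphs,

$$
P(G \mid D)\;\propto\;P(D \mid G)\;\approx\;\exp\bigl(BIC(G)\bigr),
$$

so ranking the learned networks by BIC approximately ranks them by posterior probability.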

"It would have been helpful if there was some analysis on how many parent sets are explored [...]"

=>We provide some hints on how the parent sets are explored in Figure 1. It shows that our method explores the highest-scoring parent sets first. Conversely, sequential ordering is unable to prioritize the highest-scoring parent sets. We can provide additional examples and further empirical results in the final supplementary material.
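
To illustrate the contrast, a generic best-first exploration of parent sets guided by an approximate score can be sketched as follows; this shows only the general idea, not the algorithm in the paper, and approx_score, exact_score and budget are hypothetical placeholders.

import heapq

def best_first_parent_sets(var, candidates, approx_score, exact_score, budget):
    # Explore parent sets for `var`, always expanding the candidate with the
    # best approximate score; `closed` collects the parent sets whose exact
    # score has been computed, up to `budget` evaluations.
    open_list = [(-approx_score(var, (c,)), (c,)) for c in candidates]
    heapq.heapify(open_list)
    seen = {frozenset(ps) for _, ps in open_list}
    closed = []
    while open_list and len(closed) < budget:
        _, parents = heapq.heappop(open_list)           # best approximate score first
        closed.append((exact_score(var, parents), parents))
        for c in candidates:                            # expand with one extra parent
            if c in parents:
                continue
            child = tuple(sorted(set(parents) | {c}))
            if frozenset(child) not in seen:
                seen.add(frozenset(child))
                heapq.heappush(open_list, (-approx_score(var, child), child))
    return closed

A sequential enumeration, by contrast, visits parent sets in a fixed order regardless of their scores.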

Assigned_reviewer_3

"Notably, the comparisons to simple local greedy searches (that scale well to 10000 variables too) should be included or referred to. [...]"

=>The networks learned by our methods have consistently higher scores than the networks learned by both Gobnilp and the ordering search of Teyssier and Koller, UAI'05 (T&K'05). The ordering search of T&K'05 is a heuristic method which improves over the methods mentioned by the reviewer: T&K'05 show that their search yields a score higher than (or equal to that of) both greedy hill-climbing over the space of graphs and optimal-reinsertion search for BN structure learning (Moore and Wong, ICML'03). Gobnilp is a state-of-the-art exact solver, and its approximate anytime results are in any case a relevant benchmark.

Our approach does not need a max in-degree (k), but its competitors do. We set a limit of k=6 to fairly assess the different methods. The limit k=6 is already much larger than current practice and the state of the art: for instance, the results reported on the Gobnilp webpage adopt k=2 when dealing with more than 60 variables.
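
For context, the core ordering-based idea (in the spirit of T&K'05, with a max in-degree k) can be sketched as below; `score` is a placeholder for a decomposable score such as BIC, and this is a schematic of the baseline, not of our improved variant.

from itertools import combinations

def best_net_for_ordering(ordering, score, k):
    # Given a variable ordering, acyclicity is free: each variable may only
    # take parents among its predecessors, so we pick, independently for each
    # variable, its best-scoring parent set of size at most k.
    structure = {}
    for pos, var in enumerate(ordering):
        preds = ordering[:pos]
        candidates = (frozenset(c)
                      for r in range(min(k, len(preds)) + 1)
                      for c in combinations(preds, r))
        structure[var] = max(candidates, key=lambda ps: score(var, ps))
    return structure

T&K'05 then search over orderings (e.g. by swapping adjacent variables); the improvement discussed in the paper additionally recovers some acyclic structures that a fixed ordering would rule out.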

"Gobnilp is said to be better because it is exact solver. Now, it is exact solver also when n > 500 [...]"

=>Gobnilp is an optimization package with anytime behavior for learning the structure of BNs, based on cutting planes and mixed-integer linear programming. It finds an approximate solution if stopped early. In our experiments, Gobnilp was always approximate when provided with more than 500 variables, even though we allowed 24 hours of computation.

"Which data was used to produce Fig 1"

=>Data from the Child network, n=5000. It shows the exploration of parent sets for the variable 'Disease'.

"How did you set the structure and parameters for random networks?"

=>(Page 8 of the paper.) Each variable gets a set of parents whose size is drawn uniformly at random from 0 to 6. Parameters were sampled from a Dirichlet distribution with random parameters in (0,10).
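
A minimal sketch of one possible reading of this procedure (binary variables, a fixed topological order over indices, and the specific numpy calls are assumptions beyond what is stated above):

import numpy as np

def random_network(n, max_parents=6, arity=2, rng=None):
    # Each variable i gets a parent set of size uniform in {0, ..., max_parents},
    # chosen among the preceding variables; each CPT row is drawn from a Dirichlet
    # whose concentration parameters are themselves drawn uniformly in (0, 10).
    rng = rng or np.random.default_rng(0)
    parents, cpts = {}, {}
    for i in range(n):
        size = int(rng.integers(0, min(max_parents, i) + 1))
        parents[i] = sorted(rng.choice(i, size=size, replace=False)) if size else []
        alpha = rng.uniform(0.0, 10.0, size=arity) + 1e-9   # keep the Dirichlet valid
        cpts[i] = rng.dirichlet(alpha, size=arity ** len(parents[i]))
    return parents, cpts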

Assigned_Reviewer_3

"we theoretically only know that it will be at least as good as the current structure optimization?"

=>We theoretically know it will be better than T&K'05. We have experimentally seen it is also better than Gobnilp, which is arguably the state of the art for structure learning.

Assigned_Reviewer_5

"While it is good to compare the method to exact methods, the main competitors of this method are other heuristics such as greedy search, GES or MMHC and it should be compared to those, too."

=>(Please see also reply to Assigned_reviewer_3) We find consistently higher scores than both Gobnilp and T&K'05; the latter is shown to improve on other methods (M&W ICML'03). Moreover, we focus on score-based learning, so comparing to methods based on independence tests is not ideal.

Assigned_Reviewer_6

"missing results of section 5.2 should be added to the supplemental material"

=>We will provide these detailed results as supplementary material. Moreover, we also plan to release our software as open source.

Assigned_Reviewer_7

"The result looks promising, but all based on BIC score."

=>With slight modifications, our proofs about the BIC score can be extended to AIC and to the MIT score (de Campos, 2006, JMLR). We instead leave the extension to alternative scores, such as BDeu, for future work.
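
The common structure behind BIC and AIC (a sketch of the standard definitions, not text from the paper) is a penalized log-likelihood,

$$
score(X_i,\Pi_i)=LL(X_i\mid\Pi_i)-w\,(|X_i|-1)\,|\Pi_i|,
$$

with $w=\frac{\log N}{2}$ for BIC and $w=1$ for AIC; this is presumably why the proofs carry over with only minor changes, whereas scores such as BDeu have a different form and require separate treatment.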