
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The authors present a method to learn the structure of
a special class of Bayesian NoisyOr networks (bipartite graphical models
whose nodes are binary random variables). To this end the authors
search for special substructures among the observed variables, called
quartets, using analysis of eigenvalues and existing results on tensor
decompositions. They prove that their algorithm only depends polynomially
on the number of samples and demonstrate the advantage in speed by
comparing it on simulated data to an existing algorithm based on
variational EM.
Clarity I find the paper well written. The
results are compared nicely to existing work and it has been made clear on
which previous results their method depends. As a minor remark, I'd find
the 'Background' section easier to parse with explicit formulas for the
(conditional) probabilites.
The paper seems theoretically sound,
though I did not check the proofs in the supplementary material.
Significance Bayesian NoisyOr networks can be used to
model causal relations, where an effect is (probabilistically) caused by
any event out of a given subset of mutually independent events. Therefore
I think efficient methods of inferring the structure of these networks are
important. Since exact inference has been shown to be computationally
hard, structural assumptions on the decomposability/sparsity of the
network are needed. The paper proposes such an assumption and
demonstrates that it is not too restrictive by showing that a large
network of deseases (latent variables) and symptoms (observed variables)
can be inferred almost completely (although, as I understand, deseases are
modeled to be mutually independent, which may actually not be the case).
Originality The results appear original to me (e.g. the
combination of eigenvalue methods and tensor
decomposition.) Q2: Please summarize your review in 12
sentences
The authors address the problem of inference of a
NoisyOr Bayesian network and present an efficient algorithm that can be
applied to a subclass of these networks. This is a good quality paper that
contains theoretical guarantees on the running time of the algorithm as
well as demonstration of its performance in
experiments. Submitted by
Assigned_Reviewer_9
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a structure and parameter learning
algorithm for noisyor networks based on rank tests of fourthorder
moments and spectral decomposition methods. The proposed algorithm builds
on and generalizes existing algorithms for learning parameters of noisyor
networks when the structure is known. While the proposed method seems
original and potentially useful, the overall presentation of the paper can
be improved. More specifically,
(1) The background can be made
more selfcontained. Although the definitions of noisyor Bayesian
networks and negative moments can be inferred the description, it would be
better if explicit definitions are given. Also, since the proposed work
builds on Halpern and Sontag (2013), it would be nice to give a concrete,
selfcontained summary of that work. Without such a summary, the reader
has nothing to refer to when encountering discussions of previous work,
and may find the presentation hard to follow.
(2) In Algorithm 1,
the "AdjustedJointDistribution" seems to be referring to the
subtractingoff operation, while PRETEST was described in Section 5. They
should be explicitly mentioned in, for example, the caption of Figure 2.
In addition, there are some technical questions:
(1) The
tests and extension require threshold parameters. How sensitive are the
algorithms to the values of these thresholds? How should one choose
them? Q2: Please summarize your review in 12
sentences
The paper proposes a novel algorithm for learning
structure and parameters of noisyor Bayesian networks with theoretical
guarantees. While the method seems original and potentially useful, the
presentation of the paper needs to be improved.
Submitted
by Assigned_Reviewer_10
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
The paper discusses parameter estimation and two
versions of quartetbased structure learning algorithms for bipartite
noisyor networks of binary variables. The presented ideas are based on a
recent paper [Halpern & Sontag, 2013], extending to a more general
class of networks. The performance of the proposed algorithm is
illustrated on synthetic examples. The exposition of paper is generally
logical but the writing needs a lot of improvement.
The paper
contains potentially interesting algorithms but it is at times hard to
tell which ideas are new and which ones are borrowed. A summary of the
main contributions and their significance would be useful. What are the
real applications of the algorithm, for which existing algorithms would
not be applicable?
* Main concerns:  Experiments: The
authors list a number of practical applications but the performed
experiments are rather limited.  Exposition: There are too many
missing definitions, missing essential information (e.g., the algorithm by
Halpern & Sontag ), inconsistent notation, and wrong pointers to other
parts of the text. If this paper is accepted, I hope the authors take the
task of improving the text seriously.
* Additional remarks: 
Rank test: How practical is the epsilonranktestability assumption? In
what applications would it be satisfied (with large epsilon)? Can epsilon
be estimated from the data?  Definition 2: Please explain what you
mean by "the third eigenvalue of ... is at least epsilon" if the
eigenvalues are not necessary nonnegative real numbers.  Coherence
test: For each quartet, 4 sets of parameters have to be estimated. State
the computational cost of this estimation. "If the parameters learned do
not agree" should be further explained.  The recent tensor
decomposition method [Halpern & Sontag] seems to be an important part
of the proposed algorithms and should at least briefly be explained.
 The definition of noisyor networks could be more thorough. 
Some notation updates are required, e.g., on p.2, U stands for a set of
latent variables but on p.3 U is used to describe a set of observed
variables.  Figure 3 has wrong caption. Generally, Section 3 looks
like it suffered last minute restructuring.  Figure 2 should come
later, around the place it is referred to. The algorithms should probably
not be called figures.  Thorough proofreading is necessary. There
are multiple typos and missing references to figures and sections (in the
main file and in the supplementary material).
Q2: Please summarize your review in 12
sentences
This is a potentially interesting paper but the
current version needs improvement. The main contributions and their
significance should be made more clear. Further numerical experiments
should be performed, preferably related to real applications. The
presentation of the paper should be improved.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We would like to thank the reviewers for their careful
reading of the paper and will incorporate their suggestions to make the
work more selfcontained and clearer. We also recognize that some points
could have been explained more thoroughly. We attempt to address these
here and will integrate the clarifications into the revised version.
The main contributions of this work are the descriptions of the
quartet and pointwise mutual information tests, along with their
theoretical guarantees. While there are existing structurelearning
algorithms that use rankbased quartet tests (we provide references in the
“related work”), we are not aware of any that could apply to bipartite
noisyor networks, nor have we seen the pointwise mutual information test
used to identify children of latent variables or estimate parameters.
The Halpern & Sontag paper uses tensor decomposition for
parameter learning, but it is limited in that it requires that the network
structure be known in advance. In our paper we show that it is possible to
perform tensordecomposition based factor analysis on datasets to learn
bipartite noisyor networks when the network structure, or even the number
of latent variables, is not known in advance.
We use eigenvalues
instead of singular values when describing the rank test in order to
directly apply Elsner's result to prove the correctness of the algorithm
with polynomial sample complexity. While this proof can be extended to
singular values by bounding the L1 perturbation on a product of matrices,
this requires an additional step and gives rise to a worse bound (although
still polynomial). When we bound the eigenvalues away from 0, the bound
applies to their magnitude.
Quartet learnability is difficult to
determine without checking each network individually. However, some
networks clearly do not fall into this category, and these may give some
intuition about the concept. One such example would be if any pair of
latent variables have close to identical sets of children (i.e. all
children are shared except for at most two children of each parent).
Intuitively, parents must have sufficiently nonoverlapping sets of
children in order for the network to be quartet learnable.
In the
synthetic image dataset, the model is quartetlearnable, but not strongly.
Following the order of the parents from Figure 3 of the supplementary
material, latent variables 1 and 8 do not have singly coupled quartets
without a subtraction step, but can be learned after subtraction. In
Figure 4 of the main paper, for the models learned with 10K samples, the
first 6 latent variables are learned without any subtraction and the rest
are learned after subtracting off learned parents.
In the QMRDT
model, Table 1 shows that 469 latent variables are strongly quartet
learnable, and that all but 4 are quartet learnable. If the structure is
known, the number of latent variables that can be learned without a
subtraction step is 527. When a latent variable is learned, its
connections to all of its children are learned at the same time using
pointwise mutual information.
Epsilontestability:
Epsilontestability implies that it is hard to misidentify
singlycoupled quartets as a result of sampling noise. If epsilon is
large, that makes singlycoupled quartets easier to find. If epsilon is
small, we require more samples in order to estimate the eigenvalues more
accurately.
 Implementation: The pointwise mutual
information R can be estimated from the empirical conditional negative
moments using its definition from line 133 (also, l.h.s of line 145). All
other quantities that go into the quadratic equation of line 161 have been
previously learned.
 Running time: While we could have to
check all O(n^4) quartets, which is costly (though still polynomial time),
we filter candidate quartets by requiring that every pair of observations
in the quartet have pointwise mutual information above a low threshold (we
describe this process on line 393), reducing the number of quartets to 10%
of its initial size.
The coherence test obtains four estimates of
parameters through 4 tensor decompositions. Each decomposition is
inexpensive, requiring solving a quadratic equation and the inversion and
matrix multiplication of 2x2 matrices. The estimates "agree" if their
variance is below a threshold.
 Thresholds: The threshold
used by the quartet test determines the number of latent variables that
are found; setting them too strictly results in not learning all of the
latent variables, and too loosely results in discovering spurious latent
variables. The threshold in the pointwise mutual information test
determines the number of children ascribed to each latent variable. When
it is too lax, each latent variable connects to many observations; when it
is too stringent, each latent variable connects to very few observations.
To obtain our experimental results, we searched by orders of
magnitude (10^k for k in [6, 5, ..., 0]) until we found thresholds that
gave reasonable looking results using the largest sample size (10,000 data
points). We found that our results were not sensitive to the precise
setting of the thresholds. However, thresholds that were orders of
magnitude too high or too low did give clearly suboptimal results. In the
absence of prior knowledge, one could use a model selection criterion such
as BIC to choose between candidate thresholds.
 