
Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper develops a model for multifurcating trees
with edge lengths and observed data at the tree leaves; the model is based
on the beta coalescent from the probability literature. The authors
develop an MCMC inference scheme for their model, in which they draw on
existing work that uses belief propagation to perform inference for the
Kingman coalescent (a special case of the beta coalescent in which all trees
are binary). The particular challenge for inference here is that there are
many more possible parent-child node relationships when parents can have
multiple children (not just two). The authors seem to use a Dirichlet
Process mixture model (DPMM) at each node to narrow down the space of
possible children subsets to consider. As the authors note, even inference
with the Kingman coalescent is a hard problem. In experiments, they
compare to the Kingman coalescent and hierarchical agglomerative
clustering.
The Kingman coalescent is a popular modeling tool, so
it is great to see a practical extension of the Kingman coalescent to the
multifurcating case being explored for inference. The authors give a
convincing discussion of the interest in multifurcating trees. The DPMM
trick to narrow the space of children to consider for a parent is another
neat idea here and seems to be validated experimentally. That being said,
there seem to be some issues with readability and notation in the paper
(in particular, it is difficult to read on its own without the Teh, Daume
III, Roy paper and also its own supplementary material, and even then
there are some confusing points). And while I think that it is useful to
have a beta coalescent extension to the Kingman coalescent regardless, it
seems surprising that the Knowles, Ghahramani 2011 work on multifurcating
trees (which comes from the diffusion, rather than coalescing, perspective
for building trees) is not included in the experimental comparison (or at
least discussed more). Moreover, given that the main selling point here
seems to be the use of multifurcating trees in modeling, it would be
useful to see comparisons of the trees returned by the Kingman and beta
coalescent approaches.
Some further major points:
* Page 2, lines 56-57: My understanding is that this sentence is saying that the Dirichlet diffusion tree (DDT) and Pitman-Yor diffusion tree (PYDT) do not include a generative model for edge lengths. Such a statement would be false. An integral component of both models is the distance between child and parent nodes, which is determined by a divergence function. These trees also typically come with a (Brownian-motion-derived) generative model for leaf node locations in some (typically Euclidean) space. That the DDT and Kingman coalescent are in some sense inverses has been exploited for modeling (see the recent work of Teh et al. on fragmentation-coagulation processes). From this perspective, it seems disingenuous to claim that this paper is the first to provide a model of edge lengths for multifurcating trees, since the PYDT does so. The contributions of the current work seem, rather, to be the development of inference with the beta coalescent; these presumably have their own merits over, and separate from, the PYDT (just as the Kingman coalescent and DDT have their own merits).
* A number of points in Algorithm 1 are unclear:
  * Step 8: Is i local to s? Should it have an s subscript? Or should this incrementation be outside the for loop?
  * Step 14: pi doesn't really seem to be defined in the main text. Page 2, line 67: the "tree structure" pi is mentioned but not defined, and page 4, line 166: the "tree structure" theta_i is mentioned but not defined. How does pi differ from theta? Without definition, it's unclear what the subtraction and addition in Step 14 mean.
* There are other notational issues:
  * It is unclear what the role of p_0(y_i) is or how it is chosen. It is not clear why the current authors diverge from Teh et al., who seem to use a base distribution at the root, and here instead use a distribution at each node. How does this distribution interact with the transition kernel? Z_{\infty}(x) does not seem to be defined explicitly (Eq. 6).
  * The superscript s seems to be introduced on page 4, line 168 without being defined or discussed. There later seems to be some confusion about what is particle-specific and what is not. E.g., line 241, page 5: what does Omega_i restrict? All particles? The notation for Omega_i later switches to an s dependence. If a DPMM is run for each (s, i) pair, it would be helpful to be explicit in the text about this. Conversely, in Eq. 11, why does lambda_{n_i - 1}^s have an s superscript? That seems to at least overload the notation on page 2, line 83.
* Some details and choices in the experiments were confusing, e.g.:
  * Why was alpha = 1.2 chosen for the beta coalescent? This seems somewhat arbitrary. Likewise, why the choice of 5 chains with 5 particles per chain (page 7, line 364) vs. 5 chains and 80 particles each later (page 8, line 401)? Also on line 364, how many dimensions were kept after using PCA?
  * In the definition of the purity score, is the purity score averaged over all pairs?
  * Page 7, line 338 suggests that the synthetic data comes from a beta coalescent generative process, but the appendix seems to demonstrate this is not true. That the generating process is not a beta coalescent should be clearer in the main text. Moreover, in the supplementary material, it's not clear how the actual nodes (as opposed to the number of nodes) for merging are chosen at each round. Presumably the true model is not used for synthetic data generation because it is too computationally complex; this is worth mentioning in the text.
  * It's unclear why the effects described on line 345 on page 7 are evident. Does increasing the number of observations have no effect simply because the maximum (and best) score is already attained?
  * How were the sample hierarchies in Figs. 3 and 4 chosen? Were they, e.g., just the final samples in the chain?
Minor comments:
* Page 1, line 38 (and elsewhere): The term "bushy" is often used to describe binary trees (in particular, balanced binary trees). The authors here seem to use it to describe the multifurcating nature of trees. If the authors wish to use the word "bushy", I would suggest defining their desired usage early on (at least before the first use in the main text besides section headings).
* Page 2, line 80: Be explicit about the Kingman coalescent representation as a Lambda coalescent. Namely, the measure Lambda is the Dirac delta measure at 0 in this case.
* Page 2, line 103: Can anything specific be said to support the statement that this approach is "more scalable"? E.g., some asymptotic bounds/analysis?
* Page 4, line 187: Is this meant to be the computational complexity just for the ith coalescent event? If so, n_i - 1 or n_i should be used instead of n. If it's for all events, since the Kingman coalescent considers every pair of nodes after each coalescent event in the tree, it would seem the cost of growing a Kingman coalescent is something like a constant times n^2 + (n-1)^2 + (n-2)^2 + ..., which would be asymptotically like n^3, not n^2. And the result for the beta coalescent would not be obvious in this case.
* Page 5, line 217: Consider using a different notation for the identity matrix and the indicator function (cf. Eq. 9). Define the identity matrix notation at first use (here) instead of line 92, page 2 in the supplementary material. I'm not sure why ref [10] is referenced on page 5, line 217 or why this relationship between the mutation rate and Dirichlet concentration parameter is chosen.
* Eq. 12: Should this be t_b instead of t_{b_i}?
* In supplementary material Section 2, x_n is overloading notation with the main paper. Also, n_k should be explicitly defined, and also z_i^k (which appears to be 1{z_i == k}).
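To spell out the arithmetic behind the page 4, line 187 comment (my own back-of-envelope count, not a claim about the paper's implementation): if every pair among the remaining nodes is reconsidered after each coalescent event, the total cost telescopes to a cubic, not quadratic, quantity:

```latex
% After the event that leaves k nodes, all \binom{k}{2} pairs are scored;
% summing over the whole sequence of events (hockey-stick identity):
\sum_{k=2}^{n} \binom{k}{2} \;=\; \binom{n+1}{3} \;=\; \Theta(n^3).
% A single event among n_i surviving nodes, by contrast, costs only O(n_i^2).
```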
=== Update after Author Feedback ===
The Author Feedback addressed my concerns as much as seems possible in the constrained space. I am optimistic this will be a solid paper after a round of editing.
Q2: Please summarize your review in 1-2 sentences
This paper provides practical inference for modeling the beta coalescent, an extension to the popular Kingman coalescent which allows parents in learned hierarchies to have multiple children instead of just two. However, it is somewhat difficult to read on its own due to notational confusion, errors, and missing definitions, and it would benefit from a visual comparison of binary and multifurcating trees returned by the Kingman and beta coalescent frameworks.
Submitted by
Assigned_Reviewer_6
Q1: Comments to author(s).
The authors of this paper propose a new inference algorithm for the beta coalescent when applied to discovering hierarchical structures in data. Their work is a generalisation of the belief propagation framework in [1] to the beta coalescent. Belief propagation and sequential Monte Carlo (SMC) techniques have been combined before ([1]) to infer the hidden tree structure in coalescent models. However, during inference the number of all possible children to coalesce must be considered. This number grows prohibitively when the beta coalescent is considered, rendering inference intractable. To address this, the authors take a novel approach: they use a Dirichlet Process mixture model (DPMM) as part of the SMC proposal. SMC selects a children set from the DPMM to coalesce. Thus, the DPMM effectively restricts the number of possible children sets, making inference tractable. The results on synthetic data, comparing the proposed algorithm with the greedy one that enumerates all possible children sets, illustrate that for small datasets the two algorithms are comparable. For bigger datasets, the prohibitive cost of the greedy algorithm rules it out, underlining the advantage of the proposed one. Moreover, the authors compare the beta coalescent to Kingman's coalescent and hierarchical agglomerative clustering (hac) [2] on real datasets, showing that the beta coalescent outperforms them.
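To make the restriction idea concrete, here is a toy sketch (my own, not the authors' code; a greedy distance-threshold clustering stands in for the actual DPMM, and all names and parameters are invented). The point is only that candidate children sets are drawn from within clusters rather than from all 2^n subsets:

```python
from itertools import combinations

def candidate_children_sets(points, radius=1.0, max_size=4):
    """Toy stand-in for the DPMM restriction: instead of enumerating every
    subset of current nodes as a possible children set for the next
    coalescent event, first cluster the nodes, then only propose subsets
    (of size >= 2) that lie entirely within a single cluster."""
    clusters = []
    for i, p in enumerate(points):
        for c in clusters:
            rep = points[c[0]]            # cluster representative (first member)
            if abs(rep - p) <= radius:    # 1-d toy distance; a DPMM would do this probabilistically
                c.append(i)
                break
        else:
            clusters.append([i])
    cands = []
    for c in clusters:
        for k in range(2, min(len(c), max_size) + 1):
            cands.extend(combinations(c, k))
    return cands
```

With four well-separated points in two groups, this proposes only the two within-group pairs rather than all eleven subsets of size two or more, which is the tractability gain the review describes.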
The authors present a novel idea for improving inference in beta coalescent models which, in my opinion, can effectively be used for $\Lambda$-coalescent models in general. However, I am confused about the focus of the paper. Both in the abstract and in Section 1, they make clear that their contribution is in inference. However, they also mention that they present results of the beta coalescent on real datasets. The beta coalescent is a known model and, to my knowledge, it has been used extensively. I wouldn't regard it as a main contribution of this paper. However, I am pleased to see a comparison of the beta coalescent with Kingman's and hac. In their experiments on real datasets, they look at the structures found by the models and examine them qualitatively. Maybe the authors could include results on the predictive performance of the three models too. That would be an interesting addition. Moreover, comparing the performance of the proposed inference algorithm to the simple enumeration one on real datasets would add to the paper.

Section 2, line 055: The authors mention that the Pitman-Yor diffusion trees (PYDT) model by [3] doesn't distinguish edge lengths. I am not sure how correct this is. The edge lengths in the PYDT are determined by the divergence times in a similar way to how they are determined in the beta coalescent model the authors discuss.
Typos: line 038: nodes -> node
The writing of the paper is clear, with only a couple of confusing points which I already discussed. The proposed inference algorithm builds on an existing inference algorithm for coalescents; it is an incremental contribution. It also provides a qualitative comparison of the beta coalescent to Kingman and hac, which is interesting.
[1]: Yee Whye Teh, Hal Daumé III, and Daniel M. Roy. Bayesian Agglomerative Clustering with Coalescents. In NIPS, 2008.
[2]: Eads, D. Hierarchical Clustering. SciPy, 2007.
[3]: Knowles, D. and Z. Ghahramani. Pitman-Yor Diffusion Trees. In UAI, 2011.
*** UPDATE after author's response ***
I want to thank the authors for their detailed rebuttal. The paper presents an interesting inference algorithm for learning a Bayesian model of a beta coalescent. However, I am not sure whether this is supported enough by the actual presentation in the paper. If the authors added a comparison to other algorithms or models, I would be more convinced: either a comparison to other multifurcating models (like Knowles and Ghahramani's PYDT) or other inference methods for models with the beta coalescent, such as the ones in [19]. These would add to the paper. I do understand, though, that the code for these models might not be publicly available, and asking for an implementation of these from scratch might be beyond the scope of this paper.
Q2: Please summarize your review in 1-2 sentences
The current paper proposes a nice inference idea that contributes incrementally to the inference of coalescent models. The idea in itself is nice and could be incorporated into many similar models, but I am not sure whether it is significant enough. A clearer presentation of the contribution would be appreciated.
Submitted by
Assigned_Reviewer_7
Q1: Comments to author(s).
An extension to Kingman's coalescent, the "beta coalescent," is proposed. The beta coalescent allows for non-binary branching of the tree. The authors derive belief propagation and sequential Monte Carlo methods for inference. Experiments with synthetic and two real-world datasets are performed comparing the beta coalescent to Kingman's coalescent and hierarchical agglomerative clustering.
This paper presents a nice, if fairly straightforward, generalization of Kingman's coalescent to the non-binary case. The usage of a DPMM as a proposal distribution for SMC to make inference tractable was clever, and the justification provided for why the DPMM would be a good proposal was strong. The experimental results provide further support for the usage of the DPMM.
While the paper was technically strong, the need for a "bushy" hierarchical clustering model was not well-motivated in the introduction. The authors should expound on at least 2 of the examples cited there. The authors claim "it is crucial to model hierarchies that are non-binary." While I am personally inclined to agree with them, they provide no justification for the statement.
The two real data examples given were quite illuminating. I was intrigued by Fig 3: in the human tissue example only three non-binary branches were present, and in the newsgroup example there was only one non-binary branch. I am thus surprised that the beta model significantly outperformed the Kingman model. An explanation of this in the paper would be appreciated.
One other concern is that for the synthetic data example, the beta model did not outperform the competitors, and the authors fail to provide an explanation as to why this might be.
UPDATE (based on author feedback): The authors are thanked for their thorough response. The key points are for the authors to clarify some of the technical details pointed to by the other reviews, the conditions used in the experiments, and the interpretation of the experimental results.
Q2: Please summarize your review in 1-2 sentences
A technically strong (if incremental) contribution
with many potential applications. The paper would benefit from stronger
motivation and further exposition in the results section.
Q1: Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the anonymous reviewers for their detailed
comments on our paper presenting belief propagation applied to the Beta
coalescent. In our response, we will review our goals in presenting our
work, discuss the tough choices we made in exposition, comment on related
work, and then cover remaining questions.
=== Motivation ===
Reviewer_6 is right; we do not claim that we are proposing the
Beta coalescent; it is an established model. However, efficient inference
methods (which have been applied to other Bayesian clustering algorithms)
have not yet been applied to the Beta coalescent. Doing so would allow us
to model bushy data more effectively.
Reviewer_7 points out that the need for bushy structures is not well-motivated. We will try to motivate the need for bushy structure and explain the challenges of modeling it better in the introduction and background, in addition to the biological examples we explore later in the paper.
=== Exposition ===
Reviewer_5 thought that this paper
was difficult to read without Teh et al’s paper. To conform to NIPS space
limitations, we chose to assume that readers were familiar with Teh et
al’s Kingman coalescent. We thank the reviewers for their concrete
suggestions on how to ameliorate this dependence to make both the paper
and the supplemental material more selfcontained.
Because we are applying our method to larger trees, it is difficult to visually compare the output of the Beta coalescent and Kingman's coalescent. Under NIPS space constraints, we chose to show the output of the Beta coalescent because (a) it is the novel model and (b) it is harder to render a binary tree. We do, however, offer many numerical comparisons between the methods. Still, we sympathize with Reviewer_5's preference for qualitative comparisons, and we will gladly add them to the supplemental information.
To use space efficiently, we also represented trees using compact notation. This caused some confusion. Reviewer_7 reasonably inferred that the trees discovered in the newsgroup data were mostly binary; they are not, however, because we lumped subtrees with only leaves into a single supernode (e.g., one node has 20 children, all of them leaf nodes).
=== Related Work ===
Reviewer_5 and Reviewer_6 correctly point out that diffusion trees also include edge lengths. We misstated this. However, Knowles and Ghahramani completely ignore edge lengths in their evaluations, which prevents us from knowing the quality of their edge lengths. Unlike their paper or other previous clustering algorithms, we do evaluate edge lengths. Thus, while this is not the first paper to model edge lengths, it is the first Bayesian clustering paper to evaluate edge lengths experimentally. We will clarify this distinction and correct our mistake.
=== Experiments ===
Reviewer_5 asked some questions about the details of experiments:
* The parameter alpha needs to be between 1 and 2. We compared different alpha values on the synthetic data set; with the DPMM proposal, alpha values outside of extreme cases do not change the outcome much, so we did not include this analysis in the paper.
* We use 5 chains throughout: 5 particles for synthetic data and 80 particles for real data. We use the first 10 dimensions of features after dimensionality reduction.
* The purity score is computed as follows: for any two leaf nodes with the same class label, we find the smallest subtree containing these two leaves and measure the fraction of the leaves in this subtree with the same class label. The expected value of this fraction is the purity score. This is also used in Teh et al's paper and Heller and Ghahramani's 2005 paper.
* The sample hierarchies in
Figure 3 and Figure 4 are the final sample from one chain.
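The purity computation described above could be sketched as follows (a minimal illustration of that description, not the paper's actual code; the nested-tuple tree and label dictionary are hypothetical structures chosen for the example):

```python
from itertools import combinations

def purity(tree, labels):
    """Purity score as described in the rebuttal: for each same-label pair
    of leaves, take the fraction of leaves sharing that label in the
    smallest subtree containing the pair, then average over all pairs."""
    def leaves(t):
        # A leaf is any non-tuple node; internal nodes are tuples of children.
        return [t] if not isinstance(t, tuple) else [x for c in t for x in leaves(c)]

    def smallest_subtree(t, a, b):
        # Descend into the unique child that still contains both leaves.
        if isinstance(t, tuple):
            for child in t:
                lv = leaves(child)
                if a in lv and b in lv:
                    return smallest_subtree(child, a, b)
        return t

    fractions = []
    for a, b in combinations(leaves(tree), 2):
        if labels[a] != labels[b]:
            continue                      # only same-label pairs contribute
        sub = leaves(smallest_subtree(tree, a, b))
        same = sum(labels[x] == labels[a] for x in sub)
        fractions.append(same / len(sub))
    return sum(fractions) / len(fractions)
```

On a tree whose subtrees align perfectly with the class labels this averages to 1.0, and it decreases as same-label leaves are separated into different subtrees.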
Reviewer_6 suggested comparing exhaustive enumeration and the proposed method on real data sets. However, even on synthetic data, enumeration is tractable only for datasets with ten observations; it would be impossible on real data, so we needed to run these experiments on synthetic data to have any results at all.
Reviewer_7 correctly points
out that the beta coalescent is better than existing clustering algorithms
on real data but only comparable on synthetic data. Synthetic data are
simple, and so inference can build binary trees that more closely
approximate the true bushier structure. This is more difficult on real
data.
=== Other issues ===
We heartily thank Reviewer_5 for their careful reading:
* Incrementing i should indeed be
outside the for loop in Algorithm 1.
* We defined \pi on line 94: \pi_i is a partition of nodes at time t_i, while \theta_i is a subtree at time t_i; we will ensure that readers are reminded of these definitions.
* p_0(y) is defined the same as in Teh et al, except we changed notation from q(y) to p_0(y); both are an initial distribution at each node. This initial distribution, together with the messages from children, determines the local normalizer Z at the node. We also follow Teh et al in using the notation Z_{\infty}(x) to denote the root normalizer. It is the same as Eq. 5 in our paper.
* Superscript s was introduced implicitly on line 168: the notation without s introduces the concept, while the superscript s distinguishes between different particles, for example, \Omega_i and \lambda_{n_i - 1}.
 