Paper ID: 635
Title: Bayesian Estimation of Latently-grouped Parameters in Undirected Graphical Models
Reviews

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper uses the intuition that real data are often generated in settings where different components share characteristics to argue that, if the data are to be modeled by a graphical model (an MRF in this paper), it is appropriate to assume that the parameter values come in clusters. To impose such clustering, the authors propose a DP-mixture prior over the parameters of the MRF and work out several approximations to Bayesian learning of the parameters of an MRF with a given structure.

The paper is well-structured and well-written, and the empirical evaluation on simulated data is thorough. The proposed idea (clustering the parameters of an MRF) is, to the best of my knowledge, novel, and the paper is technically sound, at least based on my cursory pass through the details. There are enough details in the paper to reproduce the results (albeit the paper is understandably dense).

Having said all that, I am not convinced that the improvement over the unconstrained MLE on real data (Senate voting) is due to the assumptions on which the authors base their idea. MRFs can be reformulated, perhaps with over-parameterization (e.g., a tree-structured MRF can be represented by a tree BN), and I do not see why the parameters under such a reparameterization would exhibit discernible clusters. It is more likely that an unconstrained MLE overfits, so imposing a strong prior (an infinite mixture over parameters) yields a better fit. It would be interesting to see whether a simple 2-component mixture prior would do even better than the more computationally complex DP mixture.

Minor:

Section 4 -- need to define VI.

Figure 2 is unreadable.
Q2: Please summarize your review in 1-2 sentences
The paper proposes an interesting prior over the parameters of a graphical model and describes approximations to the Bayesian approach to learning the parameters.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper tackles the difficult problem of hierarchical parameter
estimation in Markov random fields. The model consists of a Dirichlet
process prior for parameters, and the computational contribution is an
approximate MCMC method, since ``exact'' MCMC seems somewhat out of
reach for this problem.

The solution is not particularly elegant, but to be fair the problem
is among the worst for MCMC samplers to tackle.

It would be interesting to understand which of the main features of the solution contribute most to the results. To begin with, a trace of the approximate MCMC progression is absent. Without it, I suspect that most of the work is done by the initialization procedure and that the Bayesian aspect is minimal. (Nevertheless, I still think that just making this work at all is already not easy.) What is perhaps missing is a simple comparison against the method hinted at by the initialization procedure: first, fit the parameters; then cluster them; then refit the parameters using the clustered MLE objective function.
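For concreteness, such a baseline might look like the following rough sketch (the fitting routines are hypothetical placeholders, not the authors' code):

    # Hypothetical sketch of the suggested "clustered MLE" baseline:
    # 1) fit unconstrained parameters, 2) cluster them, 3) refit with tied parameters.
    import numpy as np
    from sklearn.cluster import KMeans

    def clustered_mle(data, n_clusters, fit_mle, fit_tied_mle):
        # fit_mle(data): unconstrained MLE of the edge parameters (e.g., via PCD).
        # fit_tied_mle(data, groups): refit with parameters tied within each group.
        theta = np.asarray(fit_mle(data))
        groups = KMeans(n_clusters=n_clusters).fit_predict(theta.reshape(-1, 1))
        return fit_tied_mle(data, groups)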

The write-up of the method in 3.2 could be improved. Notation gets in the way. For example, in the equations of lines 211-213 (p. 4), please do not write P(X; theta_i, theta_minus_i): indicating that it is proportional to the last element on that line makes no sense. Instead, write something like L(theta_i; X, theta_minus_i) or some other notation so that it is clearer what is random and what is not here.

The intuitive justification for the beta approximation is a bit disappointing, since once again it relies on PCD (and how is theta_tilde_i obtained? By fixing c and the other thetas? Is c_i marginalized? This needs to be clarified.)

I like the experiments comparing "full Bayesian" to MLE and others, but I have to say I am quite amazed at how close full Bayes and SBA are to each other. That alone makes the paper interesting for discussion, but, as I said, what is missing is a sensitivity analysis to better understand which aspects of the approach are responsible for its success.

Minor: is there a way of scaling the vertical axis of the VI Difference plots so we can understand how well the method does in absolute terms?
Q2: Please summarize your review in 1-2 sentences
A practical solution to a general and hard problem in learning multivariate distributions. Good in general, but it needs a sensitivity analysis to determine whether the true power behind it is simpler than what the approach proposes.

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a Bayesian approach to learning binary pairwise MRFs with latently tied parameters. A DP prior is used to model the grouping of parameters, and two approximate inference methods are compared with MLE.

Bayesian inference in MRFs is difficult; the problem is known as doubly intractable. The first algorithm, M-H with auxiliary variables, is based on the paper of Moller et al. (2006). It is technically sound (except for the sample generated by Gibbs sampling), but it could be slow and hard to mix, and therefore not feasible for real medium-sized applications. The second algorithm, Gibbs_SBA, is novel to my knowledge and seems to be a good candidate for approximate inference. The experiments show that the approximate methods are much faster than the exact method and obtain similar performance on small problems. This would be a meaningful contribution to the Bayesian approach to MRFs.

My concern is with the usage of the Stripped Beta Approximation. The derivation of the method lacks clarity. It is based on two approximations: (1) line 215: the distribution of all variables other than u and v is independent of the value of \theta_i; (2) line 228: the contribution of \theta_i in (5) is approximated with a Beta distribution. I do not understand how the second approximation is adopted. Apparently Equation 5 is a Beta function of \lambda, which is a complicated function of \theta. The authors do not provide sufficient explanation for the beta approximation or for the use of the MLE of \theta in the Beta parameters. While the experiments show that the overall performance of Gibbs_SBA is similar to that of the exact method in terms of the first two metrics, a figure or a paragraph of discussion on the accuracy of the approximation in a single sampling step would be very useful.

In the experiment on the real problem, the output parameters do not show any hard grouping, because the Bayesian approach yields a posterior distribution. So is the latent grouping prior useful only as regularization to prevent over-fitting, or can more meaning be drawn from it? The improvement in the pseudo-likelihood is not very significant. I guess that is because the sampling procedure is initialized with the result of PCD.

My last concern is about speed. The computational burden is a well-known problem for Bayesian inference. The authors compared the speed of the two approximate methods with that of the exact method. But how do they compare with the MLE algorithm based on PCD? It seems that for the fastest Bayesian inference method, Gibbs_SBA, one has to run MLE (PCD) for every parameter in every iteration. Would it be applicable to any large real-world problem?

Some minor questions and typos:
1. The binary pairwise MRF in the paper has pairwise potentials only. Would Gibbs_SBA still apply to an MRF with unary bias terms?
2. For the M-H algorithm in Section 3.1.1, do you also need to integrate over \theta when a new group is created?
3. Line 126: Section 3.2.2 -> 3.1.2
Q2: Please summarize your review in 1-2 sentences
A Bayesian MRF model with latently-grouped parameters is proposed, and two approximate inference algorithms are discussed. The second algorithm, Gibbs_SBA, is novel and could be a good contribution to Bayesian inference in MRFs, although the derivation and its validity are not entirely clear.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We appreciate the excellent advice from the three reviewers. They provided many useful suggestions that can dramatically improve the paper. Below are our replies to the individual reviewers. The additional experiments and results suggested by the reviewers are provided in the supplementary material (we do not include them in the main text because of the space restriction).

Reviewer 1 (Assigned_Reviewer_5):

We thank you for your insightful review. We agree that on the real data the improvement comes from the strong prior on the parameters, which provides further abstraction, much as in hierarchical modeling and topic modeling. Your idea of using a "2-component mixture prior" on the real data is a smart diagnostic. We quickly implemented it (initializing with two clusters and keeping the number of clusters fixed at 2 throughout). Its LPL-TEST is -9094.11 in Exp1 and -11616.89 in Exp2; its LPL-TRAIN is -11206.38 in Exp1 and -8618.72 in Exp2. The 2-component mixture prior thus appears to perform even worse than MLE. We therefore feel that the DP component is useful for recovering the number of hidden groups, and that this is important.

We will define VI in the paper.

We will make Figure 2 more readable.


Reviewer 2 (Assigned_Reviewer_6):

We thank you for your insightful review. We like your suggestion of presenting the "trace" of the Bayesian estimator. We will report this result in the supplementary material (for a given training set, we will report the estimation error and the likelihood of the data as a function of the number of MCMC steps performed).

Your suggestion of adding a "clustered MLE" baseline is excellent. We implemented this baseline in our experiments, and its performance is comparable to that of standard MLE. We therefore believe the true power behind our estimator lies in inferring the hidden groups under the Bayesian framework. We will report the corresponding results and add the discussion to the supplementary material. We appreciate that you pointed this out.

Thanks for your suggestions (such as replacing the density-function notation with likelihood-function notation) for rewriting Section 3.2. We will try to give a more intuitive justification for the beta approximation. Regarding your questions: you are right, theta_tilde_i comes from PCD. When we evaluate the group of one parameter, namely c_i, all the other groupings and parameters stay fixed. We then set the new group of the parameter according to the likelihood, as in (4).
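A rough sketch of this single-parameter update, with placeholder functions rather than the exact form of (4):

    # Rough sketch of resampling c_i with all other groupings and parameters fixed.
    # c is a list of integer group labels; group_weight(i, g) is a placeholder for
    # the weight of assigning parameter i to group g (prior times stripped-beta
    # likelihood); g = None denotes a newly created group.
    import numpy as np

    def resample_c_i(i, c, group_weight):
        existing = sorted(set(c[:i] + c[i + 1:]))
        candidates = existing + [None]
        w = np.array([group_weight(i, g) for g in candidates], dtype=float)
        k = np.random.choice(len(candidates), p=w / w.sum())
        return candidates[k] if candidates[k] is not None else max(c) + 1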

We are not aware of a way in the literature of scaling the vertical axis of the VI Difference plots.


Reviewer 3 (Assigned_Reviewer_7):

All your comments are correct and insightful. There are indeed two approximations within the usage of the Stripped Beta Approximation. For the second approximation, the motivation comes from the fact that the Beta distribution is a conjugate prior for the binomial distribution. Suppose that on edge i we have a potential function parameterized by \theta_i. At the same time, imagine that the rest of the network imposes another potential function parameterized by \eta_i, expressing how likely the rest of the graph thinks X_u and X_v are to take the same value (\eta_i stays fixed). When we multiply the two potentials, we get a single potential function parameterized by \lambda_i. However, when we evaluate the posterior distribution of \theta_i (conditional on the rest), we only observe the data (the count of X_u = X_v), which corresponds to \lambda_i. Therefore, we have to remove the contribution from the rest of the network, namely \eta_i. On the other hand, PCD provides an MLE estimate of \theta_i, which implicitly teases apart \theta_i and \eta_i. Therefore, in the end, we can obtain an approximate posterior distribution for \theta_i based on the MLE of \theta_i from PCD. We like your suggestion of further providing "a figure or a paragraph of discussion on the accuracy of the approximation on one single step of sampling". We will do that.
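In symbols, a rough sketch of this intuition (our shorthand here, not the exact form of Equation (5)): suppose the pair (X_u, X_v) agrees in n_s out of n training samples, and let \lambda_i denote the agreement probability induced by the combined potential. With a Beta(a, b) prior, conjugacy gives

    p(\lambda_i | data) \propto \lambda_i^{n_s + a - 1} (1 - \lambda_i)^{n - n_s + b - 1},

that is, \lambda_i | data ~ Beta(n_s + a, n - n_s + b). Because \lambda_i confounds \theta_i with the rest-of-network contribution \eta_i, while the PCD estimate \tilde\theta_i already teases the two apart, the stripped beta approximation places a Beta of this form around \tilde\theta_i rather than around the raw counts, which removes \eta_i from the conditional posterior of \theta_i.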

We agree with your comments on the real problem. The improvement comes from the strong prior on the parameters, which provides further abstraction, much as in hierarchical modeling and topic modeling. There might be no direct way of evaluating whether the improvement in pseudo-likelihood is significant, but we do feel the increase in log pseudo-likelihood is comparable to the increase in log (pseudo-)likelihood we gain in the simulations (please refer to Figures 1b, 4b, and 5b at the points where we simulate 200 and 300 training samples).

You are exactly right that the computational burden is high, because the likelihood is intractable and Bayesian inference is usually computation-intensive. The computational complexity is linear in the number of edges. We feel it should work fine for graphs with thousands of nodes (what you referred to as "medium-sized"). Please also note that our algorithm can easily be parallelized because it is basically MCMC.

For your minor comments:

1. Yes, our algorithm can handle a unary bias (node potential), since we can learn the node potential with PCD.
2. We do not need to integrate over \theta when a new group is created in the MH algorithm. We choose the proposal Q(c_i^*|c_i) to be the conditional prior \pi(c_i^*|{\bf c}_{-i}). If a new group is proposed, we draw a value for the new group from G_0. Our MH algorithm resembles Algorithm 5 in [22] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000. (A rough sketch of this proposal step appears after this list.)
3. Thanks for pointing out the typo.
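A rough sketch of the proposal step mentioned in our reply to question 2 (placeholder functions; see Algorithm 5 of Neal (2000) for the exact scheme):

    # Rough sketch of one MH update for c_i with the conditional prior as proposal.
    # Because the proposal is the conditional prior, the prior terms cancel and the
    # acceptance probability reduces to a likelihood ratio (placeholder functions).
    import numpy as np

    def propose_from_conditional_prior(c, i, alpha, sample_G0):
        # Existing group g is proposed with probability proportional to its size;
        # a new group with probability proportional to alpha, its value drawn from G_0.
        others = [cj for j, cj in enumerate(c) if j != i]
        groups = sorted(set(others))
        probs = np.array([others.count(g) for g in groups] + [alpha], dtype=float)
        k = np.random.choice(len(probs), p=probs / probs.sum())
        if k < len(groups):
            return groups[k], None            # join an existing group
        return max(c) + 1, sample_G0()        # open a new group

    def mh_update(c, i, alpha, sample_G0, likelihood):
        # likelihood(i, g, value) is a placeholder for the (auxiliary-variable)
        # likelihood of assigning parameter i to group g; value is only used
        # when g is a newly created group.
        c_star, new_value = propose_from_conditional_prior(c, i, alpha, sample_G0)
        accept = min(1.0, likelihood(i, c_star, new_value) / likelihood(i, c[i], None))
        if np.random.rand() < accept:
            c[i] = c_star
        return c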