Paper ID: 1339
Title: Bayesian inference as iterated random functions with applications to sequential inference in graphical models
Reviews

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

This paper gives a new perspective on the Bayesian posterior updates (BPU) as
a system of iterated random functions (IRF) [4].
By proving a general convergence theorem of IRF with semigroup property,
the convergence of algorithms for change point detection problems [17] are analyzed.
The major contributions are
- to establish a general convergence theory for IRF with semigroup property (Theorem 1),
- to cast existing (exact) algorithms for the classical change point problem (CCPP)
and the multiple change point problem (MCPP) in the IRF framework, and prove their
convergence, based on Theorem 1.
- to propose a fast approximate algorithm for MCPP and prove its convergence,
based on Theorem 1.

I think this is a nice paper with a significant impact.
BPU is clearly a Markov chain, and so it might be natural that
it is analyzed in the IRF framework.
However, the previous convergence results on IRF are not directly applicable to BPU,
because the Lipschitz condition is not satisfied in BPU.
Instead, the authors use semigroup property of BPU,
and establish another type of convergence theory.
I expect that the established theory would have wide range of applications.

Pros:
- A new perspective on BPU as a special case of IRF with semigroup property (3) is given.
- A general convergence theory (Theorem 1) for IRF with semigroup property
is established, which might have wide range of potential applications.
- A fast algorithm (Section 4.1) for MCPP is proposed with its convergence
guarantees (Theorem 2).

Cons:
- The relation between the theory (Theorem 2) and the experiment (Figure 1) is not
clear (see the first one in 'Other comments' below).
- No practical application of the multiple change point detection is introduced
(see the second one in 'Other comments' below).

Quality:
The paper is technically sound, and the claims are supported by theory.

Clarity:
The paper is clearly written, and well-organized.

Originality:
The perspective on BPU as IRF with semigroup property,
and the obtained convergence theorem (Theorem 1) are novel.

Significance:
The found relation between BPU and IRF with semigroup property,
and the established convergence theory could be used by other theoreticians
to analyze algorithms for other problems.

Other comments:
- It would be nice if Eqs.(19) and (20) are somehow expressed in the posterior path
graphs in Figure 1. This would make it clear that the simulation supports Theorem 2.
In Section 4.3, it is said that 'in some cases, they might deviate for a while,
but quickly snap back together once the change points have occurred.'
Readers cannot see this because the change points are not indicated.
- It would be nice if any practical example of MCCP is introduced
(I could not find any also in [17]).
Especially, I wonder in what applications the rule
lambda_e = min(lambda_s1, lambda_s2) is appropriate.
To me, it seems more natural to assume that the distribution of X_e
changes both at lambda_s1 and at lambda_s2.
Moreover, if I understand correctly, Theorem 2 guarantees the convergence
after all the sensors are broken.
I suppose people are more interested in the convergence after any of the sensors
is broken, or the sensor on which a user focuses is broken.
- Perhaps, some thetas in and around Eq.(4) should not be bold-faced.
- In Line 225, 'probabilities 10 by message-passing' looks corrupted.
Is 10 a citation?
- In the last line in Section 5, parentheses for Eq.(29) are missing.


Q2: Please summarize your review in 1-2 sentences

A nice paper that gives a new perspective on the Bayesian posterior updates
as a system of iterated random functions with semigroup property,
and establishes a general convergence theory for the system.
The used techniques and obtained theorems might have wide range
of potential applications.



Submitted by Assigned_Reviewer_9

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents an interesting interpretation of Bayesian inference
as as a system of iterated random functions, with a particular focus on
Bayesian changepoint detection. This interpretation allows for
convergence analysis of message passing algorithms which is of wide
interest. The general approach for convergence analysis is presented
with focus on providing details of the proof technique.

I found the interpretation of Bayesian inference as iterated random
functions interesting and think others will as well. It was interesting
to see that we can derive algorithms that we can show to have geometric
convergence. Overall I think the paper would be improved if there was
more of a discussion and focus on the implications of this theory and
the insights it gives us into obtaining efficient algorithms. Here there
was a specific focus on changepoint detection: was this special for some
reason. What other model classes can we extend this to and for which it is reasonable
to assume the conditions will hold. The simulation section can be improved
to be more clear on the setup, indicating where the changepoints are
and connecting the observed with theoretical convergence over n.
Q2: Please summarize your review in 1-2 sentences
An interesting interpretation of the paper showing convergence analysis for message passing algorithms. Can be improved by looking into accessibility for many not familiar with the tools used here and the settings in which such tools should be used.

Submitted by Assigned_Reviewer_10

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper presents iterated random functions and discusses the theory of its convergence. This theory is then used to analyze the convergence of message passing algorithms for Bayesian inference.

Quality: Excellent.

Clarity: Good.

This paper is original and makes significant contributions.

I should admit that I haven't gone through the proofs in complete details, but at the first pass it looks ok to me. I will surely be spending next few days checking things more carefully.

Some questions:
1) What are typical value of -I+epsilon? Is it always negative? If not, under what condition will it be.

2) Can these results be used to explain the convergence for other graphical models where convergence is known, for example, BP on a tree?
Q2: Please summarize your review in 1-2 sentences
To my knowledge, there is very little work done on the analysis of message-passing algorithm, and hence any such attempts are very useful. This paper uses iterated random functions to analyze the convergence of message passing, which is very interesting. I hope that the presented approach can be generalized to other graphical models, to better understand the convergence of message passing algorithms.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We would like to thank the reviewers for their positive and encouraging comments. Below, we try to answer some of the questions raised.

------------
(Reviewer 1)

- "What are typical values of -I+$\eps$? Is it always negative? If not, under what condition will it be."

We can always choose $\eps$ to make $-I +\eps$ negative as long as the information $I$ is strictly positive, which is a very mild condition. (Consider for example $\eps = I/2$.)

- "Can these results be used to explain the convergence for other graphical models where convergence is known, for example, BP on a tree?"

This is certainly an interesting direction for future extensions. As of now, we do not know how the arguments can be adapted to BP, but we are working on it.

------------
(Reviewer 2)

- "It would be nice if Eqs.(19) and (20) are somehow expressed in the posterior path graphs in Figure 1. This would make it clear that the simulation supports Theorem 2. In Section 4.3, it is said that 'in some cases, they might deviate for a while, but quickly snap back together once the change points have occurred.' Readers cannot see this because the change points are not indicated."

Thanks for the suggestion. We agree and will add markers to indicate the change points.


- "It would be nice if any practical example of MCCP is introduced (I could not find any also in [17]). Especially, I wonder in what applications the rule $\lambda_e = \min(\lambda_{s_1}, \lambda_{s_2})$ is appropriate. To me, it seems more natural to assume that the distribution of $X_e$ changes both at $\lambda_{s_1}$ and at $\lambda_{s_2}$."

The minimum rule is perhaps the simplest form the shared information can take. It simplifies theoretical computations (e.g. finding the $T$ operator and bounding its Lipschitz constant) and can still serve as a first approximation to more realistic situations you have in mind. (Roughly speaking, a shared sequence that changes in response to both change points has more information than the one which responds only to the minimum. So, if the simple model exhibits convergence, we conjecture that the more complex model will do so, at a rate at least as fast as the simple model.)

- "Moreover, if I understand correctly, Theorem 2 guarantees the convergence after all the sensors are broken. I suppose people are more interested in the convergence after any of the sensors is broken, or the sensor on which a user focuses is broken."

Because, we are looking at the convergence of the joint posterior, that is the appropriate regime to look at. Deriving weaker conditions under which a certain marginal of the joint distribution converges is an interesting direction for future investigations.

------------
(Reviewer 3)

-"Here there was a specific focus on changepoint detection: was this special for some reason. What other model classes can we extend this to and for which it is reasonable to assume the conditions will hold?"

We are presently working on extending the theory to the general metric context (e.g., Bayesian models with continuous variable). We are also trying to understand how much of the theory carries to a general belief propagation algorithm (BP).

- "The simulation section can be improved to be more clear on the setup, indicating where the changepoints are and connecting the observed with theoretical convergence over $n$."

This was also pointed out by reviewer 2 and we agree. We will indicate the change points on the plots and comment on the connection with theory.

-------------
We again thanks the reviewers for the thoughtful comments.