{"title": "Bayesian inference as iterated random functions with applications to sequential inference in graphical models", "book": "Advances in Neural Information Processing Systems", "page_first": 2922, "page_last": 2930, "abstract": "We propose a general formalism of iterated random functions with semigroup property, under which exact and approximate Bayesian posterior updates can be viewed as specific instances. A convergence theory for iterated random functions is presented. As an application of the general theory we analyze convergence behaviors of exact and approximate message-passing algorithms that arise in a sequential change point detection problem formulated via a latent variable directed graphical model. The sequential inference algorithm and its supporting theory are illustrated by simulated examples.", "full_text": "Bayesian inference as iterated random functions with\n\napplications to sequential inference in graphical\n\nmodels\n\nArash A. Amini\n\nDepartment of Statistics\nUniversity of Michigan\n\nAnn Arbor, Michigan 48109\n\naaamini@umich.edu\n\nXuanLong Nguyen\n\nDepartment of Statistics\nUniversity of Michigan\n\nAnn Arbor, Michigan 48109\nxuanlong@umich.edu\n\nAbstract\n\nWe propose a general formalism of iterated random functions with semigroup\nproperty, under which exact and approximate Bayesian posterior updates can be\nviewed as speci\ufb01c instances. A convergence theory for iterated random functions\nis presented. As an application of the general theory we analyze convergence\nbehaviors of exact and approximate message-passing algorithms that arise in a\nsequential change point detection problem formulated via a latent variable directed\ngraphical model. The sequential inference algorithm and its supporting theory are\nillustrated by simulated examples.\n\n1 Introduction\n\nThe sequential posterior updates play a central role in many Bayesian inference procedures. 
As an\nexample, in Bayesian inference one is interested in the posterior probability of variables of interest\ngiven the data observed sequentially up to a given time point. As a more specific example which\nprovides the motivation for this work, in a sequential change point detection problem [1], the key\nquantity is the posterior probability that a change has occurred given the data observed up to present\ntime. When the underlying probability model is complex, e.g., a large-scale graphical model, the calculation\nof such quantities in a fast and online manner is a formidable challenge. In such situations\napproximate inference methods are required; for graphical models, message-passing variational\ninference algorithms present a viable option [2, 3].\n\nIn this paper we propose to treat Bayesian inference in a complex model as a specific instance of an\nabstract system of iterated random functions (IRF), a concept that originally arises in the study of\nMarkov chains and stochastic systems [4]. The key technical property of the proposed IRF formalism\nthat enables the connection to Bayesian inference under conditionally independent sampling is\nthe semigroup property, which shall be defined shortly in the sequel. It turns out that most exact and\napproximate Bayesian inference algorithms may be viewed as specific instances of an IRF system.\nThe goal of this paper is to present a general convergence theory for the IRF with semigroup property.\nThe theory is then applied to the analysis of exact and approximate message-passing inference\nalgorithms, which arise in the context of distributed sequential change point problems using latent\nvariable and directed graphical model as the underlying modeling framework.\n\nWe wish to note a growing literature on message-passing and sequential inference based on graphical\nmodeling [5, 6, 7, 8]. 
On the other hand, convergence and error analysis of message-passing\nalgorithms in graphical models is quite rare and challenging, especially for approximate algorithms,\nand they are typically confined to the specific form of belief propagation (sum-product) algorithm\n[9, 10, 11]. To the best of our knowledge, there is no existing work on the analysis of message-passing\ninference algorithms for calculating conditional (posterior) probabilities for latent random\nvariables present in a graphical model. While such an analysis is a byproduct of this work, the viewpoint\nwe put forward here that equates Bayesian posterior updates to a system of iterated random\nfunctions with semigroup property seems to be new and may be of general interest.\n\nThe paper is organized as follows. In Sections 2\u20133, we introduce the general IRF system and\nprovide our main result on its convergence. The proof is deferred to Section 5. As an example of\nthe application of the result, we will provide a convergence analysis for an approximate sequential\ninference algorithm for the problem of multiple change point detection using graphical models. The\nproblem setup and the results are discussed in Section 4.\n\n2 Bayesian posterior updates as iterated random functions\n\nIn this paper we shall restrict ourselves to multivariate distributions of binary random variables.\nTo describe the general iteration, let Pd := P({0, 1}^d) be the space of probability measures on\n{0, 1}^d. The iteration under consideration recursively produces a random sequence of elements of\nPd, starting from some initial value. We think of Pd as a subset of R^{2^d} equipped with the \u21131 norm\n(that is, the total variation norm for discrete probability measures). To simplify, let m := 2^d, and\nfor x \u2208 Pd, index its coordinates as x = (x0, . . . , xm\u22121). 
For \u03b8 \u2208 R^m_+, consider the function\nq\u03b8 : Pd \u2192 Pd, defined by\n\nq\u03b8(x) := (x \u2299 \u03b8) / (x^T \u03b8),   (1)\n\nwhere x^T \u03b8 = \u2211_i xi \u03b8i is the usual inner product on Rm and x \u2299 \u03b8 is pointwise multiplication\nwith coordinates [x \u2299 \u03b8]i := xi \u03b8i, for i = 0, 1, . . . , m \u2212 1. This function models the prior-to-posterior\nupdate according to the Bayes rule. One can think of \u03b8 as the likelihood and x as the prior\ndistribution (or the posterior in the previous stage) and q\u03b8(x) as the (new) posterior based on the two.\nThe division by x^T \u03b8 can be thought of as the division by the marginal to make a valid probability\nvector. (See Example 1 below.)\n\nWe consider the following general iteration\n\nQn(x) = q\u03b8n(T(Qn\u22121(x))), n \u2265 1,\nQ0(x) = x,   (2)\n\nfor some deterministic operator T : Pd \u2192 Pd and an i.i.d. random sequence {\u03b8n}n\u22651 \u2282 R^m_+. By\nchanging operator T, one obtains different iterative algorithms.\n\nOur goal is to find sufficient conditions on T and {\u03b8n} for the convergence of the iteration to an\nextreme point of Pd, which without loss of generality is taken to be e(0) := (1, 0, 0, . . . , 0). Standard\ntechniques for proving the convergence of iterated random functions are usually based on showing\nsome averaged-sense contraction property for the iteration function [4, 12, 13, 14], which in our\ncase is q\u03b8n(T(\u00b7)). See [15] for a recent survey. 
These techniques are not applicable to our problem\nsince q\u03b8n is not in general Lipschitz, in any suitable sense, precluding q\u03b8n(T(\u00b7)) from satisfying the\naforementioned conditions.\n\nInstead, the functions {q\u03b8n} have another property which can be exploited to prove convergence;\nnamely, they form a semi-group under pointwise multiplication,\n\nq_{\u03b8 \u2299 \u03b8\u2032} = q\u03b8 \u25e6 q\u03b8\u2032,   \u03b8, \u03b8\u2032 \u2208 R^m_+,   (3)\n\nwhere \u25e6 denotes the composition of functions. If T is the identity, this property allows us to write\nQn(x) = q_{\u2299_{i=1}^n \u03b8i}(x). This is nothing but the Bayesian posterior update equation under\nconditionally independent sampling, while modifying T results in an approximate Bayesian inference\nprocedure. Since, after suitable normalization, \u2299_{i=1}^n \u03b8i concentrates around a deterministic quantity\nby the i.i.d. assumption on {\u03b8i}, this representation helps in determining the limit of {Qn(x)}. The\nmain result of this paper, summarized in Theorem 1, is that the same conclusions can be extended\nto general Lipschitz maps T having the desired fixed point.\n\n3 General convergence theory\n\nConsider a sequence {\u03b8n}n\u22651 \u2282 R^m_+ of i.i.d. random elements, where m = 2^d. Let \u03b8n =\n(\u03b8_n^0, \u03b8_n^1, . . . , \u03b8_n^{m\u22121}) with \u03b8_n^0 = 1 for all n, and\n\n\u03b8_n^\u2217 := max_{i=1,2,...,m\u22121} \u03b8_n^i.   (4)\n\nThe normalization \u03b8_n^0 = 1 is convenient for showing convergence to e(0). This is without loss of\ngenerality, since q\u03b8 is invariant to scaling of \u03b8, that is, q\u03b8 = q\u03b2\u03b8 for any \u03b2 > 0.\n\nAssume the sequence {log \u03b8_n^\u2217} to be i.i.d. sub-Gaussian with mean \u2264 \u2212I\u2217 < 0 and sub-Gaussian\nnorm \u2264 \u03c3\u2217 \u2208 (0, \u221e). The sub-Gaussian norm can be taken to be the \u03c82 Orlicz norm (cf. 
[16,\nSection 2.2]), which we denote by \u2016\u00b7\u2016_{\u03c82}. By definition, \u2016Y\u2016_{\u03c82} := inf{C > 0 : E \u03c82(|Y|/C) \u2264 1},\nwhere \u03c82(x) := e^{x^2} \u2212 1.\n\nLet \u2016\u00b7\u2016 denote the \u21131 norm on Rm. Consider the sequence {Qn(x)}n\u22650 defined in (2) based on\n{\u03b8n} as above, an initial point x = (x0, . . . , xm\u22121) \u2208 Pd and a Lipschitz map T : Pd \u2192 Pd. Let\nLipT denote the Lipschitz constant of T, that is, LipT := sup_{x \u2260 y} \u2016T(x) \u2212 T(y)\u2016 / \u2016x \u2212 y\u2016.\n\nOur main result regarding iteration (2) is the following.\n\nTheorem 1. Assume that L := LipT \u2264 1 and that e(0) is a fixed point of T. Then, for all n \u2265 0\nand \u03b5 > 0,\n\n\u2016Qn(x) \u2212 e(0)\u2016 \u2264 2 ((1 \u2212 x0)/x0) (L e^{\u2212I\u2217+\u03b5})^n   (5)\n\nwith probability at least 1 \u2212 exp(\u2212c n \u03b5^2/\u03c3\u2217^2), for some absolute constant c > 0.\n\nThe proof of Theorem 1 is outlined in Section 5. Our main application of the theorem will be to the\nstudy of convergence of stopping rules for a distributed multiple change point problem endowed with\nlatent variable graphical models. Before stating that problem, let us consider the classical (single)\nchange point problem first, and show how the theorem can be applied to analyze the convergence of\nthe optimal Bayes rule.\n\nExample 1. In the classical Bayesian change point problem [1], one observes a sequence\n{X^1, X^2, X^3, . . . } of independent data points whose distributions change at some random time\n\u03bb. More precisely, given \u03bb = k, X^1, X^2, . . . , X^{k\u22121} are distributed according to g, and\nX^k, X^{k+1}, . . . according to f. Here, f and g are densities with respect to some underlying\nmeasure. One also assumes a prior \u03c0 on \u03bb, usually taken to be geometric. The goal is to find a\nstopping rule \u03c4 which can predict \u03bb based on the data points observed so far. 
It is well-known\nthat a rule based on thresholding the posterior probability of \u03bb is optimal (in a Neyman-Pearson\nsense). To be more specific, let \ud835\udc17^n := (X^1, X^2, . . . , X^n) collect the data up to time n and let\n\u03b3n[n] := P(\u03bb \u2264 n | \ud835\udc17^n) be the posterior probability of \u03bb having occurred before (or at) time n.\nThen, the Shiryayev rule\n\n\u03c4 := inf{n \u2208 N : \u03b3n[n] \u2265 1 \u2212 \u03b1}   (6)\n\nis known to asymptotically have the least expected delay, among all stopping rules with false alarm\nprobability bounded by \u03b1.\n\nTheorem 1 provides a way to quantify how fast the posterior \u03b3n[n] approaches 1, once the change\npoint has occurred, hence providing an estimate of the detection delay, even for a finite number of\nsamples. We should note that our approach here is somewhat independent of the classical techniques\nnormally used for analyzing stopping rule (6). To cast the problem in the general framework of (2),\nlet us introduce the binary variable Z^n := 1{\u03bb \u2264 n}, where 1{\u00b7} denotes the indicator of an event.\nLet Qn be the (random) distribution of Z^n given \ud835\udc17^n, in other words,\n\nQn := (P(Z^n = 1 | \ud835\udc17^n), P(Z^n = 0 | \ud835\udc17^n)).\n\nSince \u03b3n[n] = P(Z^n = 1 | \ud835\udc17^n), convergence of \u03b3n[n] to 1 is equivalent to the convergence of Qn to\ne(0) = (1, 0). We have\n\nP(Z^n | \ud835\udc17^n) \u221d_{Z^n} P(Z^n, X^n | \ud835\udc17^{n\u22121}) = P(X^n | Z^n) P(Z^n | \ud835\udc17^{n\u22121}).   (7)\n\nNote that P(X^n | Z^n = 1) = f(X^n) and P(X^n | Z^n = 0) = g(X^n). Let \u03b8n := (1, g(X^n)/f(X^n)) and\n\nRn\u22121 := (P(Z^n = 1 | \ud835\udc17^{n\u22121}), P(Z^n = 0 | \ud835\udc17^{n\u22121})).\n\nThen, (7) implies that Qn can be obtained by pointwise multiplication of Rn\u22121 by f(X^n)\u03b8n and\nnormalization to make a probability vector. Alternatively, we can multiply by \u03b8n, since the procedure\nis scale-invariant, that is, Qn = q\u03b8n(Rn\u22121) using definition (1). It remains to express Rn\u22121 in\nterms of Qn\u22121. 
This can be done by using the Bayes rule and the fact that P(\ud835\udc17^{n\u22121} | \u03bb = k) is the\nsame for k \u2208 {n, n + 1, . . . }. In particular, after some algebra (see [17]), one arrives at\n\n\u03b3n\u22121[n] = \u03c0(n)/\u03c0[n \u2212 1]^c + (\u03c0[n]^c/\u03c0[n \u2212 1]^c) \u03b3n\u22121[n \u2212 1],   (8)\n\nwhere \u03b3k[n] := P(\u03bb \u2264 n | \ud835\udc17^k), \u03c0(n) is the prior on \u03bb evaluated at time n, and \u03c0[k]^c :=\n\u2211_{i=k+1}^\u221e \u03c0(i). For the geometric prior with parameter \u03c1 \u2208 [0, 1], we have \u03c0(n) := (1 \u2212 \u03c1)^{n\u22121} \u03c1 and\n\u03c0[k]^c = (1 \u2212 \u03c1)^k. The above recursion then simplifies to \u03b3n\u22121[n] = \u03c1 + (1 \u2212 \u03c1)\u03b3n\u22121[n \u2212 1]. Expressing\nin terms of Rn\u22121 and Qn\u22121, the recursion reads\n\nRn\u22121 = T(Qn\u22121), where T(x) := \u03c1 (1, 0) + (1 \u2212 \u03c1) x.\n\nIn other words, T(x) = \u03c1e(0) + (1 \u2212 \u03c1)x for x \u2208 P2.\n\nThus, we have shown that an iterative algorithm for computing \u03b3n[n] (hence determining rule (6))\ncan be expressed in the form of (2) for appropriate choices of {\u03b8n} and operator T. Note that T in\nthis case is Lipschitz with constant 1 \u2212 \u03c1, which is always guaranteed to be \u2264 1.\n\nWe can now use Theorem 1 to analyze the convergence of \u03b3n[n]. Let us condition on \u03bb = k + 1,\nthat is, we assume that the change point has occurred at time k + 1. Then, the sequence {X^n}n\u2265k+1\nis distributed according to f, and we have E log \u03b8_n^\u2217 = \u222b f log(g/f) = \u2212I, where I is the KL divergence\nbetween densities f and g. Noting that \u2016Qn \u2212 e(0)\u2016 = 2(1 \u2212 \u03b3n[n]), we immediately obtain the\nfollowing corollary.\n\nCorollary 1. Consider Example 1 and assume that log(g(X)/f(X)), where X \u223c f, is sub-Gaussian\nwith sub-Gaussian norm \u2264 \u03c3. Let I := \u222b f log(f/g). Then, conditioned on \u03bb = k + 1,\nwe have for n \u2265 1,\n\n|\u03b3n+k[n + k] \u2212 1| \u2264 [(1 \u2212 \u03c1) e^{\u2212I+\u03b5}]^n (1/\u03b3k[k] \u2212 1)\n\nwith probability at least 1 \u2212 exp(\u2212c n \u03b5^2/\u03c3^2).\n\n4 Multiple change point problem via latent variable graphical models\n\nWe now turn to our main application for Theorem 1, in the context of a multiple change point\nproblem. In [18], graphical model formalism is used to extend the classical change point problem\n(cf. Example 1) to cases where multiple distributed latent change points are present. Throughout\nthis section, we will use this setup, which we now briefly sketch.\n\nOne starts with a network G = (V, E) of d sensors or nodes, each associated with a change point \u03bbj.\nEach node j observes a private sequence of measurements \ud835\udc17_j = (X_j^1, X_j^2, . . . ) which undergoes a\nchange in distribution at time \u03bbj, that is,\n\nX_j^1, X_j^2, . . . , X_j^{k\u22121} | \u03bbj = k \u223c(iid) gj,   X_j^k, X_j^{k+1}, \u00b7 \u00b7 \u00b7 | \u03bbj = k \u223c(iid) fj,\n\nfor densities gj and fj (w.r.t. some underlying measure). Each connected pair of nodes shares\nan additional sequence of measurements. For example, if nodes s1 and s2 are connected, that is,\ne = (s1, s2) \u2208 E, then they both observe \ud835\udc17_e = (X_e^1, X_e^2, . . . ). The shared sequence undergoes a\nchange in distribution at some point depending on \u03bbs1 and \u03bbs2. More specifically, it is assumed that\nthe earlier of the two change points causes a change in the shared sequence, that is, the distribution\nof \ud835\udc17_e conditioned on (\u03bbs1, \u03bbs2) only depends on \u03bbe := \u03bbs1 \u2227 \u03bbs2, the minimum of the two, i.e.,\n\nX_e^1, X_e^2, . . . , X_e^k | \u03bbe = k \u223c(iid) ge,   X_e^{k+1}, X_e^{k+2}, \u00b7 \u00b7 \u00b7 | \u03bbe = k \u223c(iid) fe.\n\nLetting \u03bb\u2217 := {\u03bbj}j\u2208V and \ud835\udc17_\u2217^n = {\ud835\udc17_j^n, \ud835\udc17_e^n}j\u2208V,e\u2208E, we can write the joint density of all random\nvariables as\n\nP(\u03bb\u2217, \ud835\udc17_\u2217^n) = \u220f_{j\u2208V} \u03c0j(\u03bbj) \u220f_{j\u2208V} P(\ud835\udc17_j^n | \u03bbj) \u220f_{e\u2208E} P(\ud835\udc17_e^n | \u03bbs1, \u03bbs2),   (9)\n\nwhere \u03c0j is the prior on \u03bbj, which we assume to be geometric with parameter \u03c1j. Network G\ninduces a graphical model [2] which encodes the factorization (9) of the joint density (cf. Fig. 1).\n\nSuppose now that each node j wants to detect its change point \u03bbj with minimum expected delay,\nwhile maintaining a false alarm probability of at most \u03b1. Inspired by the classical change point problem,\none is interested in computing the posterior probability that the change point has occurred up\nto now, that is,\n\n\u03b3_j^n[n] := P(\u03bbj \u2264 n | \ud835\udc17_\u2217^n).   (10)\n\nThe difference with the classical setting is that the conditioning is done on all the data in the network (up\nto time n). It is easy to verify that the natural stopping rule\n\n\u03c4j = inf{n \u2208 N : \u03b3_j^n[n] \u2265 1 \u2212 \u03b1}\n\nsatisfies the false alarm constraint. It has also been shown that this rule is asymptotically optimal in\nterms of expected detection delay. Moreover, an algorithm based on the well-known sum-product [2]\nhas been proposed, which allows the nodes to compute their posterior probabilities (10) by message-passing.\nThe algorithm is exact when G is a tree, and scales linearly in the number of nodes. More\nprecisely, at time n, the computational complexity is O(nd). 
The drawback is the linear dependence\non n, which makes the algorithm practically infeasible if the change points model rare events (where\nn could grow large before detecting the change).\n\nIn the next section, we propose an approximate message passing algorithm which has computational\ncomplexity O(d), at each time step. This circumvents the drawback of the exact algorithm and\nallows for indefinite run times. We then show how the theory developed in Section 3 can be used to\nprovide convergence guarantees for this approximate algorithm, as well as the exact one.\n\n4.1 Fast approximate message-passing (MP)\n\nWe now turn to an approximate message-passing algorithm which, at each time step, has computational\ncomplexity O(d). The derivation is similar to that used for the iterative algorithm in\nExample 1. Let us define binary variables\n\nZ_j^n = 1{\u03bbj \u2264 n},   Z_\u2217^n = (Z_1^n, . . . , Z_d^n).   (11)\n\nThe idea is to compute P(Z_\u2217^n | \ud835\udc17_\u2217^n) recursively based on P(Z_\u2217^{n\u22121} | \ud835\udc17_\u2217^{n\u22121}). By Bayes rule, P(Z_\u2217^n | \ud835\udc17_\u2217^n)\nis proportional in Z_\u2217^n to P(Z_\u2217^n, X_\u2217^n | \ud835\udc17_\u2217^{n\u22121}) = P(X_\u2217^n | Z_\u2217^n) P(Z_\u2217^n | \ud835\udc17_\u2217^{n\u22121}), hence\n\nP(Z_\u2217^n | \ud835\udc17_\u2217^n) \u221d_{Z_\u2217^n} [ \u220f_{j\u2208V} P(X_j^n | Z_j^n) \u220f_{{i,j}\u2208E} P(X_ij^n | Z_i^n, Z_j^n) ] P(Z_\u2217^n | \ud835\udc17_\u2217^{n\u22121}),   (12)\n\nwhere we have used the fact that given Z_\u2217^n, X_\u2217^n is independent of \ud835\udc17_\u2217^{n\u22121}. To simplify notation, let us\nextend the edge set to \u1ebc := E \u222a {{j} : j \u2208 V}. This allows us to treat the private data of node j, i.e.,\n\ud835\udc17_j, as shared data of a self-loop in the extended graph (V, \u1ebc). Let ue(z; \u03be) := [ge(\u03be)]^{1\u2212z} [fe(\u03be)]^z\nfor e \u2208 \u1ebc, z \u2208 {0, 1}. 
Then, for i \u2260 j,\n\nP(X_j^n | Z_j^n) = uj(Z_j^n; X_j^n),   P(X_ij^n | Z_i^n, Z_j^n) = uij(Z_i^n \u2228 Z_j^n; X_ij^n).   (13)\n\nIt remains to express P(Z_\u2217^n | \ud835\udc17_\u2217^{n\u22121}) in terms of P(Z_\u2217^{n\u22121} | \ud835\udc17_\u2217^{n\u22121}). It is possible to do this, exactly, at\na cost of O(2^{|V|}). For brevity, we omit the exact expression. (See Lemma 1 for some details.) We\nterm the algorithm that employs the exact relationship, the \u201cexact algorithm\u201d.\n\nIn practice, however, the exponential complexity makes the exact recursion of little use for large\nnetworks. To obtain a fast algorithm (i.e., O(poly(d))), we instead take a mean-field type approximation:\n\nP(Z_\u2217^n | \ud835\udc17_\u2217^{n\u22121}) \u2248 \u220f_{j\u2208V} P(Z_j^n | \ud835\udc17_\u2217^{n\u22121}) = \u220f_{j\u2208V} \u03bd(Z_j^n; \u03b3_j^{n\u22121}[n]),   (14)\n\nwhere \u03bd(z; \u03b2) := \u03b2^z (1 \u2212 \u03b2)^{1\u2212z}. That is, we approximate a multivariate distribution by the product\nof its marginals. By an argument similar to that used to derive (8), we can obtain a recursion for the\nmarginals,\n\n\u03b3_j^{n\u22121}[n] = \u03c0j(n)/\u03c0j[n \u2212 1]^c + (\u03c0j[n]^c/\u03c0j[n \u2212 1]^c) \u03b3_j^{n\u22121}[n \u2212 1],   (15)\n\nwhere we have used the notation introduced earlier in (8). Thus, at time n, the RHS of (14) is known\nbased on values computed at time n \u2212 1 (with initial value \u03b3_j^0[0] = 0, j \u2208 V). 
Inserting this RHS\ninto (12) in place of P(Z_\u2217^n | \ud835\udc17_\u2217^{n\u22121}), we obtain a graphical model in variables Z_\u2217^n (instead of \u03bb\u2217)\nwhich has the same form as (9) with \u03bd(Z_j^n; \u03b3_j^{n\u22121}[n]) playing the role of the prior \u03c0(\u03bbj).\n\nIn order to obtain the marginals \u03b3_j^n[n] = P(Z_j^n = 1 | \ud835\udc17_\u2217^n) and \u03b3_ij^n[n] with respect to the approximate\nversion of the joint distribution P(Z_\u2217^n, X_\u2217^n | \ud835\udc17_\u2217^{n\u22121}), we need to marginalize out the latent variables\nZ_j^n, for which a standard sum-product algorithm can be applied (see [2, 3, 18]). The message\nupdate equations are similar to those in [18]; the difference is that the messages are now binary and\ndo not grow in size with n.\n\n4.2 Convergence of MP algorithms\n\nWe now turn to the analysis of the approximate algorithm introduced in Section 4.1. In particular, we\nwill look at the evolution of {P\u0303(Z_\u2217^n | \ud835\udc17_\u2217^n)}n\u2208N as a sequence of probability distributions on {0, 1}^d.\nHere, P\u0303 signifies that this sequence is an approximation. In order to make a meaningful comparison,\nwe also look at the algorithm which computes the exact sequence {P(Z_\u2217^n | \ud835\udc17_\u2217^n)}n\u2208N, recursively. As\nmentioned before, this we will call the \u201cexact algorithm\u201d, the details of which are not of concern to\nus at this point (cf. Prop. 1 for these details).\n\nRecall that we take P\u0303(Z_\u2217^n | \ud835\udc17_\u2217^n) and P(Z_\u2217^n | \ud835\udc17_\u2217^n), as distributions for Z_\u2217^n, to be elements of Pd \u2282 Rm.\nTo make this correspondence formal and the notation simplified, we use the symbol :\u2261 as follows\n\n\u1ef9n :\u2261 P\u0303(Z_\u2217^n | \ud835\udc17_\u2217^n),   yn :\u2261 P(Z_\u2217^n | \ud835\udc17_\u2217^n),   (16)\n\nwhere now \u1ef9n, yn \u2208 Pd. Note that \u1ef9n and yn are random elements of Pd, due to the randomness of\n\ud835\udc17_\u2217^n. We have the following description.\n\nProposition 1. The exact and approximate sequences, {yn} and {\u1ef9n}, follow general iteration (2)\nwith the same random sequence {\u03b8n}, but with different deterministic operators T, denoted respectively\nwith Tex and Tap. Tex is linear and given by a Markov transition kernel. Tap is a polynomial\nmap of degree d. Both maps are Lipschitz and we have\n\nLipTex \u2264 L\u03c1 := (1 \u2212 \u220f_{j=1}^d \u03c1j),   LipTap \u2264 K\u03c1 := \u2211_{j=1}^d (1 \u2212 \u03c1j).   (17)\n\nDetailed descriptions of the sequence {\u03b8n} and the operators Tex and Tap are given in [17]. As\nsuggested by Theorem 1, a key assumption for the convergence of the approximate algorithm will\nbe K\u03c1 \u2264 1. In contrast, we always have L\u03c1 \u2264 1.\n\nRecall that {\u03bbj} are the change points and their priors are geometric with parameters {\u03c1j}. We\nanalyze the algorithms, once all the change points have happened. More precisely, we condition\non Mn0 := {maxj \u03bbj \u2264 n0} for some n0 \u2208 N. Then, one expects the (joint) posterior of Z_\u2217^n to\ncontract to the point Z_j^\u221e = 1, for all j \u2208 V. In the vectorial notation, we expect both {\u1ef9n} and\n{yn} to converge to e(0). Theorem 2 below quantifies this convergence in \u21131 norm (equivalently,\ntotal variation for measures).\n\nRecall pre-change and post-change densities ge and fe, and let Ie denote their KL divergence, that\nis, Ie := \u222b fe log(fe/ge). We will assume that\n\nYe := log(ge(X)/fe(X)) with X \u223c fe   (18)\n\nis sub-Gaussian, for all e \u2208 \u1ebc, where \u1ebc is the extended edge notation introduced in Section 4.1. The\nchoice X \u223c fe is in accordance with conditioning on Mn0. Note that E Ye = \u2212Ie < 0. We define\n\n\u03c3max := max_{e\u2208\u1ebc} \u2016Ye\u2016_{\u03c82},   Imin := min_{e\u2208\u1ebc} Ie,   I\u2217(\u03ba) := Imin \u2212 \u03ba \u03c3max \u221a(log D),\n\nwhere D := |V| + |E|. Theorem 1 and Lemma 1 give us the following. (See [17] for the proof.)\n\nFigure 1: Top row illustrates a network (left), which induces a graphical model (middle). Right panel\nillustrates one stage of message-passing to compute posterior probabilities \u03b3_j^n[n]. Bottom row illustrates typical\nexamples of posterior paths, n \u21a6 \u03b3_j^n[n], obtained by EXACT and approximate (APPROX) message passing,\nfor the subgraph on nodes {1, 2, 3, 4}. The change points are designated with vertical dashed lines.\n\nTheorem 2. 
There exists an absolute constant \u03ba > 0, such that if I\u2217(\u03ba) > 0, the exact algorithm\nconverges at least geometrically w.h.p., that is, for all n \u2265 1,\n\n\u2016y_{n+n0} \u2212 e(0)\u2016 \u2264 2 ((1 \u2212 y_{n0}^0)/y_{n0}^0) (L\u03c1 e^{\u2212I\u2217(\u03ba)+\u03b5})^n   (19)\n\nwith probability at least 1 \u2212 exp[\u2212c n \u03b5^2/(\u03c3max^2 D^2 log D)], conditioned on Mn0. If in addition,\nK\u03c1 \u2264 1, the approximate algorithm also converges at least geometrically w.h.p., i.e., for all n \u2265 1,\n\n\u2016\u1ef9_{n+n0} \u2212 e(0)\u2016 \u2264 2 ((1 \u2212 \u1ef9_{n0}^0)/\u1ef9_{n0}^0) (K\u03c1 e^{\u2212I\u2217(\u03ba)+\u03b5})^n   (20)\n\nwith the same (conditional) probability as the exact algorithm.\n\n4.3 Simulation results\n\nWe present some simulation results to verify the effectiveness of the proposed approximation algorithm\nin estimating the posterior probabilities \u03b3_j^n[n]. We consider a star graph on d = 4 nodes.\nThis is the subgraph on nodes {1, 2, 3, 4} in Fig. 1. Conditioned on the change points \u03bb\u2217, all data\nsequences \ud835\udc17_\u2217 are assumed Gaussian with variance 1, pre-change mean 1 and post-change mean\nzero. All priors are geometric with \u03c1j = 0.1. We note that higher values of \u03c1j yield even faster\nconvergence in the simulations, but we omit these figures due to space constraints. Fig. 1 illustrates\ntypical examples of posterior paths n \u21a6 \u03b3_j^n[n], for both the exact and approximate MP algorithms.\nOne can observe that the approximate path often closely follows the exact one. In some cases, they\nmight deviate for a while, but as suggested by Theorem 2, they approach one another quickly, once\nthe change points have occurred.\n\nFrom the theorem and triangle inequality, it follows that under I\u2217(\u03ba) > 0 and K\u03c1 \u2264 1, \u2016yn \u2212 \u1ef9n\u2016\nconverges to zero, at least geometrically w.h.p. 
This gives some theoretical explanation for the good\ntracking behavior of the approximate algorithm as observed in Fig. 1.\n\n5 Proof of Theorem 1\n\nFor x \u2208 Rm (including Pd), we write x = (x0, x\u0303) where x\u0303 = (x1, . . . , xm\u22121). Recall that e(0) =\n(1, 0, . . . , 0) and \u2016x\u2016 = \u2211_{i=0}^{m\u22121} |xi|. For x \u2208 Pd, we have 1 \u2212 x0 = \u2016x\u0303\u2016, and\n\n\u2016x \u2212 e(0)\u2016 = \u2016(x0 \u2212 1, x\u0303)\u2016 = 1 \u2212 x0 + \u2016x\u0303\u2016 = 2(1 \u2212 x0).   (21)\n\nFor \u03b8 = (\u03b80, \u03b8\u0303) \u2208 R^m_+, let\n\n\u03b8\u2217 := \u2016\u03b8\u0303\u2016_\u221e = max_{i=1,...,m\u22121} \u03b8i,   \u03b8\u2020 := (\u03b80, (\u03b8\u2217L) 1m\u22121) \u2208 R^m_+,   (22)\n\nwhere 1m\u22121 is a vector in R^{m\u22121} whose coordinates are all ones. We start by investigating how\n\u2016q\u03b8(x) \u2212 e(0)\u2016 varies as a function of \u2016x \u2212 e(0)\u2016.\n\nLemma 1. For L \u2264 1, \u03b8\u2217 > 0, and \u03b80 = 1,\n\nN := sup_{x,y \u2208 Pd, \u2016x\u2212e(0)\u2016 \u2264 L\u2016y\u2212e(0)\u2016} \u2016q\u03b8(x) \u2212 e(0)\u2016 / \u2016q\u03b8\u2020(y) \u2212 e(0)\u2016 = 1.   (23)\n\nLemma 1 is proved in [17]. We now proceed to the proof of the theorem. Recall that T : Pd \u2192 Pd\nis an L-Lipschitz map, and that e(0) is a fixed point of T, that is, T(e(0)) = e(0). It follows that for\nany x \u2208 Pd, \u2016T(x) \u2212 e(0)\u2016 \u2264 L\u2016x \u2212 e(0)\u2016. Applying Lemma 1, we get\n\n\u2016q\u03b8(T(x)) \u2212 e(0)\u2016 \u2264 \u2016q\u03b8\u2020(x) \u2212 e(0)\u2016   (24)\n\nfor \u03b8 \u2208 R^m_+ with \u03b80 = 1, and x \u2208 Pd. (This holds even if \u03b8\u2217 = 0 where both sides are zero.)\n\nRecall the sequence {\u03b8n}n\u22651 used in defining functions {Qn} according to (2), and the assumption\nthat \u03b8_n^0 = 1, for all n \u2265 1. Inequality (24) is key in allowing us to peel operator T, and bring\nsuccessive elements of {q\u03b8n} together. 
Then, we can exploit the semi-group property (3) on adjacent\nelements of {q\u03b8n}.\n\nTo see this, for each \u03b8n, let \u03b8_n^\u2217 and \u03b8_n^\u2020 be defined as in (22). Applying (24) with x replaced with\nQn\u22121(x), and \u03b8 with \u03b8n, we can write\n\n\u2016Qn(x) \u2212 e(0)\u2016 \u2264 \u2016q_{\u03b8_n^\u2020}(Qn\u22121(x)) \u2212 e(0)\u2016   (by (24))\n= \u2016q_{\u03b8_n^\u2020}(q\u03b8n\u22121(T(Qn\u22122(x)))) \u2212 e(0)\u2016\n= \u2016q_{\u03b8_n^\u2020 \u2299 \u03b8n\u22121}(T(Qn\u22122(x))) \u2212 e(0)\u2016   (by semi-group property (3)).\n\nWe note that (\u03b8_n^\u2020 \u2299 \u03b8n\u22121)\u2217 = L \u03b8_n^\u2217 \u03b8_{n\u22121}^\u2217 and\n\n(\u03b8_n^\u2020 \u2299 \u03b8n\u22121)\u2020 = (1, L(\u03b8_n^\u2020 \u2299 \u03b8n\u22121)\u2217 1m\u22121) = (1, L^2 \u03b8_n^\u2217 \u03b8_{n\u22121}^\u2217 1m\u22121).\n\nHere, \u2217 and \u2020 act on a general vector in the sense of (22). Applying (24) once more, we get\n\n\u2016Qn(x) \u2212 e(0)\u2016 \u2264 \u2016q_{(1, L^2 \u03b8_n^\u2217 \u03b8_{n\u22121}^\u2217 1m\u22121)}(Qn\u22122(x)) \u2212 e(0)\u2016.\n\nThe pattern is clear. Letting \u03b7n := L^n \u220f_{k=1}^n \u03b8_k^\u2217, we obtain by induction\n\n\u2016Qn(x) \u2212 e(0)\u2016 \u2264 \u2016q_{(1, \u03b7n 1m\u22121)}(Q0(x)) \u2212 e(0)\u2016.   (25)\n\nRecall that Q0(x) := x. Moreover,\n\n\u2016q_{(1, \u03b7n 1m\u22121)}(x) \u2212 e(0)\u2016 = 2(1 \u2212 [q_{(1, \u03b7n 1m\u22121)}(x)]0) = 2(1 \u2212 g_{\u03b7n}(x0)),   (26)\n\nwhere the first equality is by (21), and the second is easily verified by noting that all the elements\nof (1, \u03b7n 1m\u22121), except the first, are equal. 
Putting (25) and (26) together with the bound\n\n1 \u2212 g_\u03b8(r) = \u03b8(1 \u2212 r) / (r + \u03b8(1 \u2212 r)) \u2264 \u03b8 (1 \u2212 r) / r,\n\nwhich holds for \u03b8 > 0 and r \u2208 (0, 1], we obtain \u2016Q_n(x) \u2212 e^{(0)}\u2016 \u2264 2 \u03b7_n (1 \u2212 x_0) / x_0.\n\nBy the sub-Gaussianity assumption on {log \u03b8_k^\u2217}, we have\n\nP( (1/n) \u2211_{k=1}^n log \u03b8_k^\u2217 \u2212 E log \u03b8_1^\u2217 > \u03b5 ) \u2264 exp(\u2212c n \u03b5^2 / \u03c3_\u2217^2),    (27)\n\nfor some absolute constant c > 0. (Recall that \u03c3_\u2217 is an upper bound on the sub-Gaussian norm \u2016log \u03b8_1^\u2217\u2016_{\u03c8_2}.) On the complement of the event in (27), we have \u220f_{k=1}^n \u03b8_k^\u2217 \u2264 e^{n(\u2212I_\u2217 + \u03b5)}, which completes the proof.\n\nAcknowledgments\n\nThis work was supported in part by NSF grants CCF-1115769 and OCI-1047871.\n\nReferences\n\n[1] A. N. Shiryayev. Optimal Stopping Rules. Springer-Verlag, 1978.\n\n[2] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.\n\n[3] M. I. Jordan. Graphical models. Statistical Science, 19:140\u2013155, 2004.\n\n[4] P. Diaconis and D. Freedman. Iterated random functions. SIAM Rev., 41(1):45\u201376, 1999.\n\n[5] O. P. Kreidl and A. Willsky. Inference with minimum communication: a decision-theoretic variational approach. In NIPS, 2007.\n\n[6] M. Cetin, L. Chen, J. W. Fisher III, A. Ihler, R. Moses, M. Wainwright, and A. Willsky. Distributed fusion in sensor networks: A graphical models perspective. IEEE Signal Processing Magazine, July:42\u201355, 2006.\n\n[7] X. Nguyen, A. A. Amini, and R. Rajagopal. Message-passing sequential detection of multiple change points in networks. In ISIT, 2012.\n\n[8] A. Frank, P. Smyth, and A. Ihler. A graphical model representation of the track-oriented multiple hypothesis tracker. In Proceedings, IEEE Statistical Signal Processing (SSP), August 2012.\n\n[9] A. T. Ihler, J. W. Fisher III, and A. S.\n
Willsky. Loopy belief propagation: Convergence and effects of message errors. Journal of Machine Learning Research, 6:905\u2013936, May 2005.\n\n[10] A. Ihler. Accuracy bounds for belief propagation. In Proceedings of UAI 2007, July 2007.\n\n[11] T. G. Roosta, M. Wainwright, and S. S. Sastry. Convergence analysis of reweighted sum-product algorithms. IEEE Trans. Signal Processing, 56(9):4293\u20134305, 2008.\n\n[12] D. Steinsaltz. Locally contractive iterated function systems. Ann. Probab., 27(4):1952\u20131979, 1999.\n\n[13] W. B. Wu and M. Woodroofe. A central limit theorem for iterated random functions. J. Appl. Probab., 37(3):748\u2013755, 2000.\n\n[14] W. B. Wu and X. Shao. Limit theorems for iterated random functions. J. Appl. Probab., 41(2):425\u2013436, 2004.\n\n[15] \u00d6. Stenflo. A survey of average contractive iterated function systems. J. Diff. Equa. and Appl., 18(8):1355\u20131380, 2012.\n\n[16] A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.\n\n[17] A. A. Amini and X. Nguyen. Bayesian inference as iterated random functions with applications to sequential inference in graphical models. arXiv preprint.\n\n[18] A. A. Amini and X. Nguyen. Sequential detection of multiple change points in networks: a graphical model approach. IEEE Transactions on Information Theory, 59(9):5824\u20135841, 2013.", "award": [], "sourceid": 1339, "authors": [{"given_name": "Arash", "family_name": "Amini", "institution": "University of Michigan"}, {"given_name": "XuanLong", "family_name": "Nguyen", "institution": "University of Michigan"}]}