{"title": "Accumulator Networks: Suitors of Local Probability Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 486, "page_last": 492, "abstract": null, "full_text": "Accumulator networks: Suitors of local probability propagation

Brendan J. Frey and Anitha Kannan
Intelligent Algorithms Lab, University of Toronto, www.cs.toronto.edu/~frey

Abstract

One way to approximate inference in richly-connected graphical models is to apply the sum-product algorithm (a.k.a. probability propagation algorithm), while ignoring the fact that the graph has cycles. The sum-product algorithm can be directly applied in Gaussian networks and in graphs for coding, but for many conditional probability functions - including the sigmoid function - direct application of the sum-product algorithm is not possible. We introduce "accumulator networks" that have low local complexity (but exponential global complexity) so the sum-product algorithm can be directly applied. In an accumulator network, the probability of a child given its parents is computed by accumulating the inputs from the parents in a Markov chain or more generally a tree. After giving expressions for inference and learning in accumulator networks, we give results on the "bars problem" and on the problem of extracting translated, overlapping faces from an image.

1 Introduction

Graphical probability models with hidden variables are capable of representing complex dependencies between variables, filling in missing data and making Bayes-optimal decisions using probabilistic inferences (Hinton and Sejnowski 1986; Pearl 1988; Neal 1992). Large, richly-connected networks with many cycles can potentially be used to model complex sources of data, such as audio signals, images and video.
However, when the number of cycles in the network is large (more precisely, when the cut set size is exponential), exact inference becomes intractable. Also, to learn a probability model with hidden variables, we need to fill in the missing data using probabilistic inference, so learning also becomes intractable.

To cope with the intractability of exact inference, a variety of approximate inference methods have been invented, including Monte Carlo (Hinton and Sejnowski 1986; Neal 1992), Helmholtz machines (Dayan et al. 1995; Hinton et al. 1995), and variational techniques (Jordan et al. 1999).

Recently, the sum-product algorithm (a.k.a. probability propagation, belief propagation) (Pearl 1988) became a major contender when it was shown to produce astounding performance on the problem of error-correcting decoding in graphs with over 1,000,000 variables and cut set sizes exceeding 2^{100,000} (Frey and Kschischang 1996; Frey and MacKay 1998; McEliece et al. 1998).

The sum-product algorithm passes messages in both directions along the edges in a graphical model and fuses these messages at each vertex to compute an estimate of P(variable | obs), where obs is the assignment of the observed variables.

Figure 1: The sum-product algorithm passes messages in both directions along each edge in a Bayesian network. Each message is a function of the parent. (a) Incoming messages are fused to compute an estimate of P(y | observations). (b) Messages are combined to produce an outgoing message \pi_k(y). (c) Messages are combined to produce an outgoing message \lambda_j(x_j). Initially, all messages are set to 1. Observations are accounted for as described in the text.

In a directed graphical model (Bayesian belief network) the message on an edge is a function of the parent of the edge.
The messages are initialized to 1 and then the variables are processed in some order or in parallel. Each variable fuses incoming messages and produces outgoing messages, accounting for observations as described below.

If x_1, ..., x_J are the parents of a variable y and z_1, ..., z_K are the children of y, messages are fused at y to produce the function F(y) as follows (see Fig. 1a):

F(y) = \Big(\prod_k \lambda_k(y)\Big)\Big(\sum_{x_1} \cdots \sum_{x_J} P(y|x_1,\ldots,x_J) \prod_j \pi_j(x_j)\Big) \approx P(y, \mathrm{obs}),   (1)

where P(y|x_1,...,x_J) is the conditional probability function associated with y. If the graph is a tree and if messages are propagated from every variable in the network to y, as described below, the estimate is exact: F(y) = P(y, obs). Also, normalizing F(y) gives P(y | obs). If the graph has cycles, this inference is approximate.

The message \pi_k(y) passed from y to z_k is computed as follows (see Fig. 1b):

\pi_k(y) = F(y)/\lambda_k(y).   (2)

The message \lambda_j(x_j) passed from y to x_j is computed as follows (see Fig. 1c):

\lambda_j(x_j) = \sum_y \sum_{x_1} \cdots \sum_{x_{j-1}} \sum_{x_{j+1}} \cdots \sum_{x_J} \Big(\prod_k \lambda_k(y)\Big) P(y|x_1,\ldots,x_J) \Big(\prod_{i \neq j} \pi_i(x_i)\Big).   (3)

Notice that x_j is not summed over and is excluded from the product of the \pi-messages on the right.

If y is observed to have the value y*, the fused result at y and the outgoing \pi messages are modified as follows:

F(y) \leftarrow \begin{cases} F(y) & \text{if } y = y^* \\ 0 & \text{otherwise,} \end{cases} \qquad \pi_k(y) \leftarrow \begin{cases} \pi_k(y) & \text{if } y = y^* \\ 0 & \text{otherwise.} \end{cases}   (4)

The outgoing \lambda messages are computed as follows:

\lambda_j(x_j) = \sum_{x_1} \cdots \sum_{x_{j-1}} \sum_{x_{j+1}} \cdots \sum_{x_J} \Big(\prod_k \lambda_k(y^*)\Big) P(y = y^*|x_1,\ldots,x_J) \Big(\prod_{i \neq j} \pi_i(x_i)\Big).   (5)

If the graph is a tree, these formulas can be derived quite easily using the fact that summations distribute over products. If the graph is not a tree, a local independence assumption can be made to justify these formulas.
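As a concrete illustration, the fusion rule (1) and the outgoing-message rule (2) can be sketched in code. This is our own illustrative sketch, not from the paper: the function names and the dictionary representation of the conditional probability table are assumptions, and the sum over parent configurations is done by brute force, which is exactly the exponential cost that accumulator networks are designed to avoid.

```python
from itertools import product

def fuse(cond_table, pi_msgs, lam_msgs):
    """Eq. (1): F(y) = (prod_k lam_k(y)) * sum over parent configs
    of P(y|x_1..x_J) * prod_j pi_j(x_j).
    cond_table maps each parent configuration (x_1,...,x_J) to a
    list giving P(y|x_1,...,x_J) for every value of y."""
    n_y = len(next(iter(cond_table.values())))
    F = [0.0] * n_y
    sizes = [len(pi) for pi in pi_msgs]
    for config in product(*(range(s) for s in sizes)):
        w = 1.0
        for pi, xj in zip(pi_msgs, config):   # prod_j pi_j(x_j)
            w *= pi[xj]
        for y in range(n_y):
            F[y] += w * cond_table[config][y]
    for lam in lam_msgs:                      # prod_k lam_k(y)
        F = [F[y] * lam[y] for y in range(n_y)]
    return F

def pi_out(F, lam_k):
    """Eq. (2): message from y to child k, pi_k(y) = F(y)/lam_k(y)."""
    return [F[y] / lam_k[y] for y in range(len(F))]

# Example: y = x_1 OR x_2, uniform pi messages, one uninformative child.
cond = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0],
        (1, 0): [0.0, 1.0], (1, 1): [0.0, 1.0]}
F = fuse(cond, [[0.5, 0.5], [0.5, 0.5]], [[1.0, 1.0]])  # -> [0.25, 0.75]
```

Normalizing F then gives the estimate of P(y | obs); on a tree the estimate is exact, with cycles it is the loopy approximation.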
In any case, the algorithm computes products and summations locally in the graph, so it is often called the "sum-product" algorithm.

Figure 2: The local complexity of a richly connected directed graphical model such as the one in (a) can be simplified by assuming that the effects of a child's parents are accumulated by a low-complexity Markov chain as shown in (b). (c) The general structure of the "accumulator network" considered in this paper.

2 Accumulator networks

The complexity of the local computations at a variable generally scales exponentially with the number of parents of the variable. For example, fusion (1) requires summing over all configurations of the parents. However, for certain types of conditional probability function P(y|x_1,...,x_J), this exponential sum reduces to a linear-time computation. For example, if P(y|x_1,...,x_J) is an indicator function for y = x_1 XOR x_2 XOR ... XOR x_J (a common function for error-correcting coding), the summation can be computed in linear time using a trellis (Frey and MacKay 1998). If the variables are real-valued and P(y|x_1,...,x_J) is Gaussian with mean given by a linear function of x_1,...,x_J, the integration can be computed using linear algebra (c.f. Weiss and Freeman 2000; Frey 2000).

In contrast, exact local computation for the sigmoid function, P(y|x_1,\ldots,x_J) = 1/(1 + \exp[-\theta_0 - \sum_j \theta_j x_j]), requires the full exponential sum. Barber (2000) considers approximating this sum using a central limit theorem approximation.

In an "accumulator network", the probability of a child given its parents is computed by accumulating the inputs from the parents in a Markov chain or more generally a tree. (For simplicity, we use Markov chains in this paper.) Fig. 2a and b show how a layered Bayesian network can be redrawn as an accumulator network.
Each accumulation variable (state variable in the accumulation chain) has just 2 parents, and the number of computations needed for the sum-product computations for each variable in the original network now scales with the number of parents and the maximum state size of the accumulation chain in the accumulator network.

Fig. 2c shows the general form of accumulator network considered in this paper, which corresponds to a fully connected Bayes net on variables x_1,...,x_N. In this network, the variables are x_1,...,x_N and the accumulation variables for x_i are s_{i,1},...,s_{i,i-1}. The effect of variable x_j on child x_i is accumulated by s_{i,j}. The joint distribution over the variables X = \{x_i : i = 1,\ldots,N\} and the accumulation variables S = \{s_{i,j} : i = 1,\ldots,N,\ j = 1,\ldots,i-1\} is

P(X, S) = \prod_{i=1}^{N} \Big[\Big(\prod_{j=1}^{i-1} P(s_{i,j}|x_j, s_{i,j-1})\Big) P(x_i|s_{i,i-1})\Big].   (6)

If x_j is not a parent of x_i in the original network, we set P(s_{i,j}|x_j, s_{i,j-1}) = 1 if s_{i,j} = s_{i,j-1} and P(s_{i,j}|x_j, s_{i,j-1}) = 0 if s_{i,j} \neq s_{i,j-1}.

A well-known example of an accumulator network is the noisy-OR network (Pearl 1988; Neal 1992). In this case, all variables are binary and we set

P(s_{i,j} = 1|x_j, s_{i,j-1}) = \begin{cases} 1 & \text{if } s_{i,j-1} = 1, \\ \rho_{i,j} & \text{if } x_j = 1 \text{ and } s_{i,j-1} = 0, \\ 0 & \text{otherwise,} \end{cases}   (7)

where \rho_{i,j} is the probability that x_j = 1 turns on the OR-chain.

Using an accumulation chain whose state space size equals the number of configurations of the parent variables, we can produce an accumulator network that can model the same joint distributions on x_1,...,x_N as any Bayesian network.

Inference in an accumulator network is performed by passing messages as described above, either in parallel, at random, or in a regular fashion, such as up the accumulation chains, left to the variables, right to the accumulation chains and down the accumulation chains, iteratively.
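To make the noisy-OR accumulation chain of Eq. (7) concrete, the sketch below (our own illustration; the function names are not from the paper) runs the chain forward over the parents and returns the marginal probability that the chain is on at the end, which agrees with the familiar noisy-OR closed form 1 - \prod_{j: x_j = 1} (1 - \rho_{i,j}).

```python
def noisy_or_step(s_prev, x_j, rho_ij):
    """Eq. (7): distribution of s_{i,j} given x_j and s_{i,j-1}."""
    if s_prev == 1:
        return {1: 1.0}                      # chain stays on
    if x_j == 1:
        return {1: rho_ij, 0: 1.0 - rho_ij}  # x_j turns it on w.p. rho
    return {0: 1.0}                          # x_j = 0 has no effect

def p_child_on(parents, rhos):
    """Marginal P(chain is on) after accumulating all parents."""
    dist = {0: 1.0}                          # chain starts off
    for x_j, rho in zip(parents, rhos):
        new = {0: 0.0, 1: 0.0}
        for s_prev, p in dist.items():
            for s_new, q in noisy_or_step(s_prev, x_j, rho).items():
                new[s_new] += p * q
        dist = new
    return dist[1]

# Parents x = (1, 0, 1): P(on) = 1 - (1-0.8)(1-0.5) = 0.9
print(p_child_on([1, 0, 1], [0.8, 0.5, 0.5]))  # -> 0.9
```

The chain costs time linear in the number of parents, which is the point of the accumulator construction: each step touches only two variables, so the sum-product messages at each accumulation variable are cheap.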
Later, we give results for an accumulator network that extracts images of translated, overlapping faces from a visual scene. The accumulation variables represent intensities of light rays at different depths in a layered 3-D scene.

2.1 Learning accumulator networks

To learn the conditional probability functions in an accumulator network, we apply the sum-product algorithm for each training case to compute sufficient statistics. Following Russell and Norvig (1995), the sufficient statistic needed to update the conditional probability function P(s_{i,j}|x_j, s_{i,j-1}) for s_{i,j} in Fig. 2c is P(s_{i,j}, x_j, s_{i,j-1}|obs). In particular,

\frac{\partial \log P(\mathrm{obs})}{\partial P(s_{i,j}|x_j, s_{i,j-1})} = \frac{P(s_{i,j}, x_j, s_{i,j-1}|\mathrm{obs})}{P(s_{i,j}|x_j, s_{i,j-1})}.   (8)

P(s_{i,j}, x_j, s_{i,j-1}|obs) is approximated by normalizing the product of P(s_{i,j}|x_j, s_{i,j-1}) and the \lambda and \pi messages arriving at s_{i,j}. (This approximation is exact if the graph is a tree.)

The sufficient statistics can be used for online learning or batch learning. If batch learning is used, the sufficient statistics are averaged over the training set and then the conditional probability functions are modified. In fact, the conditional probability function P(s_{i,j}|x_j, s_{i,j-1}) can be set equal to the normalized form of the average sufficient statistic, in which case learning performs approximate EM, where the E-step is approximated by the sum-product algorithm.

3 The bars problem

Fig. 3a shows the network structure for the binary bars problem and Fig. 3b shows 30 training examples. For an N x N binary image, the network has 3 layers of binary variables: 1 top-layer variable (meant to select orientation); 2N middle-layer variables (meant to select bars); and N^2 bottom-layer image variables. For large N, performing exact inference is computationally intractable and hence the need for approximate inference.
Accumulator networks enable efficient inference using probability propagation since local computations are made feasible. The topology of the accumulator network can be easily tailored to the bars problem, as described above.

Given an accumulator network with the proper conditional probability tables, inference computes the probability of each bar and the probability of vertical versus horizontal orientation for an input image.

Figure 3: (a) Bayesian network for the bars problem. (b) Examples of typical images. (c) KL divergence between approximate inference and exact inference after each iteration.

After each iteration of probability propagation, messages are fused to produce estimates of these probabilities. Fig. 3c shows the Kullback-Leibler divergence between these approximate probabilities and the exact probabilities after each iteration, for 5 input images. The figure also shows the most probable configuration found by approximate inference. In most cases, we found that probability propagation correctly infers the presence of appropriate bars and the overall orientation of the bars. In cases of multiple interpretations of the image (e.g., Fig. 3c, image 4), probability propagation tended to find appropriate interpretations, although the divergence between the approximate and exact inferences is larger.

Starting with an accumulator network with random parameters, we trained the network as described above. Fig. 4 shows the online learning curves corresponding to different learning rates.
The log-likelihood oscillates and although the optimum (horizontal line) is not reached, the results are encouraging.

Figure 4: Learning curves for learning rates .05, .075 and .1.

4 Accumulating light rays for layered vision

We give results on an accumulator network that extracts image components from scenes constructed from different types of overlapping faces at random positions. Suppose we divide up a 3-D scene into L layers and assume that one of O objects can sit in each layer in one of P positions. The total number of object-position combinations per layer is K = O x P. For notational convenience, we assume that each object-position pair is a different object modeled by an opaqueness map (probability that each pixel is opaque) and an appearance map (intensity of each pixel). We constrain the opaqueness and appearance maps of the same object in different positions to be the same, up to translation. Fig. 5a shows the appearance maps of 4 such objects (the first one is a wall).

Figure 5: (a) Learned appearance maps for a wall (all pixels dark and nearly opaque) and 3 faces. (b) An image produced by combining the maps in (a) and adding noise. (c) Object-specific segmentation maps. The brightness of a pixel in the kth picture corresponds to the probability that the pixel is imaged by object k.

In our model, \rho_{kn} is the probability that the nth pixel of object k is opaque and w_{kn} is the intensity of the nth pixel for object k. The input images are modeled by randomly picking an object in each of L layers, choosing whether each pixel in each layer is transparent or opaque, accumulating light intensity by imaging the pixels through the layers, and then adding Gaussian noise.

Fig. 6 shows the accumulator network for this model. z^l \in \{0, 1, \ldots, K\} is the index
of the object in the lth layer, where layer 1 is adjacent to the camera and layer L is farthest from the camera. y_n^l is the accumulated discrete intensity of the light ray for pixel n at layer l. y_n^l depends on the identity of the object in the current layer z^l and the intensity of pixel n in the previous layer y_n^{l+1}. So,

P(y_n^l|z^l, y_n^{l+1}) = \begin{cases} 1 & \text{if } z^l = 0,\ y_n^l = y_n^{l+1}, \\ 1 & \text{if } z^l > 0,\ y_n^l = w_{z^l n} = y_n^{l+1}, \\ \rho_{z^l n} & \text{if } z^l > 0,\ y_n^l = w_{z^l n} \neq y_n^{l+1}, \\ 1 - \rho_{z^l n} & \text{if } z^l > 0,\ y_n^l = y_n^{l+1} \neq w_{z^l n}, \\ 0 & \text{otherwise.} \end{cases}   (9)

Each condition corresponds to a different imaging operation at layer l for the light ray corresponding to pixel n: an empty layer passes the ray through, an opaque pixel (probability \rho_{z^l n}) overwrites the ray with the object's intensity, and a transparent pixel passes the ray through unchanged. x_n is the discretized intensity of pixel n, obtained from the light ray arriving at the camera, y_n^1. P(x_n|y_n^1) adds Gaussian noise to y_n^1.

After training the network on 200 labeled images, we applied iterative inference to identify and locate image components. After each iteration, the message passed from y_n^l to z^l is an estimate of the probability that the light ray for pixel n is imaged by object z^l at layer l (i.e., not occluded by other objects). So, for each object at each layer, we have an n-pixel "probabilistic segmentation map". In Fig. 5c we show the 4 maps in layer 1 corresponding to the objects shown in Fig. 5a, obtained after 12 iterations of the sum-product algorithm.

One such set of segmentation maps can be drawn for each layer. For deeper layers, the maps hopefully segment the part of the scene that sits behind the objects in the shallower layers. Fig. 7a shows the sets of segmentation maps corresponding to different layers, after each iteration of probability propagation, for the input image shown on the far right. After 1 iteration, the segmentation in the first layer is quite poor, causing uncertain segmentation in deeper layers (except for the wall, which is mostly segmented properly in layer 2).
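The generative process behind Eq. (9) can be sketched for a single pixel. This is our own illustration, not the paper's implementation: the function name, the per-pixel dictionaries, and the explicit background-intensity argument are assumptions, and z = 0 is taken to mean an empty layer.

```python
import random

def render_pixel(zs, w, rho, background, noise_sigma=0.0):
    """Accumulate the light ray for one pixel through the layers.
    zs:  object index per layer, zs[0] adjacent to the camera,
         0 = empty layer (assumption in this sketch).
    w[k], rho[k]: appearance intensity and opaqueness probability
         of object k at this pixel."""
    y = background
    for z in reversed(zs):               # farthest layer first
        if z > 0 and random.random() < rho[z]:
            y = w[z]                     # opaque pixel overwrites the ray
    return y + random.gauss(0.0, noise_sigma)

# Fully opaque objects: the front-most occupied layer wins.
print(render_pixel([2, 0, 1], {1: 5.0, 2: 9.0}, {1: 1.0, 2: 1.0}, 0.0))
```

Processing layers back to front makes the occlusion semantics of Eq. (9) explicit: a ray keeps its previous intensity through transparent or empty layers and is replaced wherever an opaque pixel intervenes, before camera noise is added.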
As the number of iterations increases, the algorithm converges to the correct segmentation, where object 2 is in front, followed by objects 3, 4 and 1 (the wall).

It may appear from the input image in Fig. 7a that another possible depth ordering is object 2 in front, followed by objects 4, 3 and 1 - i.e., objects 3 and 4 may be reversed. However, it turns out that if this were the order, a small amount of dark hair from the top of the horizontal head would be showing.

Figure 6: An accumulator network for layered vision.

We added an extremely large amount of noise to the image used above, to see what the algorithm would do when the two depth orders really are equally likely. Fig. 7b shows the noisy image and the series of segmentation maps produced at each layer as the number of iterations increases. The segmentation maps for layer 1 show that object 2 is correctly identified as being in the front.
Quite surprisingly, the segmentation maps in layer 2 oscillate between the two plausible interpretations of the scene - object 3 in front of object 4 and object 4 in front of object 3. Although we do not yet know how robust these oscillations are, or how accurately they reflect the probability masses in the different modes, this behavior is potentially very useful.

Figure 7: (a) Probabilistic segmentation maps for each layer (column) after each iteration (row) of probability propagation for the image on the far right. (b) When a large amount of noise is added to the image, the network oscillates between interpretations.

References

D. Barber 2000. Tractable belief propagation. The Learning Workshop, Snowbird, UT.

B. J. Frey and F. R. Kschischang 1996. Probability propagation and iterative decoding. Proceedings of the 34th Allerton Conference on Communication, Control and Computing 1996, University of Illinois at Urbana.

B. J. Frey and D. J. C. MacKay 1998. A revolution: Belief propagation in graphs with cycles. In M. I. Jordan, M. J. Kearns and S. A. Solla (eds) Advances in Neural Information Processing Systems 10, MIT Press, Cambridge, MA.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola and L. K. Saul 1999. An introduction to variational methods for graphical models. In M. I. Jordan (ed) Learning in Graphical Models, MIT Press, Cambridge, MA.

R. McEliece, D. J. C. MacKay and J. Cheng 1998.
Turbo decoding as an instance of Pearl's belief propagation algorithm. IEEE Journal on Selected Areas in Communications 16:2.

K. P. Murphy, Y. Weiss and M. I. Jordan 1999. Loopy belief propagation for approximate inference: An empirical study. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Francisco, CA.

J. Pearl 1988. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, CA.

S. Russell and P. Norvig 1995. Artificial Intelligence: A Modern Approach. Prentice-Hall.

Y. Weiss and W. T. Freeman 2000. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. In S. A. Solla, T. K. Leen, and K.-R. Müller (eds) Advances in Neural Information Processing Systems 12, MIT Press.
", "award": [], "sourceid": 1859, "authors": [{"given_name": "Brendan", "family_name": "Frey", "institution": null}, {"given_name": "Anitha", "family_name": "Kannan", "institution": null}]}