{"title": "Continuously-adaptive discretization for message-passing algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 737, "page_last": 744, "abstract": "Continuously-Adaptive Discretization for Message-Passing (CAD-MP) is a new message-passing algorithm employing adaptive discretization. Most previous message-passing algorithms approximated arbitrary continuous probability distributions using either: a family of continuous distributions such as the exponential family; a particle-set of discrete samples; or a fixed, uniform discretization. In contrast, CAD-MP uses a discretization that is (i) non-uniform, and (ii) adaptive. The non-uniformity allows CAD-MP to localize interesting features (such as sharp peaks) in the marginal belief distributions with time complexity that scales logarithmically with precision, as opposed to uniform discretization which scales at best linearly. We give a principled method for altering the non-uniform discretization according to information-based measures. CAD-MP is shown in experiments on simulated data to estimate marginal beliefs much more precisely than competing approaches for the same computational expense.", "full_text": "Continuously-adaptive discretization for\n\nmessage-passing algorithms\n\nKannan Achan\n\nMicrosoft Research Silicon Valley\nMountain View, California, USA\n\nMichael Isard\n\nMicrosoft Research Silicon Valley\nMountain View, California, USA\n\nJohn MacCormick\nDickinson College\n\nCarlisle, Pennsylvania, USA\n\nAbstract\n\nContinuously-Adaptive Discretization for Message-Passing (CAD-MP) is a new\nmessage-passing algorithm for approximate inference. Most message-passing al-\ngorithms approximate continuous probability distributions using either: a family\nof continuous distributions such as the exponential family; a particle-set of dis-\ncrete samples; or a \ufb01xed, uniform discretization. 
In contrast, CAD-MP uses a\ndiscretization that is (i) non-uniform, and (ii) adaptive to the structure of the marginal\ndistributions. Non-uniformity allows CAD-MP to localize interesting features\n(such as sharp peaks) in the marginal belief distributions with time complexity that\nscales logarithmically with precision, as opposed to uniform discretization which\nscales at best linearly. We give a principled method for altering the non-uniform\ndiscretization according to information-based measures. CAD-MP is shown in\nexperiments to estimate marginal beliefs much more precisely than competing\napproaches for the same computational expense.\n\n1 Introduction\n\nMessage passing algorithms such as Belief Propagation (BP) [1] exploit factorization to perform\ninference. Exact inference is only possible when the distribution to be inferred can be represented\nby a tree and the model is either linear-Gaussian or fully discrete [2, 3]. One attraction of BP is\nthat algorithms developed for tree-structured models can be applied analogously [4] to models with\nloops, such as Markov Random Fields.\n\nThere is at present no general-purpose approximate algorithm that is suitable for all problems, so\nthe choice of algorithm is governed by the form of the model. Much of the literature concentrates on\nproblems from statistics or control where point measurements are made (e.g. of an animal population\nor a chemical plant temperature), and where the state evolution is non-linear or the process noise\nis non-Gaussian [5, 6]. Some problems, notably those from computer vision, have more complex\nobservation distributions that naturally occur as piecewise-constant functions on a grid (i.e. images),\nand so it is common to discretize the underlying continuous model to match the structure of the\nobservations [7, 8]. As the dimensionality of the state-space increases, a na\u00efve uniform discretization\nrapidly becomes intractable [8]. 
When models are complex functions of the observations, sampling\nmethods such as non-parametric belief propagation (NBP) [9, 10] have been successful.\n\nDistributions of interest can often be represented by a factor graph [11]. \u201cMessage passing\u201d is a\nclass of algorithms for approximating these distributions, in which messages are iteratively updated\nbetween factors and variables. When a given message is to be updated, all other messages in the\ngraph are \ufb01xed and treated as though they were exact. The algorithm proceeds by picking, from\na family of approximate functions, the message that minimizes a divergence to the local \u201cexact\u201d\nmessage. In some forms of the approach [12] this minimization takes place over approximate belief\ndistributions rather than approximate messages.\n\nA general recipe for producing message passing algorithms, summarized by Minka [13], is as\nfollows: (i) pick a family of approximating distributions; (ii) pick a divergence measure to minimize;\n(iii) construct an optimization algorithm to perform this minimization within the approximating\nfamily. This paper makes contributions in all three steps of this recipe, resulting in a new algorithm\ntermed Continuously-Adaptive Discretization for Message-Passing (CAD-MP).\n\nFor step (i), we advocate an approximating family that has received little attention in recent years:\npiecewise-constant probability densities with a bounded number of piecewise-constant regions.\nAlthough others have used this family in the past [14], it has not to our knowledge been employed in a\nmodern message-passing framework. We believe piecewise-constant probability densities are very\nwell suited to some problem domains, and this constitutes the chief contribution of the paper. 
For\nstep (ii), we have chosen for our initial investigation the \u201cinclusive\u201d KL-divergence [13]\u2014a\nstandard choice which leads to the well known Belief Propagation message update equations. We show\nthat for a special class of piecewise-constant probability densities (the so-called naturally-weighted\ndensities), the minimal divergence is achieved by a distribution of minimum entropy, leading to\nan intuitive and easily-implemented algorithm. For step (iii), we employ a greedy optimization\nby traversing axis-aligned binary-split kd-trees (explained in Section 3). The contribution here is an\nef\ufb01cient algorithm called \u201cinformed splitting\u201d for performing the necessary optimization in practice.\n\nAs we show in Section 4, CAD-MP computes much more accurate approximations than competing\napproaches for a given computational budget.\n\n2 Discretizing a factor graph\n\nLet us consider what it means to discretize an inference problem represented by a factor graph with\nfactors $f_i$ and continuous variables $x_\\alpha$ taking values in some subset of $\\mathbb{R}^N$. One constructs a\nnon-uniform discretization of the factor graph by partitioning the state space of each variable $x_\\alpha$ into\n$K$ regions $H_\\alpha^k$ for $k = 1, \\ldots, K$. This discretization induces a discrete approximation $f'_i$ of the\nfactors, which are now regarded as functions of discrete variables $x'_\\alpha$ taking integer values in the set\n$\\{1, 2, \\ldots, K\\}$:\n\n$$f'_i(k, l, \\ldots) = \\int_{x_\\alpha \\in H_\\alpha^k,\\, x_\\beta \\in H_\\beta^l,\\, \\ldots} f_i(x_\\alpha, x_\\beta, \\ldots), \\qquad (1)$$\n\nfor $k, l, \\ldots = 1, \\ldots, K$. 
A slight variant of BP [4] could then be used to infer the marginals on $x'_\\alpha$\naccording to the update equations for messages $m$ and beliefs $b$:\n\n$$m_{\\alpha,i}(k) = \\prod_{f'_j \\sim x'_\\alpha \\setminus f'_i} m_{j,\\alpha}(k), \\qquad (2)$$\n\n$$m_{i,\\alpha}(k) = \\frac{1}{|H_\\alpha^k|} \\sum_{x' \\mid x'_\\alpha = k} f'_i(x') \\prod_{x'_\\beta \\sim f'_i \\setminus x'_\\alpha} m_{\\beta,i}(x'_\\beta), \\qquad (3)$$\n\n$$b_\\alpha(k) = |H_\\alpha^k| \\prod_{f'_j \\sim x'_\\alpha} m_{j,\\alpha}(k), \\qquad (4)$$\n\nwhere $a \\sim b \\setminus c$ means \u201call neighbors $a$ of $b$ except $c$\u201d, $x'$ is an assignment of values to all variables,\nand $|H_\\alpha^k| = \\int_{H_\\alpha^k} 1$. Thus, given a factor graph of continuous variables and a particular choice of\ndiscretization $\\{H_\\alpha^k\\}$, one gets a piecewise-constant approximation to the marginals by \ufb01rst discretizing\nthe variables according to (1), then using BP according to (2)\u2013(4). The error in the approximation\nto the true marginals arises from (3) when $f'_i(x)$ is not constant over $x$ in the given partition.\n\nConsider the task of selecting between discretizations of a continuous probability distribution $p(x)$\nover some subset $U$ of Euclidean space. A discretization of $p$ consists in partitioning $U$ into $K$\ndisjoint subsets $V_1, \\ldots, V_K$ and assigning a weight $w_k$ to each $V_k$, with $\\sum_k w_k = 1$. The corresponding\ndiscretized probability distribution $q(x)$ assigns density $w_k/|V_k|$ to $V_k$. We are interested\nin \ufb01nding a discretization for which the KL divergence $\\mathrm{KL}(p \\| q)$ is as small as possible. 
The optimal choice of the $w_k$ for any \ufb01xed partitioning $V_1, \\ldots, V_K$ is to take $w_k = \\int_{x \\in V_k} p(x)$ [14]; we call\nthese the natural weights for $p(x)$, given the $V_k$. There is a simple relationship between the quality\nof a naturally-weighted discretization and its entropy $H(\\cdot)$:\n\nTheorem 1. Among any collection of naturally-weighted discretizations of $p(x)$, the minimum KL\ndivergence to $p(x)$ is achieved by a discretization of minimal entropy.\n\nProof. For a naturally-weighted discretization $q$, $\\mathrm{KL}(p \\| q) = -\\sum_{k=1}^{K} w_k \\log \\frac{w_k}{|V_k|} + \\int_U p \\log p = H(q) - H(p)$.\n$H(p)$ is constant, so $\\mathrm{KL}(p \\| q)$ is minimized by minimizing $H(q)$. \u25a1\n\nFigure 1: Expanding a hypercube in two dimensions. Hypercube $H$ (b), a subset of the full\nstate space (a), is \ufb01rst \u201cexpanded\u201d into the sub-cubes $\\{H^{--}, H^{+-}, H^{-+}, H^{++}\\}$ (c) by splitting\nalong each possible dimension. These sub-cubes are then re-combined to form two possible split\ncandidates $\\{H^{1-}, H^{1+}\\}$ (d) and $\\{H^{2-}, H^{2+}\\}$ (e). Informed belief values are computed for the\nre-combined hypercubes, including a new estimate for $\\hat{b}(H)$ (f), by summing the beliefs in the\n\ufb01ner-scale partitioning. The new estimates are more accurate since the error introduced by the\ndiscretization decreases as the partitions become smaller.\n\nSuppose we are given a discretization $\\{H_\\alpha^k\\}$ and have computed messages and beliefs for every\nnode using (2)\u2013(4). The messages have not necessarily reached a \ufb01xed point, but we nevertheless\nhave some current estimate for them. 
For any arbitrary hypercube $H$ at $x_\\alpha$ (not necessarily in its\ncurrent discretization) we can de\ufb01ne the informed belief, denoted $\\hat{b}(H)$, to be the belief $H$ would\nreceive if all other nodes and their incoming messages were left unaltered. To compute the informed\nbelief, one \ufb01rst computes new discrete factor function values involving $H$ using integrals like (1).\nThese values are fed into (2), (3) to produce \u201cinformed\u201d messages $m_{i,\\alpha}(H)$ arriving at $x_\\alpha$ from each\nneighbor $f_i$. Finally, the informed messages are fed into (4) to obtain the informed belief $\\hat{b}(H)$.\n\n3 Continuously-adaptive discretization\n\nThe core of the CAD-MP algorithm is the procedure for passing a message to a variable $x_\\alpha$. Given\n\ufb01xed approximations at every other node, any discretization of $\\alpha$ induces an approximate belief\ndistribution $q_\\alpha(x_\\alpha)$. The task of the algorithm is to select the best discretization, and as Theorem 1\nshows, a good strategy for this selection is to look for a naturally-weighted discretization that\nminimizes the entropy of $q_\\alpha$. We achieve this using a new algorithm called \u201cinformed splitting\u201d which\nis described next.\n\nCAD-MP employs an axis-aligned binary-split kd-tree [15] to represent the discrete partitioning of\na D-dimensional continuous state space at each variable (the same representation was used in [14]\nwhere it was called a Binary Split Partitioning). For our purposes, a kd-tree is a binary tree in which\neach vertex is assigned a subset\u2014actually a hypercube\u2014of the state space. The root is assigned the\nwhole space, and any internal vertex splits its hypercube equally between its two children using an\naxis-aligned plane. 
The subsets assigned to all leaves partition the state space into hypercubes.\n\nWe build the kd-tree greedily by recursively splitting leaf vertices: at each step we must choose\na hypercube $H_\\alpha^k$ in the current partitioning to split, and a dimension $d$ along which to split it. According to\nTheorem 1, we should choose $k$ and $d$ to minimize the entropy of the resulting discretization\u2014\nprovided that this discretization has \u201cnatural\u201d weights. In practice, the natural weights are estimated\nusing informed beliefs; we nevertheless proceed as though they were exact and choose the $k$- and\n$d$-values leading to lowest entropy. A subroutine of the algorithm involves \u201cexpanding\u201d a hypercube\ninto sub-cubes as illustrated in the two-dimensional case in Figure 1. The expansion procedure\ngeneralizes to $D$ dimensions by \ufb01rst expanding to $2^D$ sub-cubes and then re-combining these into\n$2D$ candidate splits. Note that for all $d \\in \\{1, \\ldots, D\\}$\n\n$$\\hat{b}(H) \\equiv \\hat{b}(H^{d-}) + \\hat{b}(H^{d+}). \\qquad (5)$$\n\nOnce we have expanded each hypercube in the current partitioning and thereby computed values for\n$\\hat{b}(H_\\alpha^k)$, $\\hat{b}(H_\\alpha^{k,d-})$ and $\\hat{b}(H_\\alpha^{k,d+})$ for all $k$ and $d$, we choose $k$ and $d$ to minimize the \u201csplit entropy\u201d\n\n$$\\gamma_\\alpha(k, d) = -\\sum_{i \\neq k} \\hat{b}(H_\\alpha^i) \\log \\frac{\\hat{b}(H_\\alpha^i)}{|H_\\alpha^i|} - \\hat{b}(H_\\alpha^{k,d-}) \\log \\frac{\\hat{b}(H_\\alpha^{k,d-})}{|H_\\alpha^{k,d-}|} - \\hat{b}(H_\\alpha^{k,d+}) \\log \\frac{\\hat{b}(H_\\alpha^{k,d+})}{|H_\\alpha^{k,d+}|}. 
\\qquad (6)$$\n\nNote that from (5) we can perform this minimization without normalizing the $\\hat{b}(\\cdot)$.\n\nWe can now describe the CAD-MP algorithm using informed splitting, which re-partitions a\nvariable of the factor graph by producing a new kd-tree whose leaves are the hypercubes in the new\npartitioning:\n\n1. 
Initialize the root vertex of the kd-tree with its associated hypercube being the whole state\nspace, with belief 1. Add this root to a leaf set $L$ and \u201cexpand\u201d it as shown in Figure 1.\n\n2. While the number of leaves $|L|$ is less than the desired number of partitions in the\ndiscretized model:\n(a) Pick the leaf $H$ and split dimension $d$ that minimize the split-entropy (6).\n(b) Create two new vertices $H^-$ and $H^+$ by splitting $H$ along dimension $d$, and \u201cexpand\u201d\nthese new vertices.\n(c) Remove $H$ from $L$, and add $H^-$ and $H^+$ to $L$.\n\nAll variables in the factor graph are initialized with the trivial discretization (a single partition).\nVariables can be visited according to any standard message-passing schedule, where a \u201cvisit\u201d consists\nof repartitioning according to the above algorithm. A simple example showing the evolution of the\nbelief at one variable is shown in Figure 2.\n\nIf the variable being repartitioned has $T$ neighbors and we require a partitioning of $K$ hypercubes,\nthen a straightforward implementation of this algorithm requires the computation of $2K \\times 2^D \\times KT$\nmessage components. Roughly speaking, then, informed splitting pays a factor of $2^{D+1}$ over\nBP which must compute $K^2 T$ message components. But CAD-MP trades this for an exponential\nfactor in $K$ since it can home in on interesting areas of the state space using binary search, so if\nBP requires $K$ partitions for a given level of accuracy, CAD-MP (empirically) achieves the same\naccuracy with only $O(\\log K)$ partitions. Note that in special cases, including some low-level vision\napplications [16], classical BP can be performed in $O(KT)$ time and space; however, this is still\nprohibitive for large $K$.\n\n4 Experiments\n\nWe would like to compare our candidate algorithms against the marginal belief distributions that\nwould be computed by exact inference; however, no exact inference algorithm is known for our\nmodels. 
Instead, for each experiment we construct a \ufb01ne-scale uniform discretization $D_f$ of the\nmodel and input data, and compute the marginal belief distributions $p(x_\\alpha; D_f)$ at each variable\n$x_\\alpha$ using the standard forward-backward BP algorithm. Given a candidate approximation $C$ we\ncan then compare the marginals $p(x_\\alpha; C)$ under that approximation to the \ufb01ne-scale discretization\nby computing the KL-divergence $\\mathrm{KL}(p(x_\\alpha; D_f) \\| p(x_\\alpha; C))$ at each variable. In results below, we\nreport the mean of this divergence across all variables in the graph, and refer to it in the text as $\\mu(C)$.\nWhile a \u201c\ufb01ne-enough\u201d uniform discretization will tend to the true marginals, we do not a priori\nknow how \ufb01ne that is. We therefore construct a sequence of coarser uniform discretizations $D_c^i$ of\nthe same model and data, and compute $\\mu(D_c^i)$ for each of them. If $\\mu(D_c^i)$ is converging rapidly\nenough to zero, as is the case in the experiments below, we have con\ufb01dence that the \ufb01ne-scale\ndiscretization is a good approximation to the exact marginals.\n\n[Figure 2 panels, left to right: Observation (local factor); (a); (b); (c)]\n\nFigure 2: Evolution of discretization at a single variable. The left image is the local\n(single-variable) factor at the \ufb01rst node in a simple chain MRF whose nodes have 2-D state spaces. The\nnext three images, from left to right, show the evolution of the informed belief. Initially (a) the\npartitioning is informed simply by the local factor, but after messages have been passed once along the\nchain and back (b), the posterior marginal estimate has shifted and the discretization has adapted\naccordingly. Subsequent iterations over the chain (c) do not substantially alter the estimated marginal\nbelief. 
For this toy example only 16 partitions are used, and the normalized log of the belief is\ndisplayed to make the structure of the distribution more apparent.\n\nWe compare our adaptive discretization algorithm against non-parametric belief propagation\n(NBP) [9, 10] which represents the marginal distribution at a variable by a particle set. We generate\nsome importance samples directly from the observation distribution, both to initialize the algorithm\nand to \u201cre-seed\u201d the particle set when it gets lost. Particle sets typically do not approximate the tails\nof a distribution well, leading to zeros in the approximate marginals and divergences that tend to\nin\ufb01nity. We therefore regularize all divergence computations as follows:\n\n$$\\mathrm{KL}^*(p \\| q) = \\sum_k p_k^* \\log\\left(\\frac{p_k^*}{q_k^*}\\right), \\quad p_k^* = \\frac{\\epsilon + \\int_{H^k} p(x)}{\\sum_n \\left(\\epsilon + \\int_{H^n} p(x)\\right)}, \\quad q_k^* = \\frac{\\epsilon + \\int_{H^k} q(x)}{\\sum_n \\left(\\epsilon + \\int_{H^n} q(x)\\right)}, \\qquad (7)$$\n\nwhere $\\{H^k\\}$ are the partitions in the \ufb01ne-scale discretization $D_f$. All experiments use $\\epsilon = 10^{-4}$\nwhich was found empirically to show good results for NBP.\n\nWe begin with a set of experiments over ten randomly generated input sequences of a\none-dimensional target moving through structured clutter of similar-looking distractors. One of the\nsequences is shown in Figure 3a, where time goes from bottom to top. The measurement at a\ntime-step consists in 240 \u201cpixels\u201d (piecewise-constant regions of uniform width) generated by simulating\na small one-dimensional target in clutter, with additive Gaussian shot-noise. There are stationary\nclutter distractors, and also periodic \u201cforkings\u201d where a moving clutter distractor emerges from the\ntarget and proceeds for a few time-steps before disappearing. Each sequence contains 256\ntime-steps, and the \u201cexact\u201d marginals (Figure 3b) are computed using standard discrete BP with 15360\nstates per time-step. 
The modes of the marginals generated by all the experiments are similar to\nthose in Figure 3b, except for one run of NBP shown in Figure 3c that failed entirely to \ufb01nd the\nmode (red line) due to an unlucky random seed. However, the distributions differ in \ufb01ne structure,\nwhere CAD-MP approximates the tails of the distribution much better than NBP.\n\nFigure 4a shows the divergences \u00b5(\u00b7) for the various discrete algorithms: both uniform discretization\nat various degrees of coarseness, and adaptive discretization using CAD-MP with varying numbers\nof partitions. Each data point shows the mean divergence \u00b5(\u00b7) for one of the ten simulated one-\ndimensional datasets. As the number of adaptive partitions increases, the variance of \u00b5(\u00b7) across\ntrials increases, but the divergence stays small. Higher divergences in CAD-MP trials correspond\nto a mis-estimation of the tails of the marginal belief at a few time-steps. The straight line on\nthe log/log plot for the uniform discretizations gives us con\ufb01dence that the \ufb01ne-scale discretization\nis a close approximation to the exact beliefs. The adaptive discretization provides a very faithful\napproximation to this \u201cexact\u201d distribution with vastly fewer partitions.\n\nFigure 4b shows the divergences for the same ten one-dimensional trial sequences when the\nmarginals are computed using NBP with varying numbers of particles. The NBP algorithm was\nrun \ufb01ve times on each of the ten simulated one-dimensional datasets with different random seeds\neach time, and the particle-set sizes were chosen to approximately match the computation time of\nthe CAD-MP algorithm. 
The NBP algorithm does worse absolutely (the divergences are much larger\neven after regularization, indicating that areas of high belief are sometimes mis-estimated), and also\nvaries greatly across different trial sequences, and when re-run with different random seeds on the\nsame trial sequence. Note also that the \u00b5(\u00b7) are bi-modal\u2014values of \u00b5(\u00b7) above around 0.5 signify\nruns on which NBP incorrectly located the mode of the marginal belief distribution at some or all\ntime-steps, as in Figure 3c.\n\n[Figure 3 panels: (a) Observations; (b) \u201cExact\u201d beliefs; (c) an NBP \u201cfailure\u201d; (d)\u2013(g) exact beliefs (d) are represented more faithfully by CAD-MP (e), (f) than NBP (g)]\n\nFigure 3: One of the one-dimensional test sequences. The region of the white rectangle in (b) is\nexpanded in (d)\u2013(g), with beliefs now plotted on log intensity scale to expand their dynamic range.\nCAD-MP using only 16 partitions per time-step (e) already produces a faithful approximation to the\nexact belief (d), and increasing to 128 partitions (f) \ufb01lls in more details. The NBP algorithm using\n800 particles (g) does not approximate the tails of the distribution well.\n\n[Figure 4 panels: (a) 1D test, discrete algorithms; (b) 1D test, NBP; (c) 2D test, discrete algorithms; (d) 2D test, NBP]\n\nFigure 4: Adaptive discretization achieves the same accuracy as uniform discretization using\nmany fewer partitions, but non-parametric belief propagation is less effective. See Section 4\nfor details.\n\nWe performed a similar set of experiments using a simulated two-dimensional data-set. This time\nthe input data is a 64 \u00d7 64 image grid, and the \u201cexact\u201d \ufb01ne-scale discretization is at a resolution\nof 512 \u00d7 512 giving 262144 discrete states in total. Figures 4c and 4d show that adaptive\ndiscretization still greatly outperforms NBP for an equivalent computational cost. Again there is a\nstraight-line trend in the log/log plots for both CAD-MP and uniform discretization, though as in\nthe one-dimensional case the variance of the divergences increases with more partitions. NBP again\nperforms less accurately, and frequently fails to \ufb01nd the high-weight regions of the belief at all at\nsome time-steps, even with 3200 particles.\n\nAdaptive discretization seems to correct some of the well-known limitations of particle-based\nmethods. 
The discrete distribution is able to represent probability mass well into the tails of the distri-\nbution, which leads to a more faithful approximation to the exact beliefs. This also prevents the\ncatastrophic failure case for NBP shown in Figure 3c, where the mode of the distribution is lost\nentirely because no particles were placed nearby. Moreover, CAD-MP\u2019s computational complexity\nscales linearly with the number of incoming messages at a factor. NBP has to resort to heuristics to\nsample from the product of incoming messages once the number of messages is greater than two.\n\n5 Related work\n\nThe work most closely related to CAD-MP is the 1997 algorithm of Kozlov and Koller [14]. We\nrefer to this algorithm as \u201cKK97\u201d; its main differences to CAD-MP are: (i) KK97 is described in a\njunction tree setting and computes the marginal posterior of just the root node, whereas CAD-MP\ncomputes beliefs everywhere in the graph; (ii) KK97 discretizes messages (on junction tree edges)\nrather than variables (in a factor graph), so multiplying incoming messages together requires the\nsubstantial additional complexity of merging disparate discretizations, compared to CAD-MP in\nwhich the incoming messages share the same discretization. Difference (i) is the more serious, since\nit renders KK97 inapplicable to the type of early-vision problem we are motivated by, where the\nmarginal at every variable must be estimated.\n\nCoarse-to-\ufb01ne techniques can speed up the convergence of loopy BP [16] but this does not address\nthe discrete state-space explosion. One can also prune the state space based on local evidence [17,\n18]. However, this approach is unsuitable when the data function has high entropy; moreover, it is\nvery dif\ufb01cult to bring a state back into the model once it has been pruned.\n\nAnother interesting approach is to retain the uniform discretization, but enforce sparsity on messages\nto reduce computational cost. 
This was done in both [19] (in which messages are approximated using\na mixture of delta functions, which in practice results in retaining the K largest message\ncomponents) and [20] (which uses an additional uniform distribution in the approximating\ndistribution to ensure non-zero weights for all states in the discretization). However, these approaches\nappear to suffer when multiplying messages with disjoint peaks whose tails have been truncated to\nenforce sparsity: such peaks are unable to fuse their evidence correctly. Also, [20] is not directly\napplicable when the state-space is multi-dimensional.\n\nExpectation Propagation [5] is a highly effective algorithm for inference in continuous-valued\nnetworks, but is not valid for densities that are multimodal mixtures.\n\n6 Discussion\n\nWe have demonstrated that our new algorithm, CAD-MP, performs accurate approximate\ninference with complex, multi-modal observation distributions and corresponding multi-modal posterior\ndistributions. It substantially outperforms the two standard methods for inference in this setting:\nuniform-discretization and non-parametric belief propagation. While we only report results here on\nsimulated data, we have successfully used the method on low-level vision problems and are\npreparing a companion publication to describe these results. We believe CAD-MP and variants on it may\nbe applicable to other domains where complex distributions must be estimated in spaces of low to\nmoderate dimension. The main challenge in applying the technique to an arbitrary factor graph is\nthe tractability of the de\ufb01nite integrals (1).\n\nThis paper describes a particular set of engineering choices motivated by our problem domain. We\nuse kd-trees to describe partitionings: other data structures could certainly be used. 
Also, we employ\na greedy heuristic to select a partitioning with low entropy rather than exhaustively computing a\nminimum entropy over some family of discretizations. We have experimented with a Metropolis\nalgorithm to augment this greedy search: a Metropolis move consists in \u201ccollapsing\u201d some sub-tree\nof the current partitioning and then re-expanding using a randomized form of the minimum-entropy\ncriterion. We have also tried tree-search heuristics that do not need the $O(2^D)$ \u201cexpansion\u201d step,\nand thus may be more effective when $D$ is large. The choices reported here seem to give the best\naccuracy on our problems for a given computational budget; however, many others are possible and\nwe hope this work will serve as a starting point for a renewed interest in adaptive discretization in a\nvariety of inference settings.\n\nReferences\n\n[1] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.\n\n[2] P. Dagum and M. Luby. Approximating probabilistic inference in bayesian belief networks is NP-hard. Arti\ufb01cial Intelligence, 60(1):141\u2013153, 1993.\n\n[3] Robert G. Cowell, A. Philip Dawid, Steffen L. Lauritzen, and David J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, 1999.\n\n[4] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Generalized belief propagation. In NIPS, pages 689\u2013695, 2000.\n\n[5] T. Minka. Expectation propagation for approximate bayesian inference. In Proc. UAI, pages 362\u2013369, 2001.\n\n[6] G. Kitagawa. The two-\ufb01lter formula for smoothing and an implementation of the gaussian-sum smoother. Ann. Inst. Statist. Math., 46(4):605\u2013623, 1994.\n\n[7] P.F. Felzenszwalb and D.P. Huttenlocher. Ef\ufb01cient belief propagation for early vision. In Proc. CVPR, 2004.\n\n[8] M. Isard and J. MacCormick. 
Dense motion and disparity estimation via loopy belief propagation. In ACCV, pages 32\u201341, 2006.\n\n[9] E. Sudderth, A. Ihler, W. Freeman, and A. Willsky. Nonparametric belief propagation. In Proc. CVPR, volume 1, pages 605\u2013612, 2003.\n\n[10] M. Isard. Pampas: Real-valued graphical models for computer vision. In Proc. CVPR, volume 1, pages 613\u2013620, 2003.\n\n[11] F.R. Kschischang, B.J. Frey, and H.A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498\u2013519, 2001.\n\n[12] O. Zoeter and H. Heskes. Deterministic approximate inference techniques for conditionally gaussian state space models. Statistics and Computing, 16(3):279\u2013292, 2006.\n\n[13] T. Minka. Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research, 2005.\n\n[14] Alexander V. Kozlov and Daphne Koller. Nonuniform dynamic discretization in hybrid networks. In Proc. UAI, pages 314\u2013325, 1997.\n\n[15] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, 18(9):509\u2013517, 1975.\n\n[16] P.F. Felzenszwalb and D.P. Huttenlocher. Pictorial structures for object recognition. Int. J. Computer Vision, 61(1):55\u201379, 2005.\n\n[17] J. Coughlan and S. Ferreira. Finding deformable shapes using loopy belief propagation. In Proc. ECCV, pages 453\u2013468, 2002.\n\n[18] J. Coughlan and H. Shen. Shape matching with belief propagation: Using dynamic quantization to accommodate occlusion and clutter. In Proc. Workshop on Generative-Model Based Vision, 2004.\n\n[19] C. Pal, C. Sutton, and A. McCallum. Sparse forward-backward using minimum divergence beams for fast training of conditional random \ufb01elds. In International Conference on Acoustics, Speech, and Signal Processing, 2006.\n\n[20] J. Lasserre, A. Kannan, and J. Winn. Hybrid learning of large jigsaws. In Proc. 
CVPR, 2007.\n", "award": [], "sourceid": 160, "authors": [{"given_name": "Michael", "family_name": "Isard", "institution": null}, {"given_name": "John", "family_name": "MacCormick", "institution": null}, {"given_name": "Kannan", "family_name": "Achan", "institution": null}]}