{"title": "Efficient Multiscale Sampling from Products of Gaussian Mixtures", "book": "Advances in Neural Information Processing Systems", "page_first": 1, "page_last": 8, "abstract": "", "full_text": "Ef\ufb01cient Multiscale Sampling from\n\nProducts of Gaussian Mixtures\n\nAlexander T. Ihler, Erik B. Sudderth, William T. Freeman, and Alan S. Willsky\n\nihler@mit.edu, esuddert@mit.edu, billf@ai.mit.edu, willsky@mit.edu\n\nDepartment of Electrical Engineering and Computer Science\n\nMassachusetts Institute of Technology\n\nAbstract\n\nThe problem of approximating the product of several Gaussian mixture\ndistributions arises in a number of contexts, including the nonparametric\nbelief propagation (NBP) inference algorithm and the training of prod-\nuct of experts models. This paper develops two multiscale algorithms\nfor sampling from a product of Gaussian mixtures, and compares their\nperformance to existing methods. The \ufb01rst is a multiscale variant of pre-\nviously proposed Monte Carlo techniques, with comparable theoretical\nguarantees but improved empirical convergence rates. The second makes\nuse of approximate kernel density evaluation methods to construct a fast\napproximate sampler, which is guaranteed to sample points to within a\ntunable parameter (cid:15) of their true probability. We compare both multi-\nscale samplers on a set of computational examples motivated by NBP,\ndemonstrating signi\ufb01cant improvements over existing methods.\n\nIntroduction\n\n1\nGaussian mixture densities are widely used to model complex, multimodal relationships.\nAlthough they are most commonly associated with parameter estimation procedures like\nthe EM algorithm, kernel or Parzen window nonparametric density estimates [1] also take\nthis form for Gaussian kernel functions. Products of Gaussian mixtures naturally arise\nwhenever multiple sources of statistical information, each of which is individually mod-\neled by a mixture density, are combined. 
For example, given two independent observations y1, y2 of an unknown variable x, the joint likelihood p(y1, y2 | x) ∝ p(y1 | x) p(y2 | x) is equal to the product of the marginal likelihoods. In a recently proposed nonparametric belief propagation (NBP) [2, 3] inference algorithm for graphical models, Gaussian mixture products are the mechanism by which nodes fuse information from different parts of the graph. Product densities also arise in the product of experts (PoE) [4] framework, in which complex densities are modeled as the product of many \u201clocal\u201d constraint densities.\n\nThe primary difficulty associated with products of Gaussian mixtures is computational. The product of d mixtures of N Gaussians is itself a Gaussian mixture with N^d components. In many practical applications, it is infeasible to explicitly construct these components, and therefore intractable to build a smaller approximating mixture using the EM algorithm. Mixture products are thus typically approximated by drawing samples from the product density. These samples can be used to either form a Monte Carlo estimate of a desired expectation [4], or construct a kernel density estimate approximating the true product [2]. Although exact sampling requires exponential cost, Gibbs sampling algorithms may often be used to produce good approximate samples [2, 4].\n\nWhen accurate approximations are required, existing methods for sampling from products of Gaussian mixtures often require a large computational cost. In particular, sampling is the primary computational burden for both NBP and PoE. This paper develops a pair of new sampling algorithms which use multiscale, KD-tree [5] representations to improve accuracy and reduce computation. The first is a multiscale variant of existing Gibbs samplers [2, 4] with improved empirical convergence rate. 
The second makes use of approximate kernel density evaluation methods [6] to construct a fast ε-exact sampler which, in contrast with existing methods, is guaranteed to sample points to within a tunable parameter ε of their true probability. Following our presentation of the algorithms, we demonstrate their performance on a set of computational examples motivated by NBP and PoE.\n\n2 Products of Gaussian Mixtures\n\nLet {p1(x), ..., pd(x)} denote a set of d mixtures of N Gaussian densities, where\n\np_i(x) = Σ_{l_i} w_{l_i} N(x; μ_{l_i}, Λ_i)    (1)\n\nHere, l_i are a set of labels for the N mixture components in p_i(x), w_{l_i} are the normalized component weights, and N(x; μ_{l_i}, Λ_i) denotes a normalized Gaussian density with mean μ_{l_i} and diagonal covariance Λ_i. For simplicity, we assume that all mixtures are of equal size N, and that the variances Λ_i are uniform within each mixture, although the algorithms which follow may be readily extended to problems where this is not the case. Our goal is to efficiently sample from the N^d component mixture density p(x) ∝ Π_{i=1}^d p_i(x).\n\n2.1 Exact Sampling\n\nSampling from the product density can be decomposed into two steps: randomly select one of the product density\u2019s N^d components, and then draw a sample from the corresponding Gaussian. Let each product density component be labeled as L = [l_1, ..., l_d], where l_i labels one of the N components of p_i(x).1 The relative weight of component L is given by\n\nw_L = ( Π_{i=1}^d w_{l_i} N(x; μ_{l_i}, Λ_i) ) / N(x; μ_L, Λ_L)    where    Λ_L^{-1} = Σ_{i=1}^d Λ_i^{-1},    Λ_L^{-1} μ_L = Σ_{i=1}^d Λ_i^{-1} μ_{l_i}    (2)\n\nwhere μ_L, Λ_L are the mean and variance of product component L, and this equation may be evaluated at any x (the value x = μ_L may be numerically convenient). 
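The combination rule in equation (2) has a direct implementation for diagonal covariances. The sketch below (plain NumPy; the function name and calling convention are our own, not from the paper) computes the product component's mean, variance, and unnormalized weight w_L, evaluating the weight at x = μ_L as the text suggests:

```python
import numpy as np

def product_component(mus, variances, weights):
    """Combine d Gaussian kernels N(x; mu_i, Lambda_i), given as mean vectors
    and diagonal-covariance variance vectors, into the single Gaussian of
    their product.  Returns (mu_L, Lambda_L, w_L) per equation (2)."""
    mus = np.asarray(mus, dtype=float)              # shape (d, dim)
    variances = np.asarray(variances, dtype=float)  # shape (d, dim)
    # Lambda_L^{-1} = sum_i Lambda_i^{-1};  Lambda_L^{-1} mu_L = sum_i Lambda_i^{-1} mu_i
    prec = (1.0 / variances).sum(axis=0)
    mu_L = (mus / variances).sum(axis=0) / prec
    var_L = 1.0 / prec

    def normal(x, mu, var):
        # diagonal-covariance Gaussian density, evaluated at a single point
        return np.prod(np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var))

    # w_L = (prod_i w_i N(x; mu_i, Lambda_i)) / N(x; mu_L, Lambda_L), any x
    x = mu_L
    w_L = (np.prod(weights)
           * np.prod([normal(x, m, v) for m, v in zip(mus, variances)])
           / normal(x, mu_L, var_L))
    return mu_L, var_L, w_L
```

For two unit-variance 1-D kernels at 0 and 2, for instance, this yields the familiar result μ_L = 1, Λ_L = 1/2, and w_L = N(μ_1; μ_2, Λ_1 + Λ_2).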
To form the product density, these weights are normalized by the weight partition function Z ≜ Σ_L w_L. Determining Z exactly takes O(N^d) time, and given this constant we can draw N samples from the distribution in O(N^d) time and O(N) storage. This is done by drawing and sorting N uniform random variables on the interval [0, 1], and then computing the cumulative distribution of p(L) = w_L / Z to determine which, if any, samples are drawn from each L.\n\n2.2 Importance Sampling\n\nImportance sampling is a Monte Carlo method for approximately sampling from (or computing expectations of) an intractable distribution p(x), using a proposal distribution q(x) for which sampling is feasible [7]. To draw N samples from p(x), an importance sampler draws M ≥ N samples x_i ∼ q(x), and assigns the ith sample weight w_i ∝ p(x_i)/q(x_i). The weights are then normalized by Z = Σ_i w_i, and N samples are drawn (with replacement) from the discrete distribution p̄(x_i) = w_i / Z.\n\n1 Throughout this paper, we use lowercase letters (l_i) to label input density components, and capital letters (L = [l_1, ..., l_d]) to label the corresponding product density components.\n\n[Figure 1: Two possible Gibbs samplers for a product of 2 mixtures of 5 Gaussians. Arrows show the weights assigned to each label. Top left (sequential Gibbs sampler): At each iteration, one label is sampled conditioned on the other density\u2019s current label. Bottom left (parallel Gibbs sampler): Alternate between sampling a data point X conditioned on the current labels, and resampling all labels in parallel. Right: After κ iterations, both Gibbs samplers identify mixture labels corresponding to a single kernel (solid) in the product density (dashed).]\n\nFor products of Gaussian mixtures, we consider two different proposal distributions. 
The first, which we refer to as mixture importance sampling, draws each sample by randomly selecting one of the d input mixtures, and sampling from its N components (q(x) = p_i(x)). The remaining d − 1 mixtures then provide the importance weight (w_i = Π_{j≠i} p_j(x_i)). This is similar to the method used to combine density trees in [8]. Alternatively, we can approximate each input mixture p_i(x) by a single Gaussian density q_i(x), and choose q(x) ∝ Π_i q_i(x). We call this procedure Gaussian importance sampling.\n\n2.3 Gibbs Sampling\n\nSampling from Gaussian mixture products is difficult because the joint distribution over product density labels, as defined by equation (2), is complicated. However, conditioned on the labels of all but one mixture, we can compute the conditional distribution over the remaining label in O(N) operations, and easily sample from it. Thus, we may use a Gibbs sampler [9] to draw asymptotically unbiased samples, as illustrated in Figure 1. At each iteration, the labels {l_j}_{j≠i} for d − 1 of the input mixtures are fixed, and the ith label is sampled from the corresponding conditional density. The newly chosen l_i is then fixed, and another label is updated. After a fixed number of iterations κ, a single sample is drawn from the product mixture component identified by the final labels. To draw N samples, the Gibbs sampler requires O(dκN^2) operations; see [2] for further details.\n\nThe previously described sequential Gibbs sampler defines an iteration over the labels of the input mixtures. Another possibility uses the fact that, given a data point x̄ in the product density space, the d input mixture labels are conditionally independent [4]. 
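As a concrete illustration of the sequential sampler just described, consider the following minimal sketch (1-D kernels, a common variance per mixture as assumed above; all names are our own, the conditionals are computed by brute force, and none of the caching of [2] is attempted). Each conditional weight follows from integrating out x: w(l_i) ∝ w_{l_i} N(μ_{l_i}; m, Λ + V), where m, V describe the product of the other d − 1 fixed kernels.

```python
import numpy as np

def gibbs_sample_product(mus, var, weights, iters, rng):
    """Sequential Gibbs sampler over product-density labels.
    mus: list of d arrays of 1-D kernel means (N per input mixture);
    var: common kernel variance; weights: list of d weight arrays.
    Returns one sample from the (approximate) product density."""
    d = len(mus)
    labels = [rng.integers(len(m)) for m in mus]
    for _ in range(iters):
        for i in range(d):
            # Gaussian (mean, precision) of the product of the other d-1 fixed kernels
            prec = (d - 1) / var
            mean = sum(mus[j][labels[j]] for j in range(d) if j != i) / (d - 1)
            # conditional weight of each candidate label l_i: O(N) work
            cand = mus[i]
            logw = np.log(weights[i]) - 0.5 * (cand - mean) ** 2 / (var + 1.0 / prec)
            p = np.exp(logw - logw.max())
            labels[i] = rng.choice(len(cand), p=p / p.sum())
    # final draw from the product component selected by the labels
    mean = sum(mus[j][labels[j]] for j in range(d)) / d
    return rng.normal(mean, np.sqrt(var / d))
```

With equal variances, the selected product component has mean equal to the average of the chosen kernel means and variance var/d, which the final line exploits.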
Thus, one can define a parallel Gibbs sampler which alternates between sampling a data point conditioned on the current input mixture labels, and parallel sampling of the mixture labels given the current data point (see Figure 1). The complexity of this sampler is also O(dκN^2).\n\n3 KD-Trees\n\nA KD-tree is a hierarchical representation of a point set which caches statistics of subsets of the data, thereby making later computations more efficient [5]. KD-trees are typically binary trees constructed by successively splitting the data along cardinal axes, grouping points by spatial location. We use the variable l to denote the label of a leaf node (the index of a single point), and ℓ to denote a set of leaf labels summarized at a node of the KD-tree.\n\n[Figure 2: Two KD-tree representations of the same one-dimensional point set. (a) Each node maintains a bounding box (label sets ℓ are shown in braces). (b) Each node maintains mean and variance statistics.]\n\nFigure 2 illustrates one-dimensional KD-trees which cache different sets of statistics. The first (Figure 2(a)) maintains bounding boxes around the data, allowing efficient computation of distances; similar trees are used in Section 4.2. Also shown in this figure are the label sets ℓ for each node. 
The second (Figure 2(b)) precomputes means and variances of point clusters, providing a multiscale Gaussian mixture representation used in Section 4.1.\n\n3.1 Dual Tree Evaluation\n\nMultiscale representations have been effectively applied to kernel density estimation problems. Given a mixture of N Gaussians with means {μ_i}, we would like to evaluate\n\np(x_j) = Σ_i w_i N(x_j; μ_i, Λ)    (3)\n\nat a given set of M points {x_j}. By representing the means {μ_i} and evaluation points {x_j} with two different KD-trees, it is possible to define a dual-tree recursion [6] which is much faster than direct evaluation of all NM kernel-point pairs. The dual-tree algorithm uses bounding box statistics (as in Figure 2(a)) to approximately evaluate subsets of the data. For any set of labels in the density tree ℓ_μ and location tree ℓ_x, one may use pairwise distance bounds (see Figure 3) to find upper and lower bounds on\n\nΣ_{i∈ℓ_μ} w_i N(x_j; μ_i, Λ)    for any    j ∈ ℓ_x    (4)\n\n[Figure 3: Two KD-tree representations may be combined to efficiently bound the maximum (D_max) and minimum (D_min) pairwise distances between subsets of the summarized points (bold).]\n\nWhen the distance bounds are sufficiently tight, the sum in equation (4) may be approximated by a constant, asymptotically allowing evaluation in O(N) operations [6].\n\n4 Sampling using Multiscale Representations\n\n4.1 Gibbs Sampling on KD-Trees\n\nAlthough the pair of Gibbs samplers discussed in Section 2.3 are often effective, they sometimes require a very large number of iterations to produce accurate samples. The most difficult densities are those for which there are multiple widely separated modes, each of which is associated with disjoint subsets of the input mixture labels. 
In this case, conditioned on a set of labels corresponding to one mode, it is very unlikely that a label or data point corresponding to a different mode will be sampled, leading to slow convergence.\n\nSimilar problems have been observed with Gibbs samplers on Markov random fields [9]. In these cases, convergence can often be accelerated by constructing a series of \u201ccoarser scale\u201d approximate models in which the Gibbs sampler can move between modes more easily [10]. The primary challenge in developing these algorithms is to determine procedures for constructing accurate coarse scale approximations. For Gaussian mixture products, KD-trees provide a simple, intuitive, and easily constructed set of coarser scale models.\n\nAs in Figure 2(b), each level of the KD-tree stores the mean and variance (biased by kernel size) of the summarized leaf nodes. We start at the same coarse scale for all input mixtures, and perform standard Gibbs sampling on that scale\u2019s summary Gaussians. After several iterations, we condition on a data sample (as in the parallel Gibbs sampler of Section 2.3) to infer labels at the next finer scale. Intuitively, by gradually moving from coarse to fine scales, multiscale sampling can better explore all of the product density\u2019s important modes.\n\nAs the number of sampling iterations approaches infinity, multiscale samplers have the same asymptotic properties as standard Gibbs samplers. Unfortunately, there is no guarantee that multiscale sampling will improve performance. 
However, our simulation results indicate that it is usually very effective (see Section 5).\n\n4.2 Epsilon-Exact Sampling using KD-Trees\n\nIn this section, we use KD-trees to efficiently compute an approximation to the partition function Z, in a manner similar to the dual tree evaluation algorithm of [6] (see Section 3.1). This leads to an ε-exact sampler for which a label L = [l_1, ..., l_d], with true probability p_L, is guaranteed to be sampled with some probability p̂_L ∈ [p_L − ε, p_L + ε]. We denote subsets of labels in the input densities with lowercase script (ℓ_i), and sets of labels in the product density by 𝐋 = ℓ_1 × ··· × ℓ_d. The approximate sampling procedure is similar to the exact sampler of Section 2.1. We first construct KD-tree representations of each input density (as in Figure 2(a)), and use a multi-tree recursion to approximate the partition function Ẑ = Σ ŵ_L by summarizing sets of labels 𝐋 where possible. Then, we compute the cumulative distribution of the sets of labels, giving each label set 𝐋 probability ŵ_𝐋 / Ẑ.\n\n4.2.1 Approximate Evaluation of the Weight Partition Function\n\nWe first note that the weight function (equation (2)) can be rewritten using terms which involve only pairwise distances (the quotient is computed elementwise):\n\nw_L = ( Π_{j=1}^d w_{l_j} ) · Π_{(i, j>i)} N(μ_{l_i}; μ_{l_j}, Λ^{(i,j)})    where    Λ^{(i,j)} = Λ_i Λ_j / Λ_L    (5)\n\nThis equation may be divided into two parts: a weight contribution Π_{i=1}^d w_{l_i}, and a distance contribution (which we denote by K_L) expressed in terms of the pairwise distances between kernel centers. We use the KD-trees\u2019 distance bounds to compute bounds on each of these pairwise distance terms for a collection of labels 𝐋 = ℓ_1 × ··· × ℓ_d. The product of the upper (lower) pairwise bounds is itself an upper (lower) bound on the total distance contribution for any label L within the set; denote these bounds by K^+_𝐋 and K^−_𝐋, respectively.2\n\nBy using the mean K*_𝐋 = (K^+_𝐋 + K^−_𝐋)/2 to approximate K_L, we incur a maximum error (K^+_𝐋 − K^−_𝐋)/2 for any label L ∈ 𝐋. If this error is less than Zδ (which we ensure by comparing to a running lower bound Z_min on Z), we treat it as constant over the set 𝐋 and approximate the contribution to Z by\n\nΣ_{L∈𝐋} ŵ_L = K*_𝐋 Σ_{L∈𝐋} ( Π_i w_{l_i} ) = K*_𝐋 Π_i ( Σ_{l_i∈ℓ_i} w_{l_i} )    (6)\n\nThis is easily calculated using cached statistics of the weight contained in each set. If the error is larger than Zδ, we need to refine at least one of the label sets; we use a heuristic to make this choice. This procedure is summarized in Algorithm 1. Note that all of the quantities required by this algorithm may be stored within the KD-trees, avoiding searches over the sets ℓ_i. At the algorithm\u2019s termination, the total error is bounded by\n\n|Z − Ẑ| ≤ Σ_L |w_L − ŵ_L| ≤ Σ_𝐋 (K^+_𝐋 − K^−_𝐋)/2 · Π w_{l_i} ≤ Zδ Σ_L Π w_{l_i} ≤ Zδ    (7)\n\nwhere the last inequality follows because each input mixture\u2019s weights are normalized. This guarantees that our estimate Ẑ is within a fractional tolerance δ of its true value.\n\n2 We can also use multipole methods such as the Fast Gauss Transform [11] to efficiently compute alternate, potentially tighter bounds on the pairwise values.\n\nMultiTree([ℓ_1, ..., ℓ_d])\n1. For each pair of distributions (i, j > i), use their bounding boxes to compute\n   (a) K^(i,j)_max ≥ max_{l_i∈ℓ_i, l_j∈ℓ_j} N(x_{l_i} − x_{l_j}; 0, Λ^{(i,j)})\n   (b) K^(i,j)_min ≤ min_{l_i∈ℓ_i, l_j∈ℓ_j} N(x_{l_i} − x_{l_j}; 0, Λ^{(i,j)})\n2. Find K_max = Π_{(i,j>i)} K^(i,j)_max and K_min = Π_{(i,j>i)} K^(i,j)_min\n3. If (K_max − K_min)/2 ≤ Z_min δ, approximate this combination of label sets:\n   (a) ŵ_𝐋 = (K_max + K_min)/2 · (Π_i w_{ℓ_i}), where w_{ℓ_i} = Σ_{l_i∈ℓ_i} w_{l_i} is cached by the KD-trees\n   (b) Z_min = Z_min + K_min (Π_i w_{ℓ_i})\n   (c) Ẑ = Ẑ + ŵ_𝐋\n4. Otherwise, refine one of the label sets:\n   (a) Find argmax_{(i,j)} K^(i,j)_max / K^(i,j)_min such that range(ℓ_i) ≥ range(ℓ_j).\n   (b) Call recursively:\n      i. MultiTree([ℓ_1, ..., Nearer(Left(ℓ_i), Right(ℓ_i), ℓ_j), ..., ℓ_d])\n      ii. MultiTree([ℓ_1, ..., Farther(Left(ℓ_i), Right(ℓ_i), ℓ_j), ..., ℓ_d])\nwhere Nearer (Farther) returns the nearer (farther) of the first two arguments to the third.\n\nAlgorithm 1: Recursive multi-tree algorithm for approximately evaluating the partition function Z of the product of d Gaussian mixture densities represented by KD-trees. Z_min denotes a running lower bound on the partition function, while Ẑ is the current estimate. Initialize Z_min = Ẑ = 0.\n\nGiven the final partition function estimate Ẑ, repeat Algorithm 1 with the following modifications:\n3. (c) If ĉ ≤ Ẑ u_j < ĉ + ŵ_𝐋 for any j, draw L ∈ 𝐋 by sampling l_i ∈ ℓ_i with weight w_{l_i} / w_{ℓ_i}\n3. (d) ĉ = ĉ + ŵ_𝐋\n\nAlgorithm 2: Recursive multi-tree algorithm for approximate sampling. ĉ denotes the cumulative sum of weights ŵ_𝐋. Initialize by sorting N uniform [0, 1] samples {u_j}, and set Z_min = ĉ = 0.\n\n4.2.2 Approximate Sampling from the Cumulative Distribution\n\nTo use the partition function estimate Ẑ for approximate sampling, we repeat the approximation process in a manner similar to the exact sampler: draw N sorted uniform random variables, and then locate these samples in the cumulative distribution. We do not explicitly construct the cumulative distribution, but instead use the same approximate partial weight sums used to determine Ẑ (see equation (6)) to find the block of labels 𝐋 = ℓ_1 × ··· × ℓ_d associated with each sample. Since all labels L ∈ 𝐋 within this block have approximately equal distance contribution K_L ≈ K*_𝐋, we independently sample a label l_i within each set ℓ_i proportionally to the weight w_{l_i}.\n\nThis procedure is shown in Algorithm 2. Note that, to be consistent about when approximations are made and thus produce weights ŵ_𝐋 which still sum to Ẑ, we repeat the procedure for computing Ẑ exactly, including recomputing the running lower bound Z_min. This algorithm is guaranteed to sample each label L with probability p̂_L ∈ [p_L − ε, p_L + ε]:\n\nProof: From our bounds on the error of K*_𝐋, |w_L/Z − ŵ_L/Z| = (|K_L − K*_𝐋| / Z) Π w_{l_i} ≤ δ (Π w_{l_i}) ≤ δ, and |ŵ_L/Z − ŵ_L/Ẑ| = (ŵ_L/Z) |1 − Z/Ẑ| ≤ ((1+δ)/(1−δ)) δ, since ŵ_L/Z ≤ Ẑ/Z ≤ 1+δ and |1 − Z/Ẑ| ≤ δ/(1−δ). Thus, the estimated probability of choosing label L has at most error\n\n|p̂_L − p_L| = |ŵ_L/Ẑ − w_L/Z| ≤ |w_L/Z − ŵ_L/Z| + |ŵ_L/Z − ŵ_L/Ẑ| ≤ 2δ/(1−δ) ≜ ε    (8)\n\n5 Computational Examples\n\n5.1 Products of One-Dimensional Gaussian Mixtures\n\nIn this section, we compare the sampling methods discussed in this paper on three challenging one-dimensional examples, each involving products of mixtures of 100 Gaussians (see Figure 4). 
We measure performance by drawing 100 samples, constructing a kernel density estimate using likelihood cross-validation [1], and calculating the KL divergence from the true product density. We repeat this test 250 times for each of a range of parameter settings of each algorithm, and plot the average KL divergence versus computation time.\n\nFor the product of three mixtures in Figure 4(a), the multiscale (MS) Gibbs samplers dramatically outperform standard Gibbs sampling. In addition, we see that sequential Gibbs sampling is more accurate than parallel. Both of these differences can be attributed to the bimodal product density. However, the most effective algorithm is the ε-exact sampler, which matches exact sampling\u2019s performance in far less time (0.05 versus 2.75 seconds). For a product of five densities (Figure 4(b)), the cost of exact sampling increases to 7.6 hours, but the ε-exact sampler matches its performance in less than one minute. Even faster, however, is the sequential MS Gibbs sampler, which takes only 0.3 seconds.\n\nFor the previous two examples, mixture importance sampling (IS) is nearly as accurate as the best multiscale methods (Gaussian IS seems ineffective). However, in cases where all of the input densities have little overlap with the product density, mixture IS performs very poorly (see Figure 4(c)). In contrast, multiscale samplers perform very well in such situations, because they can discard large numbers of low weight product density kernels.\n\n5.2 Tracking an Object using Nonparametric Belief Propagation\n\nNBP [2] solves inference problems on non-Gaussian graphical models by propagating the results of local sampling computations. Using our multiscale samplers, we applied NBP to a simple tracking problem in which we observe a slowly moving object in a sea of randomly shifting clutter. 
Figure 5 compares the posterior distributions of different samplers two time steps after an observation containing only clutter. ε-exact sampling matches the performance of exact sampling, but takes half as long. In contrast, a standard particle filter [7], allowed ten times more computation, loses track. As in the previous section, multiscale Gibbs sampling is much more accurate than standard Gibbs sampling.\n\n6 Discussion\n\nFor products of a few mixtures, the ε-exact sampler is extremely fast, and is guaranteed to give good performance. As the number of mixtures grows, ε-exact sampling may become overly costly, but the sequential multiscale Gibbs sampler typically produces accurate samples with only a few iterations. We are currently investigating the performance of these algorithms on large-scale nonparametric belief propagation applications.\n\nReferences\n\n[1] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, 1986.\n[2] E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky. Nonparametric belief propagation. In CVPR, 2003.\n[3] M. Isard. PAMPAS: Real-valued graphical models for computer vision. In CVPR, 2003.\n[4] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Technical Report 2000-004, Gatsby Computational Neuroscience Unit, 2000.\n[5] K. Deng and A. W. Moore. Multiresolution instance-based learning. In IJCAI, 1995.\n[6] A. G. Gray and A. W. Moore. Very fast multivariate kernel density estimation. In JSM, 2003.\n[7] A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2001.\n[8] S. Thrun, J. Langford, and D. Fox. Monte Carlo HMMs. In ICML, pages 415\u2013424, 1999.\n[9] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. PAMI, 6(6):721\u2013741, November 1984.\n[10] J. S. 
Liu and C. Sabatti. Generalised Gibbs sampler and multigrid Monte Carlo for Bayesian computation. Biometrika, 87(2):353\u2013369, 2000.\n[11] J. Strain. The fast Gauss transform with variable scales. SIAM J. SSC, 12(5):1131\u20131139, 1991.\n\n[Figure 4: Comparison of average sampling accuracy (KL divergence) versus computation time (sec) for the algorithms discussed in the text: Exact, MS ε-Exact, MS Seq. Gibbs, MS Par. Gibbs, Seq. Gibbs, Par. Gibbs, Gaussian IS, and Mixture IS. Each panel shows the input mixtures and the product mixture. (a) Product of 3 mixtures (exact requires 2.75 sec). (b) Product of 5 mixtures (exact requires 7.6 hours). (c) Product of 2 mixtures (exact requires 0.02 sec).]\n\nFigure 5: Object tracking using NBP. Plots show the posterior distributions two time steps after an observation containing only clutter. The particle filter and Gibbs samplers are allowed equal computation. (a) Latest observations, and exact sampling posterior. (b) ε-exact sampling is very accurate, while a particle filter loses track. 
(c) Multiscale Gibbs sampling leads to improved performance.", "award": [], "sourceid": 2435, "authors": [{"given_name": "Alexander", "family_name": "Ihler", "institution": null}, {"given_name": "Erik", "family_name": "Sudderth", "institution": null}, {"given_name": "William", "family_name": "Freeman", "institution": null}, {"given_name": "Alan", "family_name": "Willsky", "institution": null}]}