{"title": "Greedy Importance Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 596, "page_last": 602, "abstract": null, "full_text": "Greedy importance sampling \n\nDale Schuurmans \n\nDepartment of Computer Science \n\nUniversity of Waterloo \ndale@cs.uwaterloo.ca \n\nAbstract \n\nI present a simple variation of importance sampling that explicitly search(cid:173)\nes for important regions in the target distribution. I prove that the tech(cid:173)\nnique yields unbiased estimates, and show empirically it can reduce the \nvariance of standard Monte Carlo estimators. This is achieved by con(cid:173)\ncentrating samples in more significant regions of the sample space. \n\n1 Introduction \n\nIt is well known that general inference and learning with graphical models is computa(cid:173)\ntionally hard [1] and it is therefore necessary to consider restricted architectures [13], or \napproximate algorithms to perform these tasks [3, 7]. Among the most convenient and \nsuccessful techniques are stochastic methods which are guaranteed to converge to a correct \nsolution in the limit oflarge samples [10, 11, 12, 15]. These methods can be easily applied \nto complex inference problems that overwhelm deterministic approaches. \n\nThe family of stochastic inference methods can be grouped into the independent Monte \nCarlo methods (importance sampling and rejection sampling [4, 10, 14]) and the dependent \nMarkov Chain Monte Carlo (MCMC) methods (Gibbs sampling, Metropolis sampling, and \n\"hybrid\" Monte Carlo) [5, 10, 11, 15]. The goal of all these methods is to simulate drawing \na random sample from a target distribution P (x) (generally defined by a Bayesian network \nor graphical model) that is difficult to sample from directly. \n\nThis paper investigates a simple modification of importance sampling that demonstrates \nsome advantages over independent and dependent-Markov-chain methods. 
The idea is to explicitly search for important regions in a target distribution P when sampling from a simpler proposal distribution Q. Some MCMC methods, such as Metropolis and \"hybrid\" Monte Carlo, attempt to do something like this by biasing a local random search towards higher probability regions while preserving the asymptotic \"fair sampling\" properties of the exploration [11, 12]. Here I investigate a simple direct approach where one draws points from a proposal distribution Q but then explicitly searches in P to find points from significant regions. The main challenge is to maintain correctness (i.e., unbiasedness) of the resulting procedure, which we achieve by independently sampling search subsequences and then weighting the sample points so that their expected weight under the proposal distribution Q matches their true probability under the target P. \n\nImportance sampling \n- Draw x_1, ..., x_n independently from Q. \n- Weight each point x_i by w(x_i) = P(x_i)/Q(x_i). \n- For a random variable, f, estimate E_{P(x)} f(x) by f_hat = (1/n) sum_{i=1}^n f(x_i) w(x_i). \n\n\"Indirect\" importance sampling \n- Draw x_1, ..., x_n independently from Q. \n- Weight each point x_i by u(x_i) = beta P(x_i)/Q(x_i). \n- For a random variable, f, estimate E_{P(x)} f(x) by f_hat = sum_{i=1}^n f(x_i) u(x_i) / sum_{i=1}^n u(x_i). \n\nFigure 1: Regular and \"indirect\" importance sampling procedures \n\n2 Generalized importance sampling \n\nMany inference problems in graphical models can be cast as determining the expected value of a random variable of interest, f, given observations drawn according to a target distribution P. That is, we are interested in computing the expectation E_{P(x)} f(x). Usually the random variable f is simple, like the indicator of some event, but the distribution P is generally not in a form that we can sample from efficiently. 
Importance sampling is a useful technique for estimating E_{P(x)} f(x) in these cases. The idea is to draw independent points x_1, ..., x_n from a simpler \"proposal\" distribution Q, but then weight these points by w(x) = P(x)/Q(x) to obtain a \"fair\" representation of P. Assuming that we can efficiently evaluate P(x) at each point, the weighted sample can be used to estimate desired expectations (Figure 1). The correctness (i.e., unbiasedness) of this procedure is easy to establish, since the expected weighted value of f under Q is just E_{Q(x)} f(x) w(x) = sum_{x in X} f(x) w(x) Q(x) = sum_{x in X} f(x) (P(x)/Q(x)) Q(x) = sum_{x in X} f(x) P(x) = E_{P(x)} f(x). \n\nThis technique can be implemented using \"indirect\" weights u(x) = beta P(x)/Q(x) and an alternative estimator (Figure 1) that only requires us to compute a fixed multiple of P(x). This preserves asymptotic correctness because (1/n) sum_{i=1}^n f(x_i) u(x_i) and (1/n) sum_{i=1}^n u(x_i) converge to beta E_{P(x)} f(x) and beta respectively, which yields f_hat -> E_{P(x)} f(x) (generally [4]). It will always be possible to apply this extended approach below, but we drop it for now. \n\nImportance sampling is an effective estimation technique when Q approximates P over most of the domain, but it fails when Q misses high probability regions of P and systematically yields samples with small weights. In this case, the resulting estimator will have high variance because the sample will almost always contain unrepresentative points but is sometimes dominated by a few high weight points. To overcome this problem it is critical to obtain data points from the important regions of P. Our goal is to avoid generating systematically under-weight samples by explicitly searching for significant regions in the target distribution P. To do this, and maintain the unbiasedness of the resulting procedure, we develop a series of extensions to importance sampling that are each provably correct. 
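The two estimators in Figure 1 can be sketched in a few lines of Python; the function names and the discrete toy setting are illustrative assumptions, not the paper's notation:

```python
import random

def importance_sample(f, p, q_pdf, q_draw, n):
    # Regular importance sampling: draw from Q, weight by w(x) = P(x)/Q(x).
    xs = [q_draw() for _ in range(n)]
    return sum(f(x) * p(x) / q_pdf(x) for x in xs) / n

def indirect_importance_sample(f, p_multiple, q_pdf, q_draw, n):
    # 'Indirect' variant: p_multiple may be any fixed multiple beta*P(x);
    # beta cancels in the ratio of the two sums.
    xs = [q_draw() for _ in range(n)]
    us = [p_multiple(x) / q_pdf(x) for x in xs]
    return sum(f(x) * u for x, u in zip(xs, us)) / sum(us)
```

For instance, with a discrete target P on {0, 1, 2, 3} and a uniform proposal, both estimators converge to E_{P(x)} f(x) as n grows, even when the indirect variant is given only an unnormalized multiple of P.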
\n\nThe first extension is to consider sampling blocks of points instead of just individual points. Let the partition of X consist of finite blocks B, where the blocks are pairwise disjoint, their union is X, and each block is finite. (Note that the partition itself can be infinite.) The \"block\" sampling procedure (Figure 2) draws independent blocks of points to construct the final sample, but then weights points by their target probability P(x) divided by the total block probability Q(B(x)). For discrete spaces it is easy to verify that this procedure yields unbiased estimates, since E_{Q(x)} [sum_{x_j in B(x)} f(x_j) w(x_j)] = sum_{x in X} [sum_{x_j in B(x)} f(x_j) w(x_j)] Q(x) = sum_{B} sum_{x_i in B} [sum_{x_j in B} f(x_j) w(x_j)] Q(x_i) = sum_{B} [sum_{x_j in B} f(x_j) w(x_j)] Q(B) = sum_{B} [sum_{x_j in B} f(x_j) P(x_j)/Q(B)] Q(B) = sum_{B} sum_{x_j in B} f(x_j) P(x_j) = sum_{x in X} f(x) P(x). \n\n\"Block\" importance sampling \n- Draw x_1, ..., x_n independently from Q. \n- For x_i, recover its block B_i = {x_{i,1}, ..., x_{i,b_i}}. \n- Create a large sample out of the blocks: x_{1,1}, ..., x_{1,b_1}, x_{2,1}, ..., x_{2,b_2}, ..., x_{n,1}, ..., x_{n,b_n}. \n- Weight each x_{i,j} by w(x_{i,j}) = P(x_{i,j}) / sum_{k=1}^{b_i} Q(x_{i,k}). \n- For a random variable, f, estimate E_{P(x)} f(x) by f_hat = (1/n) sum_{i=1}^n sum_{j=1}^{b_i} f(x_{i,j}) w(x_{i,j}). \n\n\"Sliding window\" importance sampling \n- Draw x_1, ..., x_n independently from Q. \n- For x_i, recover its block B_i, and let x_{i,1} = x_i: \n  - Get x_{i,1}'s successors x_{i,1}, x_{i,2}, ..., x_{i,m} by climbing up m - 1 steps from x_{i,1}. \n  - Get predecessors x_{i,-m+1}, ..., x_{i,-1}, x_{i,0} by climbing down m - 1 steps from x_{i,1}. \n  - Weight w(x_{i,j}) = P(x_{i,j}) / sum_{k=j-m+1}^{j} Q(x_{i,k}). \n- Create the final sample from the successor points x_{1,1}, ..., x_{1,m}, x_{2,1}, ..., x_{2,m}, ..., x_{n,1}, ..., x_{n,m}. \n- For a random variable, f, estimate E_{P(x)} f(x) by f_hat = (1/n) sum_{i=1}^n sum_{j=1}^m f(x_{i,j}) w(x_{i,j}). \n\nFigure 2: \"Block\" and \"sliding window\" importance sampling procedures \n\nCrucially, this argument does not depend on how the partition of X is chosen. In fact, we could fix any partition, even one that depended on the target distribution P, and still obtain an unbiased procedure (so long as the partition remains fixed). Intuitively, this works because blocks are drawn independently from Q and the weighting scheme still produces a \"fair\" representation of P. (Note that the results presented in this paper can all be extended to continuous spaces under mild technical restrictions. However, for the purposes of clarity we will restrict the technical presentation in this paper to the discrete case.) \n\nThe second extension is to allow countably infinite blocks that each have a discrete total order ... < x_{i-1} < x_i < x_{i+1} < ... defined on their elements. This order could reflect the relative probability of x_i and x_j under P, but for now we just consider it to be an arbitrary discrete order. To cope with blocks of unbounded length, we employ a \"sliding window\" sampling procedure that selects a contiguous sub-block of size m from within a larger selected block (Figure 2). This procedure builds each independent subsample by choosing a random point x_1 from the proposal distribution Q, determining its containing block B(x_1), and then climbing up m - 1 steps to obtain the successors x_1, x_2, ..., x_m, and climbing down m - 1 steps to obtain the predecessors x_{-m+1}, ..., x_{-1}, x_0. The successor points (including x_1) appear in the final sample, but the predecessors are only used to determine the weights of the sample points. Weights are determined by the target probability P(x) divided by the probability that the point x appears in a random reconstruction under Q. 
This too yields an unbiased estimator, since E_{Q(x)} [sum_{j=1}^m f(x_j) w(x_j)] = sum_{x_l in X} [sum_{j=l}^{l+m-1} f(x_j) P(x_j) / sum_{k=j-m+1}^{j} Q(x_k)] Q(x_l) = sum_{B} sum_{x_l in B} sum_{j=l}^{l+m-1} f(x_j) P(x_j) Q(x_l) / sum_{k=j-m+1}^{j} Q(x_k) = sum_{B} sum_{x_j in B} sum_{l=j-m+1}^{j} f(x_j) P(x_j) Q(x_l) / sum_{k=j-m+1}^{j} Q(x_k) = sum_{B} sum_{x_j in B} f(x_j) P(x_j) [sum_{l=j-m+1}^{j} Q(x_l)] / [sum_{k=j-m+1}^{j} Q(x_k)] = sum_{B} sum_{x_j in B} f(x_j) P(x_j) = sum_{x in X} f(x) P(x). \n\n(The middle step breaks the sum into disjoint blocks and then reorders it so that instead of first choosing the start point x_l and then x_l's successors x_l, ..., x_{l+m-1}, we first choose the successor point x_j and then the start points x_{j-m+1}, ..., x_j that could have led to x_j.) Note that this derivation does not depend on the particular block partition nor on the particular discrete orderings, so long as they remain fixed. This means that, again, we can use partitions and orderings that explicitly depend on P and still obtain a correct procedure. \n\n\"Greedy\" importance sampling (1-D) \n- Draw x_1, ..., x_n independently from Q. \n- For each x_i, let x_{i,1} = x_i: \n  - Compute successors x_{i,1}, x_{i,2}, ..., x_{i,m} by taking m - 1 size epsilon steps in the direction of increasing |f(x)P(x)|. \n  - Compute predecessors x_{i,-m+1}, ..., x_{i,-1}, x_{i,0} by taking m - 1 size epsilon steps in the direction of decrease. \n  - If an improper ascent or descent occurs, truncate paths at the collision point. \n  - Weight w(x_{i,j}) = P(x_{i,j}) / sum_{k=j-m+1}^{j} Q(x_{i,k}). \n- Create the final sample from the successor points x_{1,1}, ..., x_{1,m}, x_{2,1}, ..., x_{2,m}, ..., x_{n,1}, ..., x_{n,m}. \n- For a random variable, f, estimate E_{P(x)} f(x) by f_hat = (1/n) sum_{i=1}^n sum_{j=1}^m f(x_{i,j}) w(x_{i,j}). \n\nFigure 3: \"Greedy\" importance sampling procedure; \"colliding\" and \"merging\" paths. \n\n3 Greedy importance sampling: 1-dimensional case \n\nFinally, we apply the sliding window procedure to conduct an explicit search for important regions in X. It is well known that the optimal proposal distribution for importance sampling is just Q*(x) = |f(x)P(x)| / sum_{x' in X} |f(x')P(x')|, which minimizes variance [2]. Here we apply the sliding window procedure using an order structure that is determined by the objective |f(x)P(x)|. The hope is to obtain reduced variance by sampling independent blocks of points where each block (by virtue of being constructed via an explicit search) is likely to contain at least one or two high weight points. That is, by capturing a moderate size sample of independent high weight points we intuitively expect to outperform standard methods that are unlikely to observe such points by chance. Our experiments below verify this intuition (Figure 4). \n\nThe main technical issue is maintaining unbiasedness, which is easy to establish in the 1-dimensional case. In the simple 1-d setting, the \"greedy\" importance sampling procedure (Figure 3) first draws an initial point x_1 from Q and then follows the direction of increasing |f(x)P(x)|, taking fixed size epsilon steps, until either m - 1 steps have been taken or we encounter a critical point. A single \"block\" in our final sample is comprised of a complete sequence captured in one ascending search. To weight the sample points we account for all possible ways each point could appear in a subsample, which, as before, entails climbing down m - 1 steps in the descent direction (to calculate the denominators). 
The unbiasedness of the procedure then follows directly from the previous section, since greedy importance sampling is equivalent to sliding window importance sampling in this setting. \n\nThe only nontrivial issue is to maintain disjoint search paths. Note that a search path must terminate whenever it steps from a point x* to a point x** with lower value; this indicates that a collision has occurred, because some other path must reach x* from the \"other side\" of the critical point (Figure 3). At a collision, the largest ascent point x* must be allocated to a single path. A reasonable policy is to allocate x* to the path that has the lowest weight penultimate point (but the only critical issue is ensuring that it gets assigned to a single block). By ensuring that the critical point is included in only one of the two distinct search paths, a practical estimator can be obtained that exhibits no bias (Figure 4). \n\nTo test the effectiveness of the greedy approach I conducted several 1-dimensional experiments which varied the relationship between P, Q and the random variable f (Figure 4). In these experiments greedy importance sampling strongly outperformed standard methods, including regular importance sampling and directly sampling from the target distribution P (rejection sampling and Metropolis sampling were not competitive). The results not only verify the unbiasedness of the greedy procedure, but also show that it obtains significantly smaller variances across a wide range of conditions. Note that the greedy procedure actually uses m out of 2m - 1 points sampled for each block and therefore effectively uses a double sample. However, Figure 4 shows that the greedy approach often obtains variance reductions that are far greater than 2 (which corresponds to a standard deviation reduction of sqrt(2)). 
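The 1-dimensional procedure can be sketched on a finite grid as follows. This is an illustrative sketch, not the paper's implementation: the reconstruction probability that the descent steps compute is obtained here by direct enumeration over the grid, which also handles collisions automatically, since every point is weighted by the exact Q-mass of the start points whose ascent path reaches it:

```python
import random

def greedy_is_1d(f, p, q, domain, m, n):
    # 1-D greedy importance sampling sketch. p and q are probability
    # tables over the grid points in domain. Each search path ascends
    # |f(x)P(x)| from its start point for up to m points.
    g = {x: abs(f(x) * p[x]) for x in domain}

    def succ(x):
        # one grid step in the direction of increasing |fP|
        for y in (x + 1, x - 1):
            if y in g and g[y] > g[x]:
                return y
        return x  # critical point: the path terminates

    def path(x):
        pts = [x]
        while len(pts) < m:
            y = succ(pts[-1])
            if y == pts[-1]:
                break
            pts.append(y)
        return pts

    # weight denominator: total Q-mass of starts whose path reaches x
    denom = {x: 0.0 for x in domain}
    for start in domain:
        for x in path(start):
            denom[x] += q[start]

    qs = [q[x] for x in domain]
    total = 0.0
    for _ in range(n):
        start = random.choices(domain, weights=qs)[0]
        total += sum(f(x) * p[x] / denom[x] for x in path(start))
    return total / n
```

Because every point belongs to its own path, the expectation of a block's weighted sum telescopes to sum_x f(x)P(x), mirroring the derivation in Section 2.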
\n\n4 Multi-dimensional case \n\nOf course, this technique is worthwhile only if it can be applied to multi-dimensional problems. In principle, it is straightforward to apply the greedy procedure of Section 3 to multi-dimensional sample spaces. The only new issue is that discrete search paths can now possibly \"merge\" as well as \"collide\"; see Figure 3. (Recall that paths could not merge in the previous case.) Therefore, instead of decomposing the domain into a collection of disjoint search paths, the objective |f(x)P(x)| now decomposes the domain into a forest of disjoint search trees. However, the same principle could be used to devise an unbiased estimator in this case: one could assign a weight to a sample point x that is just its target probability P(x) divided by the total Q-probability of the subtree of points that lead to x in fewer than m steps. This weighting scheme can be shown to yield an unbiased estimator as before. However, the resulting procedure is impractical because in an N-dimensional sample space a search tree will typically have a branching factor of order N, yielding exponentially large trees. Avoiding the need to exhaustively examine such trees is the critical issue in applying the greedy approach to multi-dimensional spaces. \n\nThe simplest conceivable strategy is just to ignore merge events. Surprisingly, this turns out to work reasonably well in many circumstances. Note that merges will be a measure zero event in many continuous domains. In such cases one could hope to ignore merges and trust that the probability of \"double counting\" such points would remain near zero. I conducted simple experiments with a version of the greedy importance sampling procedure that ignored merges. This procedure searched in the gradient ascent direction of the objective |f(x)P(x)| and heuristically inverted search steps by climbing in the gradient descent direction. 
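A single merge-ignoring block of this kind might look like the following sketch. The numeric gradients, the step rule, and all names are my illustrative assumptions, and with f constant the search objective |f(x)P(x)| reduces to P itself:

```python
import math

def greedy_block_2d(x0, p, q, m, eps):
    # One heuristic greedy block: ascend log P with m-1 fixed-size steps
    # from the proposal draw x0, invert them heuristically with descent
    # steps, and weight each successor point by P over the Q-mass of its
    # m-point window. Merges between paths are ignored, as in the text.
    def grad_log_p(x):
        h = 1e-5
        gx = (math.log(p((x[0] + h, x[1]))) - math.log(p((x[0] - h, x[1])))) / (2 * h)
        gy = (math.log(p((x[0], x[1] + h))) - math.log(p((x[0], x[1] - h)))) / (2 * h)
        return gx, gy

    def step(x, sign):
        gx, gy = grad_log_p(x)
        norm = math.hypot(gx, gy) or 1.0
        return (x[0] + sign * eps * gx / norm, x[1] + sign * eps * gy / norm)

    ups = [x0]
    for _ in range(m - 1):          # successors x_1 .. x_m (ascent)
        ups.append(step(ups[-1], 1.0))
    downs = [x0]
    for _ in range(m - 1):          # predecessors (heuristic descent)
        downs.append(step(downs[-1], -1.0))
    seq = downs[1:][::-1] + ups     # x_{2-m}, ..., x_0, x_1, ..., x_m
    weighted = []
    for j in range(1, m + 1):       # weight the m successor points
        denom = sum(q(seq[j - 1 + d]) for d in range(m))
        weighted.append((seq[m - 2 + j], p(seq[m - 2 + j]) / denom))
    return weighted
```

Averaging f(x) times the weight over many such blocks, with x0 drawn from Q, gives the heuristic estimator evaluated in the experiments below.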
Figures 5 and 6 show that, despite the heuristic nature of this procedure, it nevertheless demonstrates credible performance on simple tasks. \n\nThe first experiment is a simple demonstration from [12, 10] where the task is to sample from a bivariate Gaussian distribution P of two highly correlated random variables using a \"weak\" proposal distribution Q that is standard normal (depicted by the elliptical and circular one standard deviation contours in Figure 5, respectively). Greedy importance sampling once again performs very well (Figure 5), achieving unbiased estimates with lower variance than standard Monte Carlo estimators, including common MCMC methods. \n\nTo conduct a more significant study, I applied the heuristic greedy method to an inference problem in graphical models: recovering the hidden state sequence from a dynamic probabilistic model, given a sequence of observations. Here I considered a simple Kalman filter model which had one state variable and one observation variable per time-step, and used the conditional distributions X_t | X_{t-1} ~ N(x_{t-1}, sigma_x^2), Z_t | X_t ~ N(x_t, sigma_z^2) and initial distribution X_1 ~ N(0, sigma_x^2). The problem was to infer the value of the final state variable X_t given the observations z_1, z_2, ..., z_t. Figure 6 again demonstrates that the greedy approach has a strong advantage over standard importance sampling. (In fact, the greedy approach can be applied to \"condensation\" [6, 8] to obtain further improvements on this task, but space bounds preclude a detailed discussion.) \n\nOverall, these preliminary results show that despite the heuristic choices made in this section, the greedy strategy still performs well relative to common Monte Carlo estimators, both in terms of bias and variance (at least on some low and moderate dimension problems). 
However, the heuristic nature of this procedure makes it extremely unsatisfying. In fact, merge points can easily make up a significant fraction of finite domains. It turns out that a rigorously unbiased and feasible procedure can be obtained as follows. First, take greedy fixed size steps in axis parallel directions (which ensures the steps can be inverted). Then, rather than exhaustively explore an entire predecessor tree to calculate the weights of a sample point, use the well known technique of Knuth [9] to sample a single path from the root and obtain an unbiased estimate of the total Q-probability of the tree. This procedure allows one to formulate an asymptotically unbiased estimator that is nevertheless feasible to implement. It remains important future work to investigate this approach and compare it to other Monte Carlo estimation methods on large dimensional problems, in particular hybrid Monte Carlo [11, 12]. The current results already suggest that the method could have benefits. \n\nReferences \n\n[1] P. Dagum and M. Luby. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60:141-153, 1993. \n\n[2] M. Evans. Chaining via annealing. Annals of Statistics, 19:382-393, 1991. \n\n[3] B. Frey. Graphical Models for Machine Learning and Digital Communication. MIT Press, Cambridge, MA, 1998. \n\n[4] J. Geweke. Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57:1317-1339, 1989. \n\n[5] W. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, 1996. \n\n[6] M. Isard and A. Blake. Contour tracking by stochastic propagation of conditional density. In ECCV, 1996. \n\n[7] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. In Learning in Graphical Models. Kluwer, 1998. \n\n[8] K. Kanazawa, D. Koller, and S. Russell. 
Stochastic simulation algorithms for dynamic probabilistic networks. In UAI, 1995. \n\n[9] D. Knuth. Estimating the efficiency of backtrack programs. Mathematics of Computation, 29(129):121-136, 1975. \n\n[10] D. MacKay. Introduction to Monte Carlo methods. In Learning in Graphical Models. Kluwer, 1998. \n\n[11] R. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical report, University of Toronto, 1993. \n\n[12] R. Neal. Bayesian Learning for Neural Networks. Springer, New York, 1996. \n\n[13] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. \n\n[14] R. Shachter and M. Peot. Simulation approaches to general probabilistic inference in belief networks. In Uncertainty in Artificial Intelligence 5. Elsevier, 1990. \n\n[15] M. Tanner. Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. Springer, New York, 1993. \n\nResults for four 1-D problems (Direct / Greedy / Importance): \nProblem 1: mean 0.779 / 0.781 / 0.777; bias 0.001 / 0.001 / 0.003; stdev 0.071 / 0.038 / 0.088 \nProblem 2: mean 1.038 / 1.044 / 1.032; bias 0.002 / 0.003 / 0.008; stdev 0.065 / 0.049 / 0.475 \nProblem 3: mean 0.258 / 0.208 / 0.209; bias 0.049 / 0.000 / 0.001; stdev 0.838 / 0.010 / 0.095 \nProblem 4: mean 6.024 / 6.028 / 6.033; bias 0.001 / 0.004 / 0.009; stdev 0.069 / 0.037 / 0.094 \n\nFigure 4: 1-dimensional experiments: 1000 repetitions on estimation samples of size 100. Problems with varying relationships between P, Q, f and |fP|. \n\n
, \n\nmean \nbias \nstdev \n\nDirect \n0.1884 \n0.0022 \n0.07 \n\nGreedy \n0.1937 \n0.0075 \n0.1374 \n\nImportance \n\n0.1810 \n0.0052 \n0.1762 \n\nRejection \n0.1506 \n0.0356 \n0.2868 \n\nGibbs \n0.3609 \n0.1747 \n0.5464 \n\nMetropolis \n\n8.3609 \n8.1747 \n22.1212 \n\nFigure 5: 2-dimensional experiments: 500 repetitions on estimation samples of size 200. \nPictures depict: direct, greedy importance, regular importance, and Gibbs sampling, show(cid:173)\ning 1 standard deviation countours (dots are sample points, vertical lines are weights). \n\nmean \nbias \nstdev \n\nImportance \n\n5.2269 \n2.7731 \n1.2107 \n\nGreedy \n6.9236 \n1.0764 \n0.1079 \n\nFigure 6: A 6-dimensional experiment: 500 repetitions on estimation samples of size 200. \nEstimating the value of Xt given the observations Zl, \"\" Zt. Pictures depict paths sampled \nby regular versus greedy importance sampling. \n\n\f", "award": [], "sourceid": 1691, "authors": [{"given_name": "Dale", "family_name": "Schuurmans", "institution": null}]}