{"title": "Policy Search via Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 1022, "page_last": 1028, "abstract": null, "full_text": "Policy Search via Density Estimation \n\nAndrew Y. Ng \nComputer Science Division \nU.C. Berkeley \nBerkeley, CA 94720 \nang@cs.berkeley.edu \n\nRonald Parr \nComputer Science Dept. \nStanford University \nStanford, CA 94305 \nparr@cs.stanford.edu \n\nDaphne Koller \nComputer Science Dept. \nStanford University \nStanford, CA 94305 \nkoller@cs.stanford.edu \n\nAbstract \n\nWe propose a new approach to the problem of searching a space of stochastic controllers for a Markov decision process (MDP) or a partially observable Markov decision process (POMDP). Following several other authors, our approach is based on searching in parameterized families of policies (for example, via gradient descent) to optimize solution quality. However, rather than trying to estimate the values and derivatives of a policy directly, we do so indirectly using estimates for the probability densities that the policy induces on states at the different points in time. This enables our algorithms to exploit the many techniques for efficient and robust approximate density propagation in stochastic systems. We show how our techniques can be applied both to deterministic propagation schemes (where the MDP's dynamics are given explicitly in compact form) and to stochastic propagation schemes (where we have access only to a generative model, or simulator, of the MDP). We present empirical results for both of these variants on complex problems. \n\n1 Introduction \n\nIn recent years, there has been growing interest in algorithms for approximate planning in (exponentially or even infinitely) large Markov decision processes (MDPs) and partially observable MDPs (POMDPs). 
For such large domains, the value and Q-functions are sometimes complicated and difficult to approximate, even though there may be simple, compactly representable policies that perform very well. This observation has led to particular interest in direct policy search methods (e.g., [9, 8, 1]), which attempt to choose a good policy from some restricted class Π of policies. In our setting, Π = {π_θ : θ ∈ ℝ^m} is a class of policies smoothly parameterized by θ ∈ ℝ^m. If the value of π_θ is differentiable in θ, then gradient ascent methods may be used to find a locally optimal π_θ. However, estimating values of π_θ (and the associated gradient) is often far from trivial. One simple method for estimating π_θ's value involves executing one or more Monte Carlo trajectories using π_θ, and then taking the average empirical return; cleverer algorithms executing single trajectories also allow gradient estimates [9, 1]. These methods have become a standard approach to policy search, and sometimes work fairly well. \n\nIn this paper, we propose a somewhat different approach to this value/gradient estimation problem. Rather than estimating these quantities directly, we estimate the probability density over the states of the system induced by π_θ at different points in time. These time slice densities completely determine the value of the policy π_θ. While density estimation is not an easy problem, we can utilize existing approaches to density propagation [3, 5], which allow users to specify prior knowledge about the densities, and which have also been shown, both theoretically and empirically, to provide robust estimates for time slice densities. 
We show how direct policy search can be implemented using this approach in two very different settings of the planning problem: In the first, we have access to an explicit model of the system dynamics, allowing us to provide an explicit algebraic operator that implements the approximate density propagation process. In the second, we have access only to a generative model of the dynamics (which allows us only to sample from, but does not provide an explicit representation of, next-state distributions). We show how both of our techniques can be combined with gradient ascent in order to perform policy search, a somewhat subtle argument in the case of the sampling-based approach. We also present empirical results for both variants in complex domains. \n\n2 Problem description \n\nA Markov decision process (MDP) is a tuple (S, s_0, A, R, P) where:\u00b9 S is a (possibly infinite) set of states; s_0 \u2208 S is a start state; A is a finite set of actions; R is a reward function R : S \u2192 [0, R_max]; P is a transition model P : S \u00d7 A \u2192 \u0394_S, such that P(s' | s, a) gives the probability of landing in state s' upon taking action a in state s. \n\nA stochastic policy is a map \u03c0 : S \u2192 \u0394_A, where \u03c0(a | s) is the probability of taking action a in state s. There are many ways of defining a policy \u03c0's \"quality\" or value. For a horizon T and discount factor \u03b3, the finite horizon discounted value function V^{T,\u03b3}[\u03c0] is defined by V^{0,\u03b3}[\u03c0](s) = R(s); V^{t+1,\u03b3}[\u03c0](s) = R(s) + \u03b3 \u03a3_a \u03c0(a | s) \u03a3_{s'} P(s' | s, a) V^{t,\u03b3}[\u03c0](s'). For an infinite state space (here and below), the summation is replaced by an integral. We can now define several optimality criteria. The finite horizon total reward with horizon T is V^T[\u03c0] = V^{T,1}[\u03c0](s_0). The infinite horizon discounted reward with discount \u03b3 < 1 is V^\u03b3[\u03c0] = lim_{T\u2192\u221e} V^{T,\u03b3}[\u03c0](s_0). 
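As a concrete check of these definitions, the finite-horizon discounted recurrence can be evaluated directly on a small finite MDP. The two-state chain below is a minimal sketch; every number in it is illustrative and not from the paper:

```python
import numpy as np

# Toy 2-state, 2-action MDP; all numbers are illustrative, not from the paper.
R = np.array([0.0, 1.0])                   # reward R(s)
P = np.zeros((2, 2, 2))                    # P[s, a, s'] = P(s' | s, a)
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.5, 0.5]
P[1, 0] = [0.0, 1.0]; P[1, 1] = [0.3, 0.7]
pi = np.full((2, 2), 0.5)                  # uniform stochastic policy pi(a | s)
gamma = 0.9

def value(T):
    """Finite-horizon discounted value by the recurrence
    V^{t+1,gamma}(s) = R(s) + gamma * sum_a pi(a|s) sum_s' P(s'|s,a) V^{t,gamma}(s')."""
    # Policy-averaged transition matrix P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a).
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = R.copy()                           # base case V^{0,gamma} = R
    for _ in range(T):
        V = R + gamma * P_pi @ V
    return V

V10 = value(10)
```

Averaging the transition model over the policy first (the `P_pi` matrix) is what lets the recurrence run as a plain matrix-vector iteration.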
The infinite horizon average reward is V_avg[\u03c0] = lim_{T\u2192\u221e} (1/T) V^{T,1}[\u03c0](s_0), where we assume that the limit exists. \n\nFix an optimality criterion V. Our goal is to find a policy that has a high value. As discussed, we assume we have a restricted set \u03a0 of policies, and wish to select a good \u03c0 \u2208 \u03a0. We assume that \u03a0 = {\u03c0_\u03b8 | \u03b8 \u2208 \u211d^m} is a set of policies parameterized by \u03b8 \u2208 \u211d^m, and that \u03c0_\u03b8(a | s) is continuously differentiable in \u03b8 for each s, a. As a very simple example, we may have a one-dimensional state, two-action MDP with \"sigmoidal\" \u03c0_\u03b8, such that the probability of choosing action a_0 at state x is \u03c0_\u03b8(a_0 | x) = 1/(1 + exp(\u2212\u03b8_1 \u2212 \u03b8_2 x)). Note that this framework also encompasses cases where our family \u03a0 consists of policies that depend only on certain aspects of the state. In particular, in POMDPs, we can restrict attention to policies that depend only on the observables. This restriction results in a subclass of stochastic memory-free policies. By introducing artificial \"memory bits\" into the process state, we can also define stochastic limited-memory policies [6]. \n\nEach \u03b8 has a value V[\u03b8] = V[\u03c0_\u03b8], as specified above. To find the best policy in \u03a0, we can search for the \u03b8 that maximizes V[\u03b8]. If we can compute or approximate V[\u03b8], there are many algorithms that can be used to find a local maximum. Some, such as Nelder-Mead simplex search (not to be confused with the simplex algorithm for linear programs), require only the ability to evaluate the function being optimized at any point. If we can compute or estimate V[\u03b8]'s gradient with respect to \u03b8, we can also use a variety of (deterministic or stochastic) gradient ascent methods. \n\n\u00b9We write rewards as R(s) rather than R(s, a), and assume a single start state rather than an initial-state distribution, only to simplify exposition; these and several other minor extensions are trivial. \n\n3 Densities and value functions \n\nMost optimization algorithms require some method for computing V[\u03b8] for any \u03b8 (and sometimes also its gradient). In many real-life MDPs, however, doing so exactly is completely infeasible, due to the large or even infinite number of states. Here, we will consider an approach to estimating these quantities, based on a density-based reformulation of the value function expression. A policy \u03c0 induces a probability distribution over the states at each time t. Letting \u03c6^(0) be the initial distribution (giving probability 1 to s_0), we define the time slice distributions via the recurrence: \n\n\u03c6^(t+1)(s') = \u03a3_s \u03c6^(t)(s) \u03a3_a \u03c0(a | s) P(s' | s, a). (1) \n\nIt is easy to verify that the standard notions of value defined earlier can be reformulated in terms of \u03c6^(t); e.g., V^{T,\u03b3}[\u03c0](s_0) = \u03a3_{t=0}^{T} \u03b3^t (\u03c6^(t) \u00b7 R), where \u00b7 is the dot-product operation (equivalently, the expectation of R with respect to \u03c6^(t)). Somewhat more subtly, for the case of infinite horizon average reward, we have that V_avg[\u03c0] = \u03c6^(\u221e) \u00b7 R, where \u03c6^(\u221e) is the limiting distribution of (1), if one exists. \n\nThis reformulation gives us an alternative approach to evaluating the value of a policy \u03c0_\u03b8: we first compute the time slice densities \u03c6^(t) (or \u03c6^(\u221e)), and then use them to compute the value. Unfortunately, that modification, by itself, does not resolve the difficulty. Representing and computing probability densities over large or infinite spaces is often no easier than representing and computing value functions. However, several results [3, 5] indicate that representing and computing high-quality approximate densities may often be quite feasible. The general approach is an approximate density propagation algorithm, using time-slice distributions in some restricted family \u039e. For example, in continuous spaces, \u039e might be the set of multivariate Gaussians. 
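The equivalence between the time-slice reformulation and the usual backward value recurrence can be verified numerically on any small MDP. A minimal sketch (the two-state MDP below is a hypothetical example, not from the paper):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; numbers are for illustration only.
R = np.array([0.0, 1.0])
P = np.zeros((2, 2, 2))                        # P[s, a, s'] = P(s' | s, a)
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.5, 0.5]
P[1, 0] = [0.0, 1.0]; P[1, 1] = [0.3, 0.7]
pi = np.full((2, 2), 0.5)
gamma, T, s0 = 0.9, 30, 0

# Policy-averaged transition matrix P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a).
P_pi = np.einsum('sa,sat->st', pi, P)

# Forward pass: propagate time-slice densities phi^(t) from phi^(0) = delta_{s0}
# via phi^(t+1)(s') = sum_s phi^(t)(s) sum_a pi(a|s) P(s'|s,a), accumulating
# the value as sum_t gamma^t (phi^(t) . R).
phi = np.zeros(2); phi[s0] = 1.0
value_from_densities = 0.0
for t in range(T + 1):
    value_from_densities += gamma**t * (phi @ R)
    phi = phi @ P_pi

# Backward pass: the standard recurrence V^{t+1} = R + gamma * P_pi V^t.
V = R.copy()
for _ in range(T):
    V = R + gamma * P_pi @ V
```

Both passes compute the same quantity, so `value_from_densities` agrees with `V[s0]` up to floating-point rounding.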
\n\nThe approximate propagation algorithm modifies equation (1) to maintain the time-slice densities in \u039e. More precisely, for a policy \u03c0_\u03b8, we can view (1) as defining an operator \u03a6[\u03b8] that takes one distribution in \u0394_S and returns another. For our current policy \u03c0_{\u03b8_0}, we can rewrite (1) as: \u03c6^(t+1) = \u03a6[\u03b8_0](\u03c6^(t)). In most cases, \u039e will not be closed under \u03a6; approximate density propagation algorithms use some alternative operator \u03a6\u0302, with the properties that, for \u03c6 \u2208 \u039e: (a) \u03a6\u0302(\u03c6) is also in \u039e, and (b) \u03a6\u0302(\u03c6) is (hopefully) close to \u03a6(\u03c6). We use \u03a6\u0302[\u03b8] to denote the approximation to \u03a6[\u03b8], and \u03c6\u0302^(t) to denote (\u03a6\u0302[\u03b8])^(t)(\u03c6^(0)). If \u03a6\u0302 is selected carefully, it is often the case that \u03c6\u0302^(t) is close to \u03c6^(t). Indeed, a standard contraction analysis for stochastic processes can be used to show: \n\nProposition 1 Assume that for all t, \u2016\u03a6(\u03c6\u0302^(t)) \u2212 \u03a6\u0302(\u03c6\u0302^(t))\u2016_1 \u2264 \u03b5. Then there exists some constant \u03bb such that for all t, \u2016\u03c6^(t) \u2212 \u03c6\u0302^(t)\u2016_1 \u2264 \u03b5/\u03bb. \n\nIn some cases, \u03bb might be arbitrarily small, in which case the proposition is meaningless. However, there are many systems where \u03bb is reasonable (and independent of \u03b5) [3]. Furthermore, empirical results also show that approximate density propagation can often track the exact time slice distributions quite accurately. \n\nApproximate tracking can now be applied to our planning task. Given an optimality criterion V expressed with the \u03c6^(t)s, we define an approximation V\u0302 to it by replacing each \u03c6^(t) with \u03c6\u0302^(t); e.g., V\u0302^{T,\u03b3}[\u03c0](s_0) = \u03a3_{t=0}^{T} \u03b3^t \u03c6\u0302^(t) \u00b7 R. Accuracy guarantees on approximate tracking induce comparable guarantees on the value approximation; from this, guarantees on the performance of a policy \u03c0_\u03b8\u0302 found by optimizing V\u0302 are also possible: \n\nProposition 2 Assume that, for all t, we have that \u2016\u03c6^(t) \u2212 \u03c6\u0302^(t)\u2016_1 \u2264 \u03b4. 
Then for each fixed T and \u03b3, |V^{T,\u03b3}[\u03c0](s_0) \u2212 V\u0302^{T,\u03b3}[\u03c0](s_0)| = O(\u03b4). \n\nProposition 3 Let \u03b8* = argmax_\u03b8 V[\u03b8] and \u03b8\u0302 = argmax_\u03b8 V\u0302[\u03b8]. If max_\u03b8 |V[\u03b8] \u2212 V\u0302[\u03b8]| \u2264 \u03b5, then V[\u03b8*] \u2212 V[\u03b8\u0302] \u2264 2\u03b5. \n\n4 Differentiating approximate densities \n\nIn this section we discuss two very different techniques for maintaining an approximate density \u03c6\u0302^(t) using an approximate propagation operator \u03a6\u0302, and show when and how they can be combined with gradient ascent to perform policy search. In general, we will assume that \u039e is a family of distributions parameterized by \u03be \u2208 \u211d^\u2113. For example, if \u039e is the set of d-dimensional multivariate Gaussians with diagonal covariance matrices, \u03be would be a 2d-dimensional vector, specifying the mean vector and the covariance matrix's diagonal. \n\nNow, consider the task of doing gradient ascent over the space of policies, using some optimality criterion V\u0302, say V\u0302^{T,\u03b3}[\u03b8]. Differentiating it relative to \u03b8, we get \u2207_\u03b8 V\u0302^{T,\u03b3}[\u03b8] = \u03a3_{t=0}^{T} \u03b3^t (\u2202\u03c6\u0302^(t)/\u2202\u03b8) \u00b7 R. To avoid introducing new notation, we also use \u03c6\u0302^(t) to denote the associated vector of parameters \u03be \u2208 \u211d^\u2113. These parameters are a function of \u03b8. Hence, the internal gradient term is represented by an \u2113 \u00d7 m Jacobian matrix, with entries representing the derivative of a parameter \u03be_i relative to a parameter \u03b8_j. This gradient can be computed using a simple recurrence, based on the chain rule for derivatives: \n\n\u2202\u03c6\u0302^(t+1)/\u2202\u03b8 = \u2202\u03a6\u0302[\u03b8](\u03c6\u0302^(t))/\u2202\u03b8 + (\u2202\u03a6\u0302[\u03b8](\u03c6\u0302^(t))/\u2202\u03c6\u0302^(t)) (\u2202\u03c6\u0302^(t)/\u2202\u03b8). (2) \n\nThe first summand (an \u2113 \u00d7 m Jacobian) is the derivative of the transition operator \u03a6\u0302 relative to the policy parameters \u03b8. The second is a product of two terms: the derivative of \u03a6\u0302 relative to the distribution parameters, and the result of the previous step in the recurrence. \n\n4.1 Deterministic density propagation \n\nConsider a transition operator \u03a6 (for simplicity, we omit the dependence on \u03b8). 
The idea in this approach is to try to get \u03a6\u0302(\u03c6) to be as close as possible to \u03a6(\u03c6), subject to the constraint that \u03a6\u0302(\u03c6) \u2208 \u039e. Specifically, we define a projection operator \u0393 that takes a distribution \u03c8 not in \u039e, and returns a distribution in \u039e which is closest (in some sense) to \u03c8. We then define \u03a6\u0302(\u03c6) = \u0393(\u03a6(\u03c6)). In order to ensure that gradient ascent applies in this setting, we need only ensure that \u0393 and \u03a6 are differentiable functions. Clearly, there are many instantiations of this idea for which this assumption holds. We provide two examples. \n\nConsider a continuous-state process with nonlinear dynamics, where \u03a6 is a mixture of conditional linear Gaussians. We can define \u039e to be the set of multivariate Gaussians. The operator \u0393 takes a distribution (a mixture of Gaussians) \u03c8 and computes its mean and covariance matrix. This can be easily computed from \u03c8's parameters using simple differentiable algebraic operations. \n\nA very different example is the algorithm of [3] for approximate density propagation in dynamic Bayesian networks (DBNs). A DBN is a structured representation of a stochastic process that exploits conditional independence properties of the distribution to allow compact representation. In a DBN, the state space is defined as a set of possible assignments x to a set of random variables X_1, ..., X_n. The transition model P(x' | x) is described using a Bayesian network fragment over the nodes {X_1, ..., X_n, X'_1, ..., X'_n}. A node X_i represents X_i^(t) and X'_i represents X_i^(t+1). The nodes X_i in the network are forced to be roots (i.e., have no parents), and are not associated with conditional probability distributions. Each node X'_i is associated with a conditional probability distribution (CPD), which specifies P(X'_i | Parents(X'_i)). The transition probability P(X' | X) is defined as \u220f_i P(X'_i | Parents(X'_i)). DBNs support a compact representation of complex transition models in MDPs [2]. We can extend the DBN to encode the behavior of an MDP with a stochastic policy \u03c0 by introducing a new random variable A representing the action taken at the current time. The parents of A will be those variables in the state on which the action is allowed to depend. The CPD of A (which may be compactly represented with function approximation) is the distribution over actions defined by \u03c0 for the different contexts. \n\nIn discrete DBNs, the number of states grows exponentially with the number of state variables, making an explicit representation of a joint distribution impractical. The algorithm of [3] defines \u039e to be a set of distributions defined compactly as a set of marginals over smaller clusters of variables. In the simplest example, \u039e is the set of distributions where X_1, ..., X_n are independent. The parameters \u03be defining a distribution in \u039e are the parameters of n multinomials. The projection operator \u0393 simply marginalizes distributions onto the individual variables, and is differentiable. One useful corollary of [3]'s analysis is that the decay rate of a structured \u03a6\u0302 over \u039e can often be much higher than the decay rate of \u03a6, so that multiple applications of \u03a6\u0302 can converge very rapidly to a stationary distribution; this property is very useful when approximating \u03c6^(\u221e) to optimize relative to V_avg. \n\n4.2 Stochastic density propagation \n\nIn many settings, the assumption that we have direct access to \u03a6 is too strong. A weaker assumption is that we have access to a generative model: a black box from which we can generate samples with the appropriate distribution; i.e., for any s, a, we can generate samples s' from P(s' | s, a). In this case, we use a different approximation scheme, based on [5]. The operator \u03a6\u0302 is a stochastic operator. 
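The moment-matching projection Γ from the first example above is just a weighted mean and a law-of-total-variance computation. A 1-D sketch with illustrative mixture parameters (not from the paper):

```python
import numpy as np

# Gamma: collapse a 1-D Gaussian mixture onto the single Gaussian that
# matches its mean and variance. All mixture numbers are illustrative.
weights = np.array([0.3, 0.7])       # mixture weights (sum to 1)
means = np.array([-1.0, 2.0])        # component means
variances = np.array([0.5, 1.5])     # component variances

# Mean of the mixture: sum_k w_k mu_k.
mu = weights @ means
# Variance via the law of total variance: E[Var] + Var[E].
var = weights @ variances + weights @ (means - mu) ** 2
```

Both steps are compositions of differentiable algebraic operations in the mixture parameters, which is exactly the property needed for the gradient recurrence (2).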
It takes the distribution \u03c6, and generates some number of random state samples s_i from it. Then, for each s_i and each action a, we generate a sample s'_i from the transition distribution P(\u00b7 | s_i, a). This sample (s_i, a_i, s'_i) is then assigned a weight w_i = \u03c0_\u03b8(a_i | s_i), to compensate for the fact that not all actions would have been selected by \u03c0_\u03b8 with equal probability. The resulting set of N samples s'_i weighted by the w_is is given as input to a statistical density estimator, which uses it to estimate a new density \u03c6'. We assume that the density estimation procedure is a differentiable function of the weights, often a reasonable assumption. \n\nClearly, this \u03a6\u0302 can be used to compute \u03c6\u0302^(t) for any t, and thereby approximate \u03c0_\u03b8's value. However, the gradient computation for \u03a6\u0302 is far from trivial. In particular, to compute the derivative \u2202\u03a6\u0302/\u2202\u03c6, we must consider \u03a6\u0302's behavior for some perturbed \u03c6\u0302_1^(t) other than the one (say, \u03c6\u0302_0^(t)) to which it was applied originally. In this case, an entirely different set of samples would probably have been generated, possibly leading to a very different density. It is hard to see how one could differentiate the result of this perturbation. We propose an alternative solution based on importance sampling. Rather than change the samples, we modify their weights to reflect the change in the probability that they would be generated. \n\nSpecifically, when fitting \u03c6\u0302_1^(t+1), we now define a sample (s_i, a_i, s'_i)'s weight to be \n\nw_i(\u03c6\u0302_1^(t), \u03b8) = \u03c6\u0302_1^(t)(s_i) \u03c0_\u03b8(a_i | s_i) / \u03c6\u0302_0^(t)(s_i). (3) \n\nWe can now compute \u03a6\u0302's derivatives at (\u03b8_0, \u03c6\u0302_0^(t)) with respect to any of its parameters, as required in (2). Let \u03b6 be the vector of parameters (\u03b8, \u03be). Using the chain rule, we have \n\n\u2202\u03a6\u0302[\u03b8](\u03c6\u0302)/\u2202\u03b6 = (\u2202\u03a6\u0302[\u03b8](\u03c6\u0302)/\u2202w) (\u2202w/\u2202\u03b6). \n\nThe first term is the derivative of the estimated density relative to the sample weights (an \u2113 \u00d7 N matrix). The second is the derivative of the weights relative to the parameter vector (an N \u00d7 (m + \u2113) Jacobian), which can easily be computed from (3). \n\nFigure 1: Driving task: (a) DBN model; (b) policy-search/optimization results, average reward (with 1 s.e.) against number of function evaluations. \n\n5 Experimental results \n\nWe tested our approach in two very different domains. The first is an average-reward DBN-MDP problem (shown in Figure 1(a)), where the task is to find a policy for changing lanes when driving on a moderately busy two-lane highway with a slow lane and a fast lane. The model is based on the BAT DBN of [4], the result of a separate effort to build a good model of driver behavior. For simplicity, we assume that the car's speed is controlled automatically, so we are concerned only with choosing the Lateral Action: change lane or drive straight. The observables are shown in the figure: LClr and RClr are the clearance to the next car in each lane (close, medium or far). The agent pays a cost of 1 for each step it is \"blocked\" by (meaning driving close to) the car to its front; it pays a penalty of 0.2 per step for staying in the fast lane. Policies are specified by action probabilities for the 18 possible observation combinations. Since this is a reasonably small number of parameters, we used the simplex search algorithm described earlier to optimize V\u0302[\u03b8]. \n\nThe process mixed quite quickly, so \u03c6\u0302^(20) was a fairly good approximation to \u03c6\u0302^(\u221e). 
\u039e used a fully factored representation of the joint distribution except for a single cluster over the three observables. Evaluations are averages of 300 Monte Carlo trials of 400 steps each. Figure 1(b) shows the estimated and actual average rewards as the policy parameters are evolved over time. The algorithm improved quickly, converging to a very natural policy with the car generally staying in the slow lane, and switching to the fast lane only when necessary to overtake. \n\nIn our second experiment, we used the bicycle simulator of [7]. There are 9 actions corresponding to leaning left/center/right and applying negative/zero/positive torque to the handlebar; the six-dimensional state used in [7] includes variables for the bicycle's tilt angle and orientation, and the handlebar's angle. If the bicycle tilt exceeds \u03c0/15, it falls over and enters an absorbing state. We used policy search over the following space: we selected twelve (simple, manually chosen but not fine-tuned) features x of each state; actions were chosen with a softmax, so that the probability of taking action a_i is exp(x \u00b7 w_i)/\u03a3_j exp(x \u00b7 w_j). As the problem only comes with a generative model of the complicated, nonlinear, noisy bicycle dynamics, we used the stochastic density propagation version of our algorithm, with (stochastic) gradient ascent. Each distribution in \u039e was a mixture of a singleton point consisting of the absorbing state, and a 6-D multivariate Gaussian. \n\nThe first task in this domain was to balance reliably on the bicycle. Using a horizon of T = 200, discount \u03b3 = 0.995, and 600 s_i samples per density propagation step, this was quickly achieved. Next, trying to learn to ride to a goal\u00b2 10m in radius and 1000m away, it also succeeded in finding policies that do so reliably. 
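A 1-D caricature of one stochastic propagation step with a softmax policy may help make the setup concrete. The generative model, features, and weight matrix below are hypothetical stand-ins, not the bicycle dynamics:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_policy(x, W):
    """pi(a_i | x) = exp(x . w_i) / sum_j exp(x . w_j)."""
    z = W @ x
    z -= z.max()                     # stabilize before exponentiating
    p = np.exp(z)
    return p / p.sum()

def gen_model(s, a):
    """Hypothetical black-box generative model for a 1-D state (3 actions)."""
    return s + 0.5 * (a - 1) + 0.1 * rng.standard_normal()

def propagate(mu, sigma, W, n=3000):
    """One stochastic propagation step: sample s_i ~ N(mu, sigma^2), pick
    actions uniformly, sample s_i' from the generative model, weight each
    sample by pi(a_i | s_i) (a simplification of the paper's per-action
    sampling), and refit a Gaussian to the weighted next states."""
    s = mu + sigma * rng.standard_normal(n)
    a = rng.integers(0, 3, size=n)                # actions 0, 1, 2
    x = np.stack([s, np.ones(n)])                 # features: (state, bias)
    probs = np.array([softmax_policy(x[:, i], W)[a[i]] for i in range(n)])
    s_next = np.array([gen_model(s[i], a[i]) for i in range(n)])
    w = probs / probs.sum()
    mu_new = w @ s_next                           # weighted mean
    sigma_new = np.sqrt(w @ (s_next - mu_new) ** 2)
    return mu_new, sigma_new

W = np.array([[0.0, -3.0], [0.0, 0.0], [0.0, 3.0]])  # strongly prefers action 2
mu1, sigma1 = propagate(0.0, 1.0, W)
```

Because the weights make action 2 dominate, the refit Gaussian's mean drifts toward the +0.5 step that action induces, even though actions were sampled uniformly.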
Formal evaluation is difficult, but this is a sufficiently hard problem that even finding a solution can be considered a success. There was also some slight parameter sensitivity (and the best results were obtained only with \u03c6\u0302^(0) picked/fit with some care, using in part data from earlier and less successful trials, to be \"representative\" of a fairly good rider's state distribution), but using this algorithm, we were able to obtain solutions with median riding distances under 1.1km to the goal. This is significantly better than the results of [7] (obtained in the learning rather than planning setting, and using a value-function approximation solution), which reported much larger riding distances to the goal of about 7km, and a single \"best-ever\" trial of about 1.7km. \n\n6 Conclusions \n\nWe have presented two new variants of algorithms for performing direct policy search in the deterministic and stochastic density propagation settings. Our empirical results have also shown these methods working well on two large problems. \n\nAcknowledgements. We warmly thank Kevin Murphy for use of and help with his Bayes Net Toolbox, and Jette Randl\u00f8v and Preben Alstr\u00f8m for use of their bicycle simulator. A. Ng is supported by a Berkeley Fellowship. The work of D. Koller and R. Parr is supported by the ARO-MURI program \"Integrated Approach to Intelligent Systems\", DARPA contract DACA76-93-C-0025 under subcontract to IET, Inc., ONR contract N66001-97-C-8554 under DARPA's HPKB program, the Sloan Foundation, and the Powell Foundation. \n\nReferences \n\n[1] L. Baird and A.W. Moore. Gradient descent for general reinforcement learning. In NIPS 11, 1999. \n\n[2] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. J. Artificial Intelligence Research, 1999. \n\n[3] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In Proc. 
UAI, pages 33-42, 1998. \n\n[4] J. Forbes, T. Huang, K. Kanazawa, and S.J. Russell. The BATmobile: Towards a Bayesian automated taxi. In Proc. IJCAI, 1995. \n\n[5] D. Koller and R. Fratkina. Using learning for approximation in stochastic processes. In Proc. ICML, pages 287-295, 1998. \n\n[6] N. Meuleau, L. Peshkin, K.-E. Kim, and L.P. Kaelbling. Learning finite-state controllers for partially observable environments. In Proc. UAI 15, 1999. \n\n[7] J. Randl\u00f8v and P. Alstr\u00f8m. Learning to drive a bicycle using reinforcement learning and shaping. In Proc. ICML, 1998. \n\n[8] J.K. Williams and S. Singh. Experiments with an algorithm which learns stochastic memoryless policies for POMDPs. In NIPS 11, 1999. \n\n[9] R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229-256, 1992. \n\n\u00b2For these experiments, we found learning could be accomplished faster with the simulator's integration delta-time constant tripled for training. This and \"shaping\" reinforcements (chosen to reward progress made towards the goal) were both used, and training was with the bike \"infinitely distant\" from the goal. For this and the balancing experiments, sampling from the fallen/absorbing-state portion of the distributions \u03c6\u0302^(t) is obviously inefficient use of samples, so all samples were drawn from the non-absorbing state portion (i.e. the Gaussian, also with its tails corresponding to tilt angles greater than \u03c0/15 truncated), and weighted accordingly relative to the absorbing-state portion. \n", "award": [], "sourceid": 1748, "authors": [{"given_name": "Andrew", "family_name": "Ng", "institution": null}, {"given_name": "Ronald", "family_name": "Parr", "institution": null}, {"given_name": "Daphne", "family_name": "Koller", "institution": null}]}