{"title": "Particle Filter-based Policy Gradient in POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 337, "page_last": 344, "abstract": "Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency.", "full_text": "Particle Filter-based Policy Gradient in POMDPs\n\nPierre-Arnaud Coquelin\nCMAP, Ecole Polytechnique\ncoquelin@cmapx.polytechnique.fr\n\nRomain Deguest\u2217\nCMAP, Ecole Polytechnique\ndeguest@cmapx.polytechnique.fr\n\nR\u00e9mi Munos\nINRIA Lille - Nord Europe, SequeL project\nremi.munos@inria.fr\n\nAbstract\n\nOur setting is a Partially Observable Markov Decision Process with continuous\nstate, observation and action spaces. Decisions are based on a Particle Filter for\nestimating the belief state given past observations. We consider a policy gradient\napproach for parameterized policy optimization. For that purpose, we investigate\nsensitivity analysis of the performance measure with respect to the parameters of\nthe policy, focusing on Finite Difference (FD) techniques. We show that the naive\nFD is subject to variance explosion because of the non-smoothness of the resampling\nprocedure. We propose a more sophisticated FD method which overcomes\nthis problem and establish its consistency.\n\n1 Introduction\n\nWe consider a Partially Observable Markov Decision Problem (POMDP) (see e.g. 
(Lovejoy, 1991;\nKaelbling et al., 1998)) defined by a state process (Xt)t\u22651 \u2208 X, an observation process (Yt)t\u22651 \u2208\nY , a decision (or action) process (At)t\u22651 \u2208 A which depends on a policy (mapping from all possible\nobservation histories to actions), and a reward function r : X \u2192 R. Our goal is to find a policy\n\u03c0 that maximizes a performance measure J(\u03c0), a function of future rewards, for example in a finite\nhorizon setting:\n\n$J(\\pi) \\stackrel{\\mathrm{def}}{=} \\mathbb{E}\\Big[\\sum_{t=1}^{n} r(X_t)\\Big]. \\qquad (1)$\n\nOther performance measures (such as in infinite horizon with discounted rewards) could be handled\nas well. In this paper, we consider the case of continuous state, observation, and action spaces.\nThe state process is a Markov decision process taking its values in a (measurable) state space X,\nwith initial probability measure \u00b5 \u2208 M(X) (i.e. X1 \u223c \u00b5), and which can be simulated using a\ntransition function F and independent random numbers, i.e. for all t \u2265 1,\n\n$X_{t+1} = F(X_t, A_t, U_t), \\quad \\text{with } U_t \\stackrel{\\mathrm{i.i.d.}}{\\sim} \\nu, \\qquad (2)$\n\nwhere F : X \u00d7 A \u00d7 U \u2192 X and (U, \u03c3(U ), \u03bd) is a probability space. In many practical situations\n$U = [0, 1]^p$ and Ut is a p-tuple of pseudo random numbers. For simplicity, we adopt the notations\nF (x0, a0, u) def= F\u00b5(u), where F\u00b5 is the first transition function (i.e. X1 = F\u00b5(U0) with U0 \u223c \u03bd).\nThe observation process (Yt)t\u22651 lies in a (measurable) space Y and is linked with the state process\nby the conditional probability measure P(Yt \u2208 dyt|Xt = xt) = g(xt, yt) dyt, where g : X \u00d7 Y \u2192\n[0, 1] is the marginal density function of Yt given Xt. We assume that observations are conditionally\nindependent given the state process. Here also, we assume that we can simulate an observation\nusing a transition function G and independent random numbers, i.e. 
$\\forall t \\geq 1,\\; Y_t = G(X_t, V_t)$, where $V_t \\stackrel{\\mathrm{i.i.d.}}{\\sim} \\nu$ (for the sake of simplicity we consider the same probability space (U, \u03c3(U ), \u03bd)).\n\n\u2217Also affiliated to Columbia University\n\nNow, the action process (At)t\u22651 depends on a policy \u03c0 which assigns to each possible observation\nhistory Y1:t (where we adopt the usual notation \u201c1 : t\u201d to denote the collection of integers s such that\n1 \u2264 s \u2264 t), an action At \u2208 A.\nIn this paper we will consider policies that depend on the belief state (also called filtering distribution)\nconditional on past observations. The belief state, written bt, belongs to M(X) (the space\nof all probability measures on X) and is defined by bt(dxt, Y1:t) def= P(Xt \u2208 dxt|Y1:t), and will be\nwritten bt(dxt) or even bt for simplicity when there is no risk of confusion. Because of the Markov\nproperty of the state dynamics, the belief state bt(\u00b7, Y1:t) is the most informative representation about\nthe current state Xt given the history of past observations Y1:t. It represents sufficient statistics for\ndesigning an optimal policy in the class of observation-based policies.\n\nThe temporal and causal dependencies of the dynamics of a generic POMDP using belief-based\npolicies are summarized in Figure 1 (left): at time t, the state Xt is unknown, only Yt is observed,\nwhich makes it possible (at least in theory) to update bt based on the previous belief bt\u22121. The policy \u03c0 takes\nas input the belief state bt and returns an action At (the policy may be deterministic or stochastic).\nHowever, since the belief state is an infinite-dimensional object, and thus cannot be represented in\na computer, we first simplify the class of policies that we consider here to be defined over a finite-dimensional\nspace of belief-features f : M(X) \u2192 RK which represents relevant statistics of the\nfiltering distribution. 
We write bt(fk) for the value of the k-th feature (among K) (where we use the\nusual notation $b(f) \\stackrel{\\mathrm{def}}{=} \\int_X f(x)\\, b(dx)$ for any function f defined on X and measure b \u2208 M(X)),\nand denote bt(f ) the vector (of size K) with components bt(fk). Examples of features are: f (x) = x\n(mean value), f (x) = x\u2032x (for the covariance matrix). Other more complex features (e.g. entropy\nmeasure) could be used as well. Such a policy \u03c0 : RK \u2192 A selects an action At = \u03c0(bt(f )), which\nin turn, yields a new state Xt+1.\nExcept for simple cases, such as in finite-state finite-observation processes (where a Viterbi algorithm\ncould be applied (Rabiner, 1989)), and the case of linear dynamics and Gaussian noise (where\na Kalman filter could be used), there is no closed-form representation of the belief state. Thus bt\nmust be approximated in our general setting. A popular method for approximating the filtering\ndistribution is known as Particle Filters (PF) (also called Interacting Particle Systems or Sequential\nMonte-Carlo). Such particle-based approaches have been used in many applications (see e.g.\n(Doucet et al., 2001) and (Del Moral, 2004) for a Feynman-Kac framework) for example for parameter\nestimation in Hidden Markov Models and control (Andrieu et al., 2004) and mobile robot\nlocalization (Fox et al., 2001). A PF approximates the belief state bt \u2208 M(X) by a set of particles\n$(x_t^{1:N})$ (points of X), which are updated sequentially at each new observation by a transition-selection\nprocedure. In particular, the belief feature bt(f ) is approximated by $\\frac{1}{N}\\sum_{i=1}^{N} f(x_t^i)$, and\nthe policy is thus a function that takes as input the activation of the feature f at the position of\nthe particles: $A_t = \\pi\\big(\\frac{1}{N}\\sum_{i=1}^{N} f(x_t^i)\\big)$. For such methods, the general scheme for POMDPs using\nParticle Filter-based policies is described in Figure 1 (right).\nIn this paper, we consider a class of policies \u03c0\u03b8 parameterized by a (multi-dimensional) parameter\n\u03b8 and we search for the value of \u03b8 that maximizes the resulting criterion J(\u03c0\u03b8), now written J(\u03b8)\nfor simplicity. We focus on a policy gradient approach: the POMDP is replaced by an optimization\nproblem on the space of policy parameters, and a (stochastic) gradient ascent on J(\u03b8) is considered.\nFor that purpose (and this is the object of this work) we investigate the estimation of \u2207J(\u03b8) (where\nthe gradient \u2207 refers to the derivative w.r.t. \u03b8), with an emphasis on Finite-Difference techniques.\nThere are many works about such policy gradient approaches in the field of Reinforcement Learning,\nsee e.g. (Baxter & Bartlett, 1999), but the policies considered are generally not based on the result of\na PF. Here, we explicitly consider a class of policies that are based on a belief state constructed by a\nPF. Our motivations for investigating this case are based on two facts: (1) the belief state represents\nsufficient statistics for optimality, as mentioned above. (2) PFs are a very popular and efficient tool\nfor constructing the belief state in continuous domains.\n\nAfter recalling the general approach for evaluating the performance of a PF-based policy (Section 2),\nwe describe (in Section 3.1) a naive Finite-Difference (FD) approach (defined by a step size h) for\nestimating \u2207J(\u03b8). We discuss the bias and variance tradeoff and explain the problem of variance\nexplosion when h is small. This problem is a consequence of the discontinuity of the resampling\noperation w.r.t. the parameter \u03b8. 
Our contribution is detailed in Section 3.2: We propose a modified\nFD estimate for \u2207J(\u03b8) which (along the random sample path) has bias $O(h^2)$ and variance O(1/N ),\nand thus overcomes the drawback of the previous naive method. An algorithm is described and illustrated\nin Section 4 on a simple problem where the optimal policy exhibits a tradeoff between greedy reward\noptimization and localization.\n\n[Figure 1: two dependency diagrams over time steps t\u22121, t, t+1. Left panel rows: Reward, State, Observation, Belief state, Belief features, Policy, Action. Right panel rows: Reward, State, Observation, Particles, Features, Policy, Action.]\n\nFigure 1: Left figure: Causal and temporal dependencies in a POMDP. Right figure: PF-based\nscheme for POMDPs where the belief feature bt(f ) is approximated by $\\frac{1}{N}\\sum_{i=1}^{N} f(x_t^i)$.\n\n2 Particle Filters (PF)\n\nWe first describe a generic PF for estimating the belief state based on past observations. In Subsection\n2.1 we detail how to control a real-world POMDP and in Subsection 2.2 how to estimate\nthe performance of a given policy in simulation. In both cases, we assume that the models of the\ndynamics (state, observation) are known. The basic PF, called Bootstrap Filter, see (Doucet et al.,\n2001) for details, approximates the belief state bn by an empirical distribution $b_n^N \\stackrel{\\mathrm{def}}{=} \\sum_{i=1}^{N} w_n^i \\delta_{x_n^i}$\n(where \u03b4 denotes a Dirac distribution) made of N particles $x_n^{1:N}$. It consists in iterating the two\nfollowing steps: at time t, given observation yt,\n\n\u2022 Transition step: (also called importance sampling or mutation) a successor particle\npopulation $\\tilde{x}_t^{1:N}$ is generated according to the state dynamics from the previous population\n$x_{t-1}^{1:N}$. The (importance sampling) weights $w_t^{1:N} \\stackrel{\\mathrm{def}}{=} \\frac{g(\\tilde{x}_t^{1:N}, y_t)}{\\sum_{j=1}^{N} g(\\tilde{x}_t^{j}, y_t)}$ are evaluated,\n\u2022 Selection step: Resample (with replacement) N particles $x_t^{1:N}$ from the set $\\tilde{x}_t^{1:N}$ according\nto the weights $w_t^{1:N}$. We write $x_t^{1:N} \\stackrel{\\mathrm{def}}{=} \\tilde{x}_t^{k_t^{1:N}}$ where $k_t^{1:N}$ are the selection indices.\n\nResampling is used to avoid the problem of degeneracy of the algorithm, i.e. that most of the weights\ndecrease to zero. It consists in selecting new particle positions such as to preserve a consistency\nproperty (i.e. $\\sum_{i=1}^{N} w_t^i \\phi(\\tilde{x}_t^i) = \\mathbb{E}\\big[\\frac{1}{N}\\sum_{i=1}^{N} \\phi(x_t^i)\\big]$). The simplest version introduced in (Gordon\net al., 1993) chooses the selection indices $k_t^{1:N}$ by an independent sampling from the set 1 : N\naccording to a multinomial distribution with parameters $w_t^{1:N}$, i.e. $P(k_t^i = j) = w_t^j$, for all 1 \u2264\ni \u2264 N. The idea is to replicate the particles in proportion to their weights. Many variants have been\nproposed in the literature, among which the stratified resampling method (Kitagawa, 1996) which is\noptimal in terms of variance, see e.g. (Capp\u00e9 et al., 2005).\n\nConvergence issues of $b_n^N(f)$ to $b_n(f)$ (e.g. Law of Large Numbers or Central Limit Theorems) are\ndiscussed in (Del Moral, 2004) or (Douc & Moulines, 2008). 
For our purpose we note that under\nweak conditions on the feature f, we have the consistency property: bN (f ) \u2192 b(f ), almost surely.\n\n2.1 Control of a real system by a PF-based policy\nWe describe in Algorithm 1 how one may use a PF-based policy \u03c0\u03b8 for the control of a real-world\nsystem. Note that from our definition of F\u00b5, the particles are initialized with: $\\tilde{x}_1^{1:N} \\stackrel{\\mathrm{iid}}{\\sim} \\mu$.\n\nAlgorithm 1 Control of a real-world POMDP\n\nfor t = 1 to n do\n  Observe: $y_t$,\n  Particle transition step:\n    Set $\\tilde{x}_t^{1:N} = F(x_{t-1}^{1:N}, a_{t-1}, u_{t-1}^{1:N})$ with $u_{t-1}^{1:N} \\stackrel{\\mathrm{iid}}{\\sim} \\nu$. Set $w_t^{1:N} = \\frac{g(\\tilde{x}_t^{1:N}, y_t)}{\\sum_{j=1}^{N} g(\\tilde{x}_t^{j}, y_t)}$,\n  Particle resampling step:\n    Set $x_t^{1:N} = \\tilde{x}_t^{k_t^{1:N}}$ where $k_t^{1:N}$ are given by the selection step according to the weights $w_t^{1:N}$,\n  Select action: $a_t = \\pi_\\theta\\big(\\frac{1}{N}\\sum_{i=1}^{N} f(x_t^i)\\big)$,\nend for\n\n2.2 Estimation of J(\u03b8) in simulation\nNow, for the purpose of policy optimization, one should be capable of evaluating the performance\nof a policy in simulation. J(\u03b8), defined by (1), may be estimated in simulation provided that\nthe dynamics of the state and observation are known. Making explicit the dependency w.r.t. the\nrandom sample path, written \u03c9 (which accounts for the state and observation stochastic dynamics\nand the random numbers used in the PF-based policy), we write J(\u03b8) = E\u03c9[J\u03c9(\u03b8)], where\n$J_\\omega(\\theta) \\stackrel{\\mathrm{def}}{=} \\sum_{t=1}^{n} r(X_{t,\\omega}(\\theta))$, making the dependency of the state w.r.t. \u03c9 and \u03b8 explicit.\n\nAlgorithm 2 describes how to evaluate a PF-based policy in simulation. The function returns an\nestimate, written $J_\\omega^N(\\theta)$, of J\u03c9(\u03b8). Using previously mentioned asymptotic convergence results\nfor PF, one has $\\lim_{N\\to\\infty} J_\\omega^N(\\theta) = J_\\omega(\\theta)$, almost surely (a.s.). In order to approximate J(\u03b8), one\nwould perform several calls to the algorithm, receiving $J_{\\omega_m}^N(\\theta)$ (for 1 \u2264 m \u2264 M), and calculate\ntheir empirical mean $\\frac{1}{M}\\sum_{m=1}^{M} J_{\\omega_m}^N(\\theta)$, which tends to J(\u03b8) a.s., when M, N \u2192 \u221e.\n\nAlgorithm 2 Estimation of J\u03c9(\u03b8) in simulation\n\nfor t = 1 to n do\n  Define state:\n    $x_t = F(x_{t-1}, a_{t-1}, u_{t-1})$ with $u_{t-1} \\sim \\nu$,\n  Define observation:\n    $y_t = G(x_t, v_t)$ with $v_t \\sim \\nu$,\n  Particle transition step:\n    Set $\\tilde{x}_t^{1:N} = F(x_{t-1}^{1:N}, a_{t-1}, u_{t-1}^{1:N})$ with $u_{t-1}^{1:N} \\stackrel{\\mathrm{iid}}{\\sim} \\nu$. Set $w_t^{1:N} = \\frac{g(\\tilde{x}_t^{1:N}, y_t)}{\\sum_{j=1}^{N} g(\\tilde{x}_t^{j}, y_t)}$,\n  Particle resampling step:\n    Set $x_t^{1:N} = \\tilde{x}_t^{k_t^{1:N}}$ where $k_t^{1:N}$ are given by the selection step according to the weights $w_t^{1:N}$,\n  Select action: $a_t = \\pi_\\theta\\big(\\frac{1}{N}\\sum_{i=1}^{N} f(x_t^i)\\big)$,\nend for\nReturn $J_\\omega^N(\\theta) \\stackrel{\\mathrm{def}}{=} \\sum_{t=1}^{n} r(x_t)$.\n\n3 A policy gradient approach\n\nNow we want to optimize the value of the parameter in simulation. Then, once a \u201cgood\u201d parameter\n\u03b8\u2217 is found, we would use Algorithm 1 to control the real system using the corresponding PF-based\npolicy \u03c0\u03b8\u2217. Gradient approaches have been studied in the field of continuous space Hidden Markov\nModels in (Fichoud et al., 2003; C\u00e9rou et al., 2001; Doucet & Tadic, 2003). The authors have\nused a likelihood ratio approach to evaluate \u2207J(\u03b8). Such methods suffer from high variance, in\nparticular for problems with small noise. In order to reduce the variance, it has been proposed in\n(Poyadjis et al., 2005) to use a marginal particle filter instead of a simple path-based particle filter.\nThis approach is efficient in terms of variance reduction but its computational complexity is $O(N^2)$.\nHere we investigate a pathwise (i.e. 
along the random sample path \u03c9) sensitivity analysis of J\u03c9(\u03b8)\n(w.r.t. \u03b8) for the purpose of (stochastic) gradient optimization. We start with a naive Finite Difference\n(FD) approach and show the problem of variance explosion. Then we provide an alternative, called\ncommon indices FD, which overcomes this problem.\n\nIn the sequel, we make the assumptions that all relevant functions (F , g, f, \u03c0) are continuously\ndifferentiable w.r.t. their respective variables. Note that although this is not explicitly mentioned, all\nsuch functions may depend on time.\n\n3.1 Naive Finite-Difference (FD) method\nLet us consider the derivative of J(\u03b8) component-wise, writing \u2202J(\u03b8) the derivative of J(\u03b8) w.r.t. a\none-dimensional parameter. If the parameter \u03b8 is multi-dimensional, the derivative will be calculated\nin each direction. For h > 0 we define the centered finite-difference quotient $I_h \\stackrel{\\mathrm{def}}{=} \\frac{J(\\theta+h) - J(\\theta-h)}{2h}$.\nSince J(\u03b8) is differentiable then $\\lim_{h\\to 0} I_h = \\partial J(\\theta)$. Consequently, a method for approximating\n\u2202J(\u03b8) would consist in estimating Ih for a sufficiently small h. We know that J(\u03b8) can be numerically\nestimated by $\\frac{1}{M}\\sum_{m=1}^{M} J_{\\omega_m}^N(\\theta)$. Thus, it seems natural to estimate Ih by\n\n$I_h^{N,M} \\stackrel{\\mathrm{def}}{=} \\frac{1}{2h}\\Big[\\frac{1}{M}\\sum_{m=1}^{M} J_{\\omega_m}^N(\\theta+h) - \\frac{1}{M}\\sum_{m'=1}^{M} J_{\\omega_{m'}}^N(\\theta-h)\\Big]$\n\nwhere we used independent random numbers to evaluate J(\u03b8 + h) and J(\u03b8 \u2212 h). From the consistency\nof the PF, we deduce that $\\lim_{h\\to 0}\\lim_{M,N\\to\\infty} I_h^{N,M} = \\partial J(\\theta)$. This naive FD estimate\nexhibits the following bias-variance tradeoff (see (Coquelin et al., 2008) for the proof):\nProposition 1 (Bias-variance trade-off). Assume that J(\u03b8) is three times continuously differentiable\nin a small neighborhood of \u03b8, then the asymptotic (when N \u2192 \u221e) bias of the naive FD estimate\n$I_h^{N,M}$ is of order $O(h^2)$ and its variance is $O(N^{-1}M^{-1}h^{-2})$.\n\nIn order to reduce the bias, one should choose a small h, but then the variance would blow up.\nAdditional computational resources (a larger number of particles N) will help control the variance.\nHowever, in practice, e.g. for stochastic optimization, this leads to an intractable amount of\ncomputational effort since any consistent FD-based optimization algorithm (e.g. such as the Kiefer-Wolfowitz\nalgorithm) will need to consider a sequence of steps h that decreases with the number of\ngradient iterations. But if the number of particles is bounded, the variance term will diverge, which\nmay prevent the stochastic gradient algorithm from converging to a local optimum.\n\nIn order to reduce the variance of the previous estimator when h is small, one may use common\nrandom numbers to estimate both J(\u03b8 + h) and J(\u03b8 \u2212 h) (i.e. \u03c9m = \u03c9m\u2032). The variance then\nreduces to $O(N^{-1}M^{-1}h^{-1})$ (see e.g. (Glasserman, 2003)), which still explodes for small h.\nNow, under the additional assumption that along almost all random sample paths \u03c9, the function\n$\\theta \\mapsto J_\\omega^N(\\theta)$ is a.s. continuous, then the variance would reduce to $O(N^{-1}M^{-1})$ (see Section (7.1)\nof (Glasserman, 2003)). Unfortunately, this is not the case here because of the discontinuity of the\nPF resampling operation w.r.t. \u03b8. 
Indeed, for a fixed \u03c9, the selection indices $k_t^{1:N}$ (taking values in\na finite set 1 : N) are usually a non-smooth function of the weights $w_t^{1:N}$, which depend on \u03b8.\nTherefore the naive FD method using PF cannot be applied in general because of variance explosion\nof the estimate when h is small, even when using common random numbers.\n\n3.2 Common-indices Finite-Difference method\n\nLet us consider $J_\\omega(\\theta) = \\sum_{t=1}^{n} r(X_{t,\\omega}(\\theta))$ making explicit the dependency of the state w.r.t. \u03b8 and a\nrandom sample path \u03c9. Under our assumptions, the gradient \u2202J\u03c9(\u03b8) is well defined. Now, let us fix\n\u03c9. For clarity, we now omit to write the \u03c9 dependency when no confusion is possible. The function\n$\\theta \\mapsto X_t(\\theta)$ (for any 1 \u2264 t < n) is smooth because all transition functions are smooth, the policy is\nsmooth, and the belief state bt is smooth w.r.t. \u03b8. Underlining the dependency of the belief feature bt,\u03b8(f )\nw.r.t. \u03b8, we write:\n\n$\\theta \\;\\stackrel{\\text{smooth}}{\\longmapsto}\\; b_{t,\\theta}(f) \\;\\stackrel{\\text{smooth}}{\\longmapsto}\\; X_t(\\theta) \\;\\stackrel{\\text{smooth}}{\\longmapsto}\\; J_\\omega(\\theta).$\n\nAs already mentioned, the problem with the naive FD method is that the PF estimate $b_{t,\\theta}^N(f) = \\frac{1}{N}\\sum_{i=1}^{N} f(x_t^i(\\theta))$\nof bt,\u03b8(f ) is not smooth w.r.t. \u03b8 because it depends on the selection indices\n$k_{1:t}^{1:N}(\\theta)$ which, taken as a function of \u03b8 (through the weights), are not continuous. We write\n\n$\\theta \\;\\stackrel{\\text{non-smooth}}{\\longmapsto}\\; b_{t,\\theta}^N(f) = \\frac{1}{N}\\sum_{i=1}^{N} f(x_t^i(\\theta)) \\;\\stackrel{\\text{smooth}}{\\longmapsto}\\; J_\\omega^N(\\theta).$\n\nSo a natural idea to recover continuity in a FD method would consist in using exactly the same\nselection indices for quantities related to \u03b8 + h and \u03b8 \u2212 h. However, using the same indices means\nusing the same weights during the selection procedure for both trajectories. 
But this would lead to\na wrong estimator because the weights strongly depend on \u03b8 through the observation function g.\n\nOur idea is thus to use the same selection indices but use a likelihood ratio in the belief feature\nestimation. More precisely, let us write $k_t^{1:N}(\\theta)$ the selection indices obtained for parameter \u03b8, and\nconsider a parameter \u03b8\u2032 in a small neighborhood of \u03b8. Then, a PF estimate for bt,\u03b8\u2032 (f ) is\n\n$b_{t,\\theta'}^N(f) \\stackrel{\\mathrm{def}}{=} \\sum_{i=1}^{N} \\frac{l_t^i(\\theta,\\theta')}{\\sum_{j=1}^{N} l_t^j(\\theta,\\theta')} f(x_t^i(\\theta')), \\quad \\text{with } l_t^i(\\theta,\\theta') \\stackrel{\\mathrm{def}}{=} \\frac{\\prod_{s=1}^{t} g(x_s^i(\\theta'), y_s(\\theta'))}{\\prod_{s=1}^{t} g(x_s^i(\\theta), y_s(\\theta))} \\qquad (3)$\n\nbeing the likelihood ratios computed along the particle paths, and where the particles $x_{1:t}^{1:N}(\\theta')$ have\nbeen generated using the same selection indices $k_{1:t}^{1:N}(\\theta)$ (and the same random sample path \u03c9) as\nthose used for \u03b8. The next result states the consistency of this estimate and is our main contribution\n(see (Coquelin et al., 2008) for the proof).\nProposition 2. Under weak conditions on f (see e.g. (Del Moral & Miclo, 2000)), there exists a\nneighborhood of \u03b8, such that for any \u03b8\u2032 in this neighborhood, $b_{t,\\theta'}^N(f)$ defined by (3) is a consistent\nestimator of bt,\u03b8\u2032(f ), i.e. $\\lim_{N\\to\\infty} b_{t,\\theta'}^N(f) = b_{t,\\theta'}(f)$ almost surely.\n\nThus, for any perturbed value \u03b8\u2032 around \u03b8, we may run a PF where in the resampling step, we\nuse the same selection indices $k_{1:n}^{1:N}(\\theta)$ as those obtained for \u03b8. Thus the mapping $\\theta' \\mapsto b_{t,\\theta'}^N(f)$ is\nsmooth. We write:\n\n$\\theta' \\;\\stackrel{\\text{smooth}}{\\longmapsto}\\; b_{t,\\theta'}^N(f) \\text{ defined by (3)} \\;\\stackrel{\\text{smooth}}{\\longmapsto}\\; J_\\omega^N(\\theta').$\n\nFrom the previous proposition we deduce that $J_\\omega^N(\\theta)$ is a consistent estimator for J\u03c9(\u03b8).\nA possible implementation for the gradient estimation is described by Algorithm 3. The algorithm\nworks by updating 3 families of state, observation, and particle populations, denoted by\n\u2019+\u2019, \u2019-\u2019, and \u2019o\u2019 for the values of the parameter \u03b8 + h, \u03b8 \u2212 h, and \u03b8 respectively. For the\nperformance measure defined by (1), the algorithm returns the common indices FD estimator:\n$\\partial_h J_\\omega^N \\stackrel{\\mathrm{def}}{=} \\frac{1}{2h}\\sum_{t=1}^{n} \\big(r(x_t^+) - r(x_t^-)\\big)$ where $x_{1:n}^+$ and $x_{1:n}^-$ are upper and lower trajectories simulated\nunder the random sample path \u03c9. Note that although the selection indices are the same, the particle\npopulations \u2019+\u2019, \u2019-\u2019, and \u2019o\u2019 are different, but very close (when h is small). Hence the likelihood\nratios $l_t^{1:N}$ converge to 1 when h \u2192 0, which avoids a source of variance when h is small.\n\nThe resulting estimator $\\partial_h^M J_\\omega^N \\stackrel{\\mathrm{def}}{=} \\frac{1}{M}\\sum_{m=1}^{M} \\partial_h J_{\\omega_m}^N$ for \u2202J(\u03b8) would calculate an average over M\nsample paths \u03c91:M of the return of Algorithm 3 called M times. This estimator overcomes the\ndrawbacks of the naive FD estimate: Its asymptotic bias is of order $O(h^2)$ (like any centered FD\nscheme) but its variance is of order $O(N^{-1}M^{-1})$ (the Central Limit Theorem applies to the belief\nfeature estimator (3) thus to $\\partial_h J_\\omega^N$ as well). Since the variance does not degenerate when h is small,\none should choose h as small as possible to reduce the mean-squared estimation error.\nThe complexity of Algorithm 3 is linear in the number of particles N. 
Note that in the current\nimplementation we used 3 populations of particles per derivative. Of course, we could consider a\nnon-centered FD scheme approximating the derivative with $\\frac{J(\\theta+h) - J(\\theta)}{h}$, which is of first order but\nwhich only requires 2 particle populations. If the parameter is multidimensional, the full gradient\nestimate could be obtained by using K + 1 populations of particles. Of course, in gradient ascent\nmethods, such FD gradient estimate may be advantageously combined with clever techniques such\nas simultaneous perturbation stochastic approximation (Spall, 2000), conjugate or second-order gradient\napproaches.\nNote that when h \u2192 0, our estimator converges to an Infinitesimal Perturbation Analysis (IPA)\nestimator (Glasserman, 1991). The same ideas as those presented above could be used to derive an\nIPA estimator. The advantage of IPA is that it would use one population of particles only (for the\nfull gradient) which may be interesting when the number of parameters K is large. However, the\nmain drawback is that this approach would require computing analytically the derivatives of all the\nfunctions w.r.t. their respective variables, which may be time consuming for the programmer.\n\n4 Numerical Experiment\nBecause of space constraints, our purpose here is simply to illustrate numerically the theoretical\nfindings of previous FD methods (in terms of bias-variance contributions) rather than to provide a\nfull example of POMDP policy optimization. We consider a very simple navigation task for a 2d\nrobot. The robot is defined by its coordinates $x_t \\in \\mathbb{R}^2$. 
The observation is a noisy measurement\nof the squared distance to the origin (the goal): $y_t \\stackrel{\\mathrm{def}}{=} \\|x_t\\|^2 + v_t$, where $v_t \\stackrel{\\mathrm{i.i.d.}}{\\sim} \\mathcal{N}(0, \\sigma_y^2)$ ($\\sigma_y^2$ is\nthe variance of the noise). At each time step, the agent may choose a direction at (with ||at|| = 1),\nwhich results in moving the state, by a step d, in the corresponding direction: $x_{t+1} = x_t + d\\,a_t + u_t$,\nwhere $u_t \\stackrel{\\mathrm{iid}}{\\sim} \\mathcal{N}(0, \\sigma_x^2 I)$ is an additive noise. The initial state x1 is drawn from \u03bd, a uniform\ndistribution over the square $[-1, 1]^2$. We consider a class of policies that depend on a single belief\nfeature: the mean of the belief state (i.e. f (x) = x). The PF-based policy thus uses the barycenter of\nthe particle population $m_t \\stackrel{\\mathrm{def}}{=} \\frac{1}{N}\\sum_{i=1}^{N} x_t^i$. Let us write $m^\\perp$ the +90\u00b0 rotation of a vector m. We\nconsider policies $\\pi_\\theta(m) = \\frac{-(1-\\theta)m + \\theta m^\\perp}{\\|-(1-\\theta)m + \\theta m^\\perp\\|}$ parameterized by \u03b8 \u2208 [0, 1]. The chosen action is thus\nat = \u03c0\u03b8(mt). If the robot was well localized (i.e. mt close to xt), then the policy \u03c0\u03b8=0 would move\nthe robot towards the direction of the goal, whereas \u03c0\u03b8=1 would move it in an orthogonal direction.\nThe performance measure (to be minimized) is defined as $J(\\theta) = \\mathbb{E}[\\|x_n\\|^2]$, where n is a fixed time.\nWe plot in Figure 2 the performance and gradient estimation obtained when running Algorithms 2\nand 3, respectively. We used the numerical values: $N = 10^3$, $M = 10^2$, $h = 10^{-6}$, n = 10,\n\u03c3x = 0.05, \u03c3y = 0.05, d = 0.1.\n\nAlgorithm 3 Common-indices Finite Difference estimate of \u2202J\u03c9\n\nInitialize likelihood ratios:\n  Set $l_0^{1:N,+} = 1$, $l_0^{1:N,-} = 1$,\nfor t = 1 to n do\n  State processes: Sample $u_{t-1} \\sim \\nu$ and\n    Set $x_t^o = F(x_{t-1}^o, a_{t-1}^o, u_{t-1})$, set $x_t^+ = F(x_{t-1}^+, a_{t-1}^+, u_{t-1})$, set $x_t^- = F(x_{t-1}^-, a_{t-1}^-, u_{t-1})$,\n  Observation processes: Sample $v_t \\sim \\nu$ and\n    Set $y_t^o = G(x_t^o, v_t)$, set $y_t^+ = G(x_t^+, v_t)$, set $y_t^- = G(x_t^-, v_t)$,\n  Particle transition step: Draw $u_{t-1}^{1:N} \\stackrel{\\mathrm{iid}}{\\sim} \\nu$ and\n    Set $\\tilde{x}_t^{1:N,o} = F(x_{t-1}^{1:N,o}, a_{t-1}^o, u_{t-1}^{1:N})$,\n    Set $\\tilde{x}_t^{1:N,+} = F(x_{t-1}^{1:N,+}, a_{t-1}^+, u_{t-1}^{1:N})$, set $\\tilde{x}_t^{1:N,-} = F(x_{t-1}^{1:N,-}, a_{t-1}^-, u_{t-1}^{1:N})$,\n    Set $w_t^{1:N} = \\frac{g(\\tilde{x}_t^{1:N,o}, y_t^o)}{\\sum_{j=1}^{N} g(\\tilde{x}_t^{j,o}, y_t^o)}$,\n    Set $l_t^{1:N,+} = \\frac{g(\\tilde{x}_t^{1:N,+}, y_t^+)}{g(\\tilde{x}_t^{1:N,o}, y_t^o)}\\, l_{t-1}^{1:N,+}$, set $l_t^{1:N,-} = \\frac{g(\\tilde{x}_t^{1:N,-}, y_t^-)}{g(\\tilde{x}_t^{1:N,o}, y_t^o)}\\, l_{t-1}^{1:N,-}$,\n  Particle resampling step:\n    Let $k_t^{1:N}$ be the selection indices obtained from the weights $w_t^{1:N}$,\n    Set $x_t^{1:N,o} = \\tilde{x}_t^{k_t^{1:N},o}$, set $x_t^{1:N,+} = \\tilde{x}_t^{k_t^{1:N},+}$, set $x_t^{1:N,-} = \\tilde{x}_t^{k_t^{1:N},-}$,\n    Set $l_t^{1:N,+} = l_t^{k_t^{1:N},+}$, set $l_t^{1:N,-} = l_t^{k_t^{1:N},-}$,\n  Actions:\n    Set $a_t^o = \\pi_\\theta\\big(\\frac{1}{N}\\sum_{i=1}^{N} f(x_t^{i,o})\\big)$,\n    Set $a_t^+ = \\pi_{\\theta+h}\\big(\\sum_{i=1}^{N} \\frac{l_t^{i,+}}{\\sum_{j=1}^{N} l_t^{j,+}} f(x_t^{i,+})\\big)$, set $a_t^- = \\pi_{\\theta-h}\\big(\\sum_{i=1}^{N} \\frac{l_t^{i,-}}{\\sum_{j=1}^{N} l_t^{j,-}} f(x_t^{i,-})\\big)$,\nend for\nReturn: $\\partial_h J_\\omega^N \\stackrel{\\mathrm{def}}{=} \\sum_{t=1}^{n} \\frac{r(x_t^+) - r(x_t^-)}{2h}$.\n\nIt is interesting to note that in this problem, the performance is optimal for \u03b8\u2217 \u2243 0.3 (which is slightly\nbetter than for \u03b8 = 0). \u03b8 = 0 would correspond to the best feed-back policy if the state was perfectly\nknown. However, moving in a direction orthogonal to the goal helps improve localization. 
Here,\nthe optimal policy exhibits a tradeoff between greedy optimization and localization.\n\n                       h = 10^0           h = 10^-2         h = 10^-4           h = 10^-6\nBias / Variance NFD    0.57 / 6.05e-3     0.31 / 0.13       unreliable / 25.3   unreliable / 6980\nBias / Variance CIFD   0.428 / 0.022      0.00192 / 0.019   0.00247 / 0.02      0.00162 / 0.0188\n\nThe table above shows the (empirically measured) bias and variance of the naive FD (NFD) (using\ncommon random numbers) method and the common indices FD (CIFD) method, for a specific value\n\u03b8 = 0.5 (with $N = 10^3$, M = 500). As predicted, the variance of the NFD approach makes this\nmethod inapplicable, whereas that of the CIFD is reasonable.\n\n[Figure 2: two plots against the parameter \u03b8 \u2208 [0, 1]. Left y-axis: performance estimate. Right y-axis: gradient estimate.]\n\nFigure 2: Left: Estimator $\\frac{1}{M}\\sum_{m=1}^{M} J_{\\omega_m}^N(\\theta)$ of J(\u03b8) and confidence intervals $\\pm\\sqrt{\\mathrm{Var}[J_\\omega^N(\\theta)]/M}$.\nRight: Estimator $\\frac{1}{M}\\sum_{m=1}^{M} \\partial_h J_{\\omega_m}^N(\\theta)$ of \u2202J(\u03b8) and confidence intervals $\\pm\\sqrt{\\mathrm{Var}[\\partial_h J_\\omega^N(\\theta)]/M}$.\n\nReferences\nAndrieu, C., Doucet, A., Singh, S., & Tadic, V. (2004). Particle methods for change detection, identification and control. Proceedings of the IEEE, 92, 423\u2013438.\nBaxter, J., & Bartlett, P. (1999). Direct gradient-based reinforcement learning. Journal of Artificial Intelligence Research.\nCapp\u00e9, O., Douc, R., & Moulines, E. (2005). Comparison of resampling schemes for particle filtering. 
4th International Symposium on Image and Signal Processing and Analysis.\nC\u00e9rou, F., LeGland, F., & Newton, N. (2001). Stochastic particle methods for linear tangent filtering equations, 231\u2013240. IOS Press, Amsterdam.\nCoquelin, P., Deguest, R., & Munos, R. (2008). Sensitivity analysis in particle filters. Application to policy optimization in POMDPs (Technical Report). INRIA, RR-6710.\nDel Moral, P. (2004). Feynman-Kac formulae, genealogical and interacting particle systems with applications. Springer.\nDel Moral, P., & Miclo, L. (2000). Branching and interacting particle systems. Approximations of Feynman-Kac formulae with applications to non-linear filtering. S\u00e9minaire de probabilit\u00e9s de Strasbourg, 34, 1\u2013145.\nDouc, R., & Moulines, E. (2008). Limit theorems for weighted samples with applications to sequential Monte Carlo methods. To appear in Annals of Statistics.\nDoucet, A., Freitas, N. D., & Gordon, N. (2001). Sequential Monte Carlo methods in practice. Springer.\nDoucet, A., & Tadic, V. (2003). Parameter estimation in general state-space models using particle methods. Ann. Inst. Stat. Math.\nFichoud, J., LeGland, F., & Mevel, L. (2003). Particle-based methods for parameter estimation and tracking: numerical experiments (Technical Report 1604). IRISA.\nFox, D., Thrun, S., Burgard, W., & Dellaert, F. (2001). Particle filters for mobile robot localization. Sequential Monte Carlo Methods in Practice. New York: Springer.\nGlasserman, P. (1991). Gradient estimation via perturbation analysis. Kluwer.\nGlasserman, P. (2003). Monte Carlo methods in financial engineering. Springer.\nGordon, N., Salmond, D., & Smith, A. F. M. (1993). Novel approach to nonlinear and non-Gaussian Bayesian state estimation. Proceedings IEE-F (pp. 107\u2013113).\nKaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. 
Artificial Intelligence, 101, 99\u2013134.\nKitagawa, G. (1996). Monte-Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Stat., 5, 1\u201325.\nLovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28, 47\u201366.\nPoyadjis, G., Doucet, A., & Singh, S. (2005). Particle methods for optimal filter derivative: Application to parameter estimation. IEEE ICASSP.\nRabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257\u2013286.\nSpall, J. C. (2000). Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control, 45, 1839\u20131853.\n", "award": [], "sourceid": 628, "authors": [{"given_name": "Pierre-arnaud", "family_name": "Coquelin", "institution": null}, {"given_name": "Romain", "family_name": "Deguest", "institution": null}, {"given_name": "R\u00e9mi", "family_name": "Munos", "institution": null}]}