{"title": "Sensitivity analysis in HMMs with application to likelihood maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 387, "page_last": 395, "abstract": "This paper considers a sensitivity analysis in Hidden Markov Models with continuous state and observation spaces. We propose an Infinitesimal Perturbation Analysis (IPA) on the filtering distribution with respect to some parameters of the model. We describe a methodology for using any algorithm that estimates the filtering density, such as Sequential Monte Carlo methods, to design an algorithm that estimates its gradient. The resulting IPA estimator is proven to be asymptotically unbiased, consistent and has computational complexity linear in the number of particles. We consider an application of this analysis to the problem of identifying unknown parameters of the model given a sequence of observations. We derive an IPA estimator for the gradient of the log-likelihood, which may be used in a gradient method for the purpose of likelihood maximization. We illustrate the method with several numerical experiments.", "full_text": "Sensitivity analysis in HMMs\n\nwith application to likelihood maximization\n\nPierre-Arnaud Coquelin,\n\nVekia, Lille, France\n\npacoquelin@vekia.fr\n\nRomain Deguest\u2044\n\nColumbia University, New York City, NY 10027\n\nrd2304@columbia.edu\n\nINRIA Lille - Nord Europe, Sequel Project, France\n\nR\u00e9mi Munos\n\nremi.munos@inria.fr\n\nAbstract\n\nThis paper considers a sensitivity analysis in Hidden Markov Models with con-\ntinuous state and observation spaces. We propose an In\ufb01nitesimal Perturbation\nAnalysis (IPA) on the \ufb01ltering distribution with respect to some parameters of the\nmodel. We describe a methodology for using any algorithm that estimates the \ufb01l-\ntering density, such as Sequential Monte Carlo methods, to design an algorithm\nthat estimates its gradient. 
The resulting IPA estimator is proven to be asymptotically unbiased, consistent and has computational complexity linear in the number of particles.
We consider an application of this analysis to the problem of identifying unknown parameters of the model given a sequence of observations. We derive an IPA estimator for the gradient of the log-likelihood, which may be used in a gradient method for the purpose of likelihood maximization. We illustrate the method with several numerical experiments.

1 Introduction

We consider a parameterized hidden Markov model (HMM) defined on continuous state and observation spaces. The HMM is defined by a state process (X_t)_{t≥0} ∈ X and an observation process (Y_t)_{t≥1} ∈ Y that are parameterized by a continuous parameter θ = (θ_1, ..., θ_d) ∈ Θ, where Θ is a compact subset of R^d.
The state process is a Markov chain taking its values in a (measurable) state space X, with initial probability measure μ ∈ M(X) (i.e. X_0 ∼ μ) and Markov transition kernel K(θ, x_t, dx_{t+1}). We assume that we can sample this Markov chain using a transition function F and independent random numbers, i.e. for all t ≥ 0,

X_{t+1} = F(θ, X_t, U_t), with U_t i.i.d. ∼ ν,   (1)

where F : Θ × X × U → X and (U, σ(U), ν) is a probability space. In many practical situations U = [0, 1]^p and ν is uniform, so that U_t is a p-tuple of uniform random numbers. For simplicity, we adopt the notation F(θ, x_{−1}, u) ≜ F_μ(θ, u), where F_μ is the first transition function (i.e. X_0 = F_μ(θ, U_{−1}) with U_{−1} ∼ ν).
The observation process (Y_t)_{t≥1} lies in a (measurable) space Y and is linked with the state process by the conditional probability measure P(Y_t ∈ dy_t | X_t = x_t) = g(θ, x_t, y_t) dy_t, where g : Θ × X × Y → [0, 1] is the marginal density function of Y_t given X_t. We assume that observations are conditionally independent given the state.

∗ also affiliated with CMAP, Ecole Polytechnique, France

Since the transition and observation processes are parameterized by θ, the state process X_t and the observation process Y_t depend explicitly on θ. For notational simplicity we omit the dependence on θ (in K, F, g, X_t, Y_t, ...) when there is no possible ambiguity.
One of the main interests in HMMs is to recover the state at time n given a sequence of past observations (y_1, ..., y_n) (written y_{1:n}). The filtering distribution (or belief state)

π_n(dx_n) ≜ P(X_n ∈ dx_n | Y_{1:n} = y_{1:n})

is the distribution of X_n conditioned on the information y_{1:n}. We define analogously the predictive distribution

π_{n+1|n}(dx_{n+1}) ≜ P(X_{n+1} ∈ dx_{n+1} | Y_{1:n} = y_{1:n}).

Our contribution is an Infinitesimal Perturbation Analysis (IPA) that estimates the gradient ∇π_n (where ∇ refers to the derivative with respect to the parameter θ) of the filtering distribution π_n. More precisely, we estimate ∇π_n(f) (where π(f) ≜ ∫_X f(x) π(dx)) for any integrable function f under the filtering distribution π_n.
We also consider as an application the problem of parameter identification in HMMs, which consists in estimating the (unknown) parameter θ* of the model that has served to generate the sequence of observations.
In a Maximum Likelihood (ML) approach, one searches for the parameter θ that maximizes the likelihood (or its logarithm) given the sequence of observations. The log-likelihood of parameter θ is defined by l_n(θ) ≜ log p_θ(y_{1:n}), where p_θ(y_{1:n}) dy_{1:n} ≜ P(Y_{1:n}(θ) ∈ dy_{1:n}). The Maximum Likelihood (ML) estimator θ̂_n ≜ arg max_{θ∈Θ} l_n(θ) is asymptotically consistent (in the sense that θ̂_n converges almost surely to the true parameter θ* when n → ∞) under some identifiability conditions and mild assumptions on the model; see Theorem 2 of [DM01]. Thus, using the ML approach, the parameter identification problem reduces to an optimization problem.
Our second contribution is a sensitivity analysis of the predictive distribution, ∇π_{t+1|t} for t < n, which enables us to estimate the gradient ∇l_n(θ) of the log-likelihood function, which may be used in a (stochastic) gradient method for the purpose of optimizing the likelihood. The approach is numerically illustrated on two parameter identification problems (an autoregressive model and a stochastic volatility model) and compared to other approaches (the EM algorithm, the Kalman filter, and the likelihood ratio approach) when these latter apply.

2 Links with other works

First, let us mention that we are interested in the continuous state case since numerous applications in signal processing, finance, robotics, or telecommunications naturally fit in this framework. In the general setting there exists no closed-form expression of the filtering distribution (unlike in finite spaces, where the Viterbi algorithm may apply, or in linear-Gaussian models, where the Kalman filter can be used).
Thus, in this paper, we will make use of the so-called Sequential Monte Carlo (SMC) methods (also known as particle filters), which are numerical tools that can be applied to a large class of models; see e.g. [DFG01]. For illustration, a challenging example in finance is the problem of parameter estimation in the stochastic volatility model, a non-linear, non-Gaussian continuous-space HMM parameterized by three continuous parameters (see e.g. [ME07]), which will be described in the experimental section.
A usual approach for parameter estimation consists in performing maximum likelihood estimation (MLE), i.e. searching for the most likely value of the parameter given the observed data. For finite state space problems, the Expectation Maximization (EM) algorithm is a popular method for solving the MLE problem. However, in continuous-space problems (see [CM05]) the EM algorithm is difficult to use, mainly because the Expectation part relies on the estimation of the posterior path measure, which is intractable in many situations. The Maximization part may also be very complicated and time-consuming when the model does not belong to a linear or exponential family. An alternative method consists in using brute-force optimization methods based on the evaluation of the likelihood, such as grid-based or simulated annealing methods. These approaches, which can be seen as black-box optimization, are not very efficient in high-dimensional parameter spaces.
Another approach is to treat the parameter as part of the state variable and then compute the optimal filter (see [DFG01] and [Sto02]). In this case, the Bayesian posterior distribution of the parameter is a marginal of the optimal filter.
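For concreteness, the parameter-as-state idea can be sketched as follows (a hypothetical toy illustration, not the paper's chosen method): each particle carries a static copy of the unknown parameter φ that is resampled along with its state, so that the filter's φ-marginal plays the role of the Bayesian posterior over φ.

```python
import numpy as np

# Toy sketch of the "parameter as part of the state" approach for an AR1 model
# (hypothetical illustration): each particle is a pair (x, phi) with phi static.
rng = np.random.default_rng(1)
N = 1000
phi = rng.uniform(0.5, 1.0, N)       # prior draws of the unknown parameter
x = rng.standard_normal(N)           # initial state particles

def step(x, phi, y, rng):
    x = phi * x + rng.standard_normal(N)        # propagate each state with its own phi (sigma = 1)
    w = np.exp(-0.5 * (y - x) ** 2)             # observation weights for Y_t = X_t + V_t (beta = 1)
    idx = rng.choice(N, size=N, p=w / w.sum())  # multinomial resampling
    return x[idx], phi[idx]                     # the parameter is resampled with its particle

for y in [0.3, -0.1, 0.5]:
    x, phi = step(x, phi, y, rng)
post_mean_phi = phi.mean()
```

The model choices (AR1, σ = β = 1, uniform prior on φ) are ours; since the parameter is never rejuvenated, its particle diversity can only shrink, which is one concrete face of the degeneracy issue raised below.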
It is well known that those methods are stable only under certain conditions (see [Pap07]) and do not perform well in practice for a large number of time steps.
A last solution consists in using an optimization procedure based on the evaluation of the gradient of the log-likelihood function with respect to the parameter. These approaches have been studied in the field of continuous-space HMMs, e.g. in [DT03, FLM03, PDS05, Poy06]. The idea was to use a likelihood ratio approach (also called the score method) to evaluate the gradient of the likelihood. This approach suffers from high variance of the estimator, in particular for problems with small noise in the dynamics. To tackle this issue, [PDS05] proposed to use a marginal particle filter instead of a simple path-based particle filter as the Monte Carlo approximation method. This approach is efficient in terms of variance reduction, but its computational complexity becomes quadratic in the number of particles instead of linear, as in path-based particle methods.
The IPA approach proposed in this paper is an alternative gradient-based maximum likelihood approach. Compared with the gradient approaches previously cited, IPA usually provides estimators with lower variance than the likelihood ratio methods, and its numerical complexity is linear in the number of particles.
Other works related to ours are the so-called tangent filter approach described in [CGN01] for dynamics coming from a discretization of a diffusion process, and the Finite-Difference (FD) approach described in a different setting (policy gradient in Partially Observable Markov Decision Processes) in [CDM08].
A similar FD estimator could be designed in our setting too, but the resulting FD estimator would be biased (like usual FD schemes), whereas the IPA estimator is not.

3 Sequential Monte Carlo methods (SMC)

Given a measurable test function f : X → R, we have:

π_n(f) ≜ E[f(X_n) | Y_{1:n} = y_{1:n}] = ∫ f(x_n) ∏_{t=0}^n K(x_{t−1}, dx_t) G_t(x_t) / ∫ ∏_{t=0}^n K(x_{t−1}, dx_t) G_t(x_t) = E[f(X_n) ∏_{t=0}^n G_t(X_t)] / E[∏_{t=0}^n G_t(X_t)],   (2)

where we used the simplified notation G_t(x_t) ≜ g(x_t, y_t) and G_0(x_0) ≜ 1.
In general, it is impossible to write π_n(f) analytically, except in specific cases (such as linear/Gaussian models with Kalman filtering). In this paper, we consider a numerical approximation of π_n(f) based on a SMC method. But it should be mentioned that other methods (such as the Extended Kalman filter, quantization methods, or Markov Chain Monte Carlo methods) may be used as well to build the IPA estimator that we propose in the next section.
The basic SMC method, called the Bootstrap Filter (see [DFG01] for details), approximates π_n(f) by an empirical distribution π^N_n(f) ≜ (1/N) Σ_{i=1}^N f(x^i_n) made of N particles x^{1:N}_n.

Algorithm 1 Generic Sequential Monte Carlo
for t = 1 to n do
  Sampling: Sample u^i_t iid ∼ ν and set x̃^i_t = F(x^i_{t−1}, u^i_t), ∀i ∈ {1, ..., N}. Then define the importance sampling weights w^i_t = G_t(x̃^i_t) / Σ_{j=1}^N G_t(x̃^j_t).
  Resampling: Set x^i_t = x̃^{k_i}_t, ∀i ∈ {1, ..., N}, where k^{1:N} are indices selected from the weights w^{1:N}_t.
end for
RETURN: π^N_n(f) = (1/N) Σ_{i=1}^N f(x^i_n)

The sampling (or transition) step generates a successor particle population x̃^{1:N}_t from the previous population x^{1:N}_{t−1} according to the state dynamics. The importance sampling weights w^{1:N}_t are evaluated, and the resampling (or selection) step resamples (with replacement) N particles x^{1:N}_t from the set x̃^{1:N}_t according to the weights w^{1:N}_t. Resampling is used to avoid the problem of degeneracy of the algorithm, i.e. that most of the weights decrease to zero. It consists in selecting new particle positions so as to preserve a consistency property (i.e. Σ_{i=1}^N w^i_t φ(x̃^i_t) = E[(1/N) Σ_{i=1}^N φ(x^i_t)]).
The simplest version, introduced in [GSS93], chooses the selection indices k^{1:N} by independent sampling from the set {1, ..., N} according to a multinomial distribution with parameters w^{1:N}_t, i.e. P(k^i_t = j) = w^j_t, for all 1 ≤ i ≤ N. The idea is to replicate the particles in proportion to their weights. Many variants have been proposed in the literature, among which the stratified resampling method [Kit96], which is optimal in terms of variance minimization.
Convergence issues of π^N_n(f) to π_n(f) (e.g. Laws of Large Numbers or Central Limit Theorems) are discussed in [Del04] or [DM08].
For our purpose we note that, under mild conditions on f, π^N_n(f) is an asymptotically unbiased (see [DMDP07] for the asymptotic expression of the bias) and consistent estimator of π_n(f).

4 Infinitesimal Perturbation Analysis in HMMs

4.1 Sensitivity analysis of the filtering distribution

The following decomposition of the gradient of the filtering distribution π_n applied to a function f:

∇[π_n(f)] = ∇( E[f(X_n) ∏_{t=0}^n G_t(X_t)] / E[∏_{t=0}^n G_t(X_t)] ) = ∇E[f(X_n) ∏_{t=0}^n G_t(X_t)] / E[∏_{t=0}^n G_t(X_t)] − π_n(f) ∇E[∏_{t=0}^n G_t(X_t)] / E[∏_{t=0}^n G_t(X_t)]   (3)

shows that the problem of finding an estimator of ∇π_n(f) is reduced to the problem of finding an estimator of ∇E[f(X_n) ∏_{t=0}^n G_t(X_t)]. There are two dominant infinitesimal methods for estimating the gradient of an expectation in a Markov chain: the Infinitesimal Perturbation Analysis (IPA) method and the Score Function (SF) method (also called the likelihood ratio method); see for instance [Gla91] and [Pfl96] for a detailed presentation of both methods. SF has been used in [DT03, FLM03] to estimate ∇π_n. Although IPA is known for having a lower variance than SF in general, as far as we know it has never been used in this context. This is therefore the object of this section.
Under appropriate smoothness assumptions (see Proposition 1 below), the gradient of an expectation over a random variable X is equal to an expectation involving the pair of random variables (X, ∇X) (where ′ refers to the derivative with respect to the state variable).
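A toy numerical check of this pathwise identity (our own illustration; the choices of X, f and the tolerance are assumptions): with X(θ) = θ + U, U ∼ N(0, 1), and f(x) = x², we have ∇X = 1 and ∇E[f(X)] = E[2X] = 2θ.

```python
import numpy as np

# Toy check of the pathwise (IPA) identity with X(theta) = theta + U, U ~ N(0, 1),
# and f(x) = x^2: here dX/dtheta = 1 and d/dtheta E[f(X)] = E[2 X] = 2 * theta.
rng = np.random.default_rng(0)
theta = 1.5
U = rng.standard_normal(100_000)
X = theta + U                    # the same random numbers U define X(theta) for every theta
ipa = np.mean(2.0 * X * 1.0)     # Monte Carlo estimate of E[f'(X) * dX/dtheta]
```

With 10^5 samples, `ipa` lies within a few hundredths of the exact value 2θ = 3; the key point is that differentiating the sample path (holding U fixed) and averaging commute.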
Applying this property,

∇E[f(X)] = E[∇[f(X)]] = E[f′(X) ∇X],

to estimate ∇E[f(X_n) ∏_{t=0}^n G_t(X_t)], we deduce

∇E[f(X_n) ∏_{t=0}^n G_t(X_t)] = E[∇[f(X_n) ∏_{t=0}^n G_t(X_t)]]
= E[( ∇[f(X_n)] + f(X_n) Σ_{t=0}^n ∇[G_t(X_t)] / G_t(X_t) ) ∏_{t=0}^n G_t(X_t)]
= E[( f′(X_n) ∇X_n + f(X_n) Σ_{t=0}^n (G′_t(X_t) ∇X_t + ∇G_t(X_t)) / G_t(X_t) ) ∏_{t=0}^n G_t(X_t)].   (4)

Now we define an augmented Markov chain (X_t, Z_t, R_t)_{t≥0} (where Z_t ≜ ∇X_t) by the following recursive relations:

X_0 = F_μ(U_{−1}), U_{−1} ∼ ν;   X_{t+1} = F(X_t, U_t), where U_t ∼ ν,
Z_0 = ∇F_μ(U_{−1});   Z_{t+1} = ∇F(X_t, U_t) + F′(X_t, U_t) Z_t,
R_0 = 0;   R_{t+1} = R_t + (G′_{t+1}(X_{t+1}) Z_{t+1} + ∇G_{t+1}(X_{t+1})) / G_{t+1}(X_{t+1}),   for all t ≥ 0.

By introducing this augmented Markov chain in Equation (4) and using Equation (3), we can rewrite ∇π_n(f) as:

∇π_n(f) = E[(f′(X_n) Z_n + f(X_n) R_n) ∏_{t=0}^n G_t(X_t)] / E[∏_{t=0}^n G_t(X_t)] − π_n(f) E[R_n ∏_{t=0}^n G_t(X_t)] / E[∏_{t=0}^n G_t(X_t)]
= E[(f′(X_n) Z_n + R_n (f(X_n) − π_n(f))) ∏_{t=0}^n G_t(X_t)] / E[∏_{t=0}^n G_t(X_t)].   (5)

We now state some sufficient conditions under which the previous derivations are sound.

Proposition 1. Equation (5) is valid on Θ whenever the following conditions are satisfied:

• for all θ ∈ Θ, the path θ ↦ (X_0, X_1, ..., X_n)(θ) is almost surely (a.s.) differentiable,
• for all θ ∈ Θ, f is a.s. continuously differentiable at X_n(θ), and for all 1 ≤ t ≤ n, G_t is a.s.
continuously differentiable at (θ, X_t(θ)),
• θ ↦ f(X_n(θ)) and, for all 1 ≤ t ≤ n, θ ↦ G_t(θ, X_t(θ)) are a.s. continuous and piecewise differentiable throughout Θ,
• let D be the random subset of Θ at which f(X_n(θ)) or one of the G_t(θ, X_t(θ)) fails to be differentiable. We require that E[sup_{θ∉D} |f′(X_n) Z_n + R_n (f(X_n) − π_n(f))| ∏_{t=0}^n G_t(X_t)] < ∞.

The proof of this Proposition is a direct application of Theorem 1.2 from [Gla91]. We notice that requiring the a.s. differentiability of the path θ ↦ (X_0, X_1, ..., X_n)(θ) is equivalent to requiring that, for all θ ∈ Θ, the transition function F is a.s. continuously differentiable with respect to θ.
From Equation (5), we can derive the IPA estimator of ∇π_n(f) by using a SMC algorithm:

I^N_n ≜ (1/N) Σ_{i=1}^N [ f′(x^i_n) z^i_n + f(x^i_n) ( r^i_n − (1/N) Σ_{j=1}^N r^j_n ) ],   (6)

where (x^i_n, z^i_n, r^i_n) are particles derived by using a SMC algorithm on the augmented Markov chain (X_t, Z_t, R_t) described in Algorithm 2.

Algorithm 2 IPA estimation of ∇π_n
for t = 1 to n do
  For all i ∈ {1, ..., N} do:
    Sample u^i_t iid ∼ ν and set x̃^i_t = F(x^i_{t−1}, u^i_t),
    Set z̃^i_t = ∇F(x^i_{t−1}, u^i_t) + F′(x^i_{t−1}, u^i_t) z^i_{t−1},
    Set r̃^i_t = r^i_{t−1} + (G′_t(x̃^i_t) z̃^i_t + ∇G_t(x̃^i_t)) / G_t(x̃^i_t), and compute the weights w^i_t = G_t(x̃^i_t) / Σ_j G_t(x̃^j_t),
  Set (x^i_t, z^i_t, r^i_t) = (x̃^{k_i}_t, z̃^{k_i}_t, r̃^{k_i}_t), where k^{1:N} are the indices selected from w^{1:N}_t.
end for
RETURN: I^N_n = (1/N) Σ_{i=1}^N [ f′(x^i_n) z^i_n + f(x^i_n) ( r^i_n − (1/N) Σ_j r^j_n ) ]

Proposition 2. Under the assumptions of Proposition 1, the estimator I^N_n defined by (6) has a bias O(N^{−1}) and is consistent with ∇π_n(f), i.e. E[I^N_n] = ∇π_n(f) + O(N^{−1}), and lim_{N→∞} I^N_n = ∇π_n(f) almost surely. In addition, its (asymptotic) variance is O(N^{−1}).

Proof. We use the general SMC convergence properties for Feynman-Kac (FK) models (see [Del04] or [DM08]) which, applied to a FK flow with Markov chain X_{0:n}, (random) potential functions G(X_{0:n}), and test function H(X_{0:n}), state that the SMC estimate (1/N) Σ_{i=1}^N H(x^i_{0:n}) is consistent with E[H(X_{0:n}) ∏_{t=0}^n G(X_t)] / E[∏_{t=0}^n G(X_t)]. Moreover, an asymptotic expression of the bias, given in [DMDP07], shows that it is of order O(N^{−1}). Applying those results to the test function H ≜ f′(X_n) Z_n + R_n (f(X_n) − π_n(f)), using the representation (5) of the gradient, we deduce that the SMC estimator (6) is asymptotically unbiased and consistent with ∇π_n(f). The asymptotic variance is O(N^{−1}) since the Central Limit Theorem (see e.g. [Del04, DM08]) applies to the IPA estimator (6) of (5).

Remark 1.
Notice that the computation of the gradient estimator requires O(nNmd) elementary operations (where m is the dimension of X), which is linear in the number of particles N and linear in the number of parameters d, and has memory requirement O(Nmd).

4.2 Gradient of the log-likelihood

In the Maximum Likelihood approach for the problem of parameter identification, one may follow a stochastic gradient method for maximizing the log-likelihood l_n(θ), where the gradient

∇l_n(θ) = Σ_{t=0}^{n−1} ∇π_{t+1|t}(G_{t+1}) / π_{t+1|t}(G_{t+1})

is obtained by estimating each term ∇π_{t+1|t}(G_{t+1}) of the sum using a similar decomposition as in (5) and (4) for the predictive distribution applied to G_{t+1}:

∇π_{t+1|t}(G_{t+1}) = ∇( E[G_{t+1}(X_{t+1}) ∏_{k=0}^t G_k(X_k)] / E[∏_{k=0}^t G_k(X_k)] )
= ∇E[G_{t+1}(X_{t+1}) ∏_{k=0}^t G_k(X_k)] / E[∏_{k=0}^t G_k(X_k)] − π_{t+1|t}(G_{t+1}) ∇E[∏_{k=0}^t G_k(X_k)] / E[∏_{k=0}^t G_k(X_k)]

with

∇E[G_{t+1}(X_{t+1}) ∏_{k=0}^t G_k(X_k)] = E[( ∇G_{t+1}(X_{t+1}) + G′_{t+1}(X_{t+1}) ∇X_{t+1} + G_{t+1}(X_{t+1}) Σ_{k=0}^t (G′_k(X_k) ∇X_k + ∇G_k(X_k)) / G_k(X_k) ) ∏_{k=0}^t G_k(X_k)].

We deduce the IPA estimator of ∇l_n(θ):

J^N_n ≜ Σ_{t=1}^n [ Σ_{i=1}^N ( ∇G_t(x̃^i_t) + G′_t(x̃^i_t) z̃^i_t + G_t(x̃^i_t) ( r^i_{t−1} − (1/N) Σ_j r^j_{t−1} ) ) / Σ_{i=1}^N G_t(x̃^i_t) ],

where (x^i_n, z^i_n, r^i_n) (and (x̃^i_n, z̃^i_n, r̃^i_n)) are particles derived by using a SMC algorithm on the augmented Markov chain (X_t, Z_t, R_t) described in the previous subsection. Using similar arguments as those detailed in the proofs of Propositions 1 and 2, we have that this estimator is asymptotically unbiased and consistent with ∇l_n(θ).
The resulting gradient algorithm is described in Algorithm 3. The steps γ_k are chosen appropriately so that local convergence occurs (e.g. such that Σ_{k≥1} γ_k = ∞ and Σ_{k≥1} γ_k² < ∞); see e.g. [KY97] for a detailed analysis of Stochastic Approximation algorithms.

Algorithm 3 Likelihood maximization by gradient ascent using the IPA estimator of ∇l_n(θ)
for k = 1, 2, ..., number of gradient steps do
  Initialize J^N_0 = 0
  for t = 1 to n do
    For all i ∈ {1, ..., N} do:
      Sample u^i_t iid ∼ ν and set x̃^i_t = F(x^i_{t−1}, u^i_t),
      Set z̃^i_t = ∇F(x^i_{t−1}, u^i_t) + F′(x^i_{t−1}, u^i_t) z^i_{t−1},
      Set r̃^i_t = r^i_{t−1} + (G′_t(x̃^i_t) z̃^i_t + ∇G_t(x̃^i_t)) / G_t(x̃^i_t),
    Set J^N_t = J^N_{t−1} + Σ_{i=1}^N ( ∇G_t(x̃^i_t) + G′_t(x̃^i_t) z̃^i_t + G_t(x̃^i_t) ( r^i_{t−1} − (1/N) Σ_j r^j_{t−1} ) ) / Σ_{i=1}^N G_t(x̃^i_t), and compute the weights w^i_t = G_t(x̃^i_t) / Σ_j G_t(x̃^j_t),
    Set (x^i_t, z^i_t, r^i_t) = (x̃^{k_i}_t, z̃^{k_i}_t, r̃^{k_i}_t), where k^{1:N} are indices selected from w^{1:N}_t.
  end for
  Perform a gradient ascent step: θ_k = θ_{k−1} + γ_k J^N_n(θ_{k−1})
end for

Figure 1: Box-and-whiskers plots of the three parameters (φ, σ, β) estimates for the AR1 model with θ* = (0.8, 1.0, 1.0). We compare three methods: (1) Kalman, (2) EM and (3) IPA. Here we
Here we\nused n = 500 observations and N = 102 particles.\n\n5 Numerical experiments\n\nWe consider two typical problems and report our results focussing on the variance of the estimator:\nAutoregressive model AR1 is a simple linear-Gaussian HMMs thus may be solved by other meth-\nods (such as Kalman \ufb01ltering and EM algorithms) which enables to compare the performances of\nseveral algorithms for parameter identi\ufb01cation. The dynamics are\n\nX0 \u223c N (0, \u03c32),\n\nand for t \u2265 1, Xt = \u03c6Xt\u00a11 + \u03c3Ut,\n\ni.i.d.\u223c N (0, 1) and Vt\n\n(7)\ni.i.d.\u223c N (0, 1) are independent sequences of random variables, and\n\nYt = Xt + \u03b2Vt,\n\nwhere Ut\n\u03b8 = (\u03c6, \u03c3, \u03b2) is a three-dimensional parameter in (R+)3.\nStochastic volatility model is very popular in the \ufb01eld of quantitative \ufb01nance [ME07] to evaluate\nderivative securities, such as options. This is a non-linear non-Gaussian model, so the Kalman\nmethod cannot be used anymore. The dynamics are\n\nX0 \u223c N (0, \u03c32),\n\nand for t \u2265 1, Xt = \u03c6Xt\u00a11 + \u03c3Ut,\n\nYt = \u03b2 exp (Xt/2) Vt,\n\n(8)\n\ni.i.d.\u223c N (0, 1) and Vt\n\ni.i.d.\u223c N (0, 1) and the parameter \u03b8 = (\u03c6, \u03c3, \u03b2) \u2208 (R+)3.\n\nwhere again Ut\n5.1 Parameter identi\ufb01cation\nFigure 1 shows the results of our IPA gradient estimator for the AR1 parameter identi\ufb01cation prob-\nlem and compares those with two other methods: Kalman \ufb01lter (K) and EM (which apply since the\n\u2044 = (0.8, 1.0, 1.0). Notice the apparent\nmodel is linear-Gaussian). The unknown parameter used is \u03b8\n\u2044 (even for Kalman which provides here the exact\nbias of the three methods in the estimation of \u03b8\n\ufb01ltering distribution) since the number of observations n = 500 is \ufb01nite. For IPA, we used N = 102\nparticles and 150 gradient iterations. 
Algorithm 3 was run 50 times with random starting points uniformly drawn in [θ̲, θ̄], where θ̲ = (0.5, 0.5, 0.5) and θ̄ = (1.0, 1.5, 1.5), in order to illustrate that the method is not sensitive to the starting point.
We observe that in terms of estimation accuracy, IPA is very competitive with the other methods, Kalman and EM, which are designed for specific models (here linear-Gaussian). The IPA method applies to general models, for example to the stochastic volatility model. Figure 2 shows the sets of estimates of θ* = (0.8, 1.0, 1.0) using IPA with n = 10³ observations and N = 10² particles (no comparison is made here since Kalman does not apply and EM becomes more complicated).

5.2 Variance study for Score and IPA algorithms

IPA and Score methods provide gradient estimators for general models. We compare the variance of the corresponding estimators of the gradient ∇l_n for the AR1 model, since for this model we know its exact value (using Kalman).

Figure 2: Box-and-whiskers plots of the three parameters (φ, σ, β) estimates for the IPA method applied to the stochastic volatility model with θ* = (0.8, 1.0, 1.0). We used n = 10³ observations and N = 10² particles.

Figure 3 shows the variance of the IPA and Score estimators of the partial derivative ∂_σ l_n (we focused our study on σ since the problem of volatility estimation is challenging, and also because the value of σ influences the respective performances of the two algorithms, which is not the case for the other parameters φ, β). We used n = N = 10³. The IPA estimator performs better than the Score estimator for small values of σ.
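A minimal simulator for the stochastic volatility dynamics (8), useful for reproducing such variance studies (our own sketch; the parameter values are the paper's θ* = (0.8, 1.0, 1.0), and the function name is an assumption):

```python
import numpy as np

def sample_sv(theta, n, rng):
    # Simulate the stochastic volatility model of Eq. (8):
    # X_t = phi * X_{t-1} + sigma * U_t,   Y_t = beta * exp(X_t / 2) * V_t.
    phi, sigma, beta = theta
    x = sigma * rng.standard_normal()       # X_0 ~ N(0, sigma^2)
    ys = []
    for _ in range(n):
        x = phi * x + sigma * rng.standard_normal()
        ys.append(beta * np.exp(x / 2.0) * rng.standard_normal())
    return np.array(ys)

ys = sample_sv((0.8, 1.0, 1.0), 1000, np.random.default_rng(0))
```

Because the observation density is Gaussian in V_t with state-dependent scale β exp(X_t/2), the model is non-linear and non-Gaussian in X_t, which is why the Kalman filter does not apply here.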
On the other hand, when the variance in the state model is large, it is better to use the Score estimator.

Figure 3: Variance of the log-likelihood derivative ∂_σ l_n computed with both the IPA and Score methods. The true parameter is θ* = (φ*, σ*, β*) = (0.8, 1.0, 1.0) and the estimations are computed at θ = (0.7, σ, 0.9).

Let us mention that the variance of the IPA (as well as the Score) estimator increases when the number of observations n increases. However, under weak conditions on the HMM [LM00], the filtering distribution and its gradient forget the initial distribution exponentially fast. This property has already been used for EM estimators in [CM05] to show that fixed-lag smoothing drastically reduces the variance without significantly raising the bias. Similar smoothing (either fixed-lag or discounted) would provide efficient variance reduction techniques for the IPA estimator as well.

6 Conclusions

We proposed a sensitivity analysis in HMMs based on an Infinitesimal Perturbation Analysis and provided a computationally efficient gradient estimator that is an interesting alternative to the usual Score method. We showed how this analysis may be used for estimating the gradient of the log-likelihood in a gradient-based likelihood maximization approach for the purpose of parameter identification. Finally, let us mention that estimators of higher-order derivatives (e.g. the Hessian) could be derived as well along this IPA approach, which would enable the use of more sophisticated optimization techniques (e.g. Newton's method).

References

[CDM08] P.A. Coquelin, R. Deguest, and R. Munos. Particle filter-based policy gradient in POMDPs. In Neural Information Processing Systems, 2008.
[CGN01] F. Cérou, F. Le Gland, and N. J. Newton. Stochastic particle methods for linear tangent filtering equations. In J.-L. Menaldi, E. Rofman, and A. Sulem, editors, Optimal Control and PDE's - Innovations and Applications, in honor of Alain Bensoussan's 60th anniversary, pages 231-240. IOS Press, 2001.
[CM05] O. Cappé and E. Moulines. On the use of particle filtering for maximum likelihood parameter estimation. European Signal Processing Conference, 2005.
[Del04] P. Del Moral. Feynman-Kac Formulae, Genealogical and Interacting Particle Systems with Applications. Springer, 2004.
[DFG01] A. Doucet, N. De Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001.
[DM01] R. Douc and C. Matias. Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli, 7:381-420, 2001.
[DM08] R. Douc and E. Moulines. Limit theorems for weighted samples with applications to sequential Monte Carlo methods. Annals of Statistics, 36(5):2344-2376, 2008.
[DMDP07] P. Del Moral, A. Doucet, and G.W. Peters. Sharp propagation of chaos estimates for Feynman-Kac particle models. SIAM Theory of Probability and its Applications, 51(3):459-485, 2007.
[DT03] A. Doucet and V.B. Tadic. Parameter estimation in general state-space models using particle methods. Annals of the Institute of Statistical Mathematics, 2003.
[FLM03] J. Fichoud, F. LeGland, and L. Mevel. Particle-based methods for parameter estimation and tracking: numerical experiments. Technical Report 1604, IRISA, 2003.
[Gla91] P. Glasserman. Gradient Estimation via Perturbation Analysis. Kluwer, 1991.
[GSS93] N. Gordon, D. Salmond, and A. F. M. Smith. Novel approach to nonlinear and non-Gaussian Bayesian state estimation. In Proceedings IEE-F, volume 140, pages 107-113, 1993.
[Kit96] G. Kitagawa. Monte-Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5:1-25, 1996.
[KY97] H. J. Kushner and G. Yin. Stochastic Approximation Algorithms and Applications. Springer-Verlag, Berlin and New York, 1997.
[LM00] F. LeGland and L. Mevel. Exponential forgetting and geometric ergodicity in hidden Markov models. Mathematics of Control, Signals, and Systems, 13:63-93, 2000.
[ME07] R. Mamon and R.J. Elliott. Hidden Markov Models in Finance. International Series in Operations Research and Management Science, 104, 2007.
[Pap07] A. Papavasiliou. A uniformly convergent adaptive particle filter. Journal of Applied Probability, 42(4):1053-1068, 2007.
[PDS05] G. Poyadjis, A. Doucet, and S.S. Singh. Particle methods for optimal filter derivative: Application to parameter estimation. In IEEE ICASSP, 2005.
[Pfl96] G. Pflug. Optimization of Stochastic Models: The Interface Between Simulation and Optimization. Kluwer Academic Publishers, 1996.
[Poy06] G. Poyiadjis. Particle Method for Parameter Estimation in General State Space Models. PhD thesis, University of Cambridge, 2006.
[Sto02] G. Storvik. Particle filters for state-space models with the presence of unknown static parameters. IEEE Transactions on Signal Processing, 50:281-289, 2002.
", "award": [], "sourceid": 286, "authors": [{"given_name": "Pierre-arnaud", "family_name": "Coquelin", "institution": null}, {"given_name": "Romain", "family_name": "Deguest", "institution": null}, {"given_name": "R\u00e9mi", "family_name": "Munos", "institution": null}]}