{"title": "Weighted Likelihood Policy Search with Model Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 2357, "page_last": 2365, "abstract": "Reinforcement learning (RL) methods based on direct policy search (DPS) have been actively discussed to achieve an efficient approach to complicated Markov decision processes (MDPs). Although they have brought much progress in practical applications of RL, there still remains an unsolved problem in DPS related to model selection for the policy. In this paper, we propose a novel DPS method, {\\it weighted likelihood policy search (WLPS)}, where a policy is efficiently learned through the weighted likelihood estimation. WLPS naturally connects DPS to the statistical inference problem and thus various sophisticated techniques in statistics can be applied to DPS problems directly. Hence, by following the idea of the {\\it information criterion}, we develop a new measurement for model comparison in DPS based on the weighted log-likelihood.", "full_text": "Weighted Likelihood Policy Search\n\nwith Model Selection\n\nTsuyoshi Ueno (cid:3)\n\nJapan Science and Technology Agency\nueno@ar.sanken.osaka-u.ac.jp\n\nTakashi Washio\nOsaka University\n\nKohei Hayashi\n\nUniversity of Tokyo\n\nhayashi.kohei@gmail.com\n\nYoshinobu Kawahara\n\nOsaka University\n\nwashio@ar.sanken.osaka-u.ac.jp\n\nkawahara@ar.sanken.osaka-u.ac.jp\n\nAbstract\n\nReinforcement learning (RL) methods based on direct policy search (DPS) have\nbeen actively discussed to achieve an ef\ufb01cient approach to complicated Markov\ndecision processes (MDPs). Although they have brought much progress in prac-\ntical applications of RL, there still remains an unsolved problem in DPS related\nto model selection for the policy. In this paper, we propose a novel DPS method,\nweighted likelihood policy search (WLPS), where a policy is ef\ufb01ciently learned\nthrough the weighted likelihood estimation. 
WLPS naturally connects DPS to statistical inference, and thus various sophisticated techniques in statistics can be applied directly to DPS problems. Hence, following the idea of the information criterion, we develop a new measure for model comparison in DPS based on the weighted log-likelihood.

1 Introduction

In the last decade, several direct policy search (DPS) methods have been developed in the field of reinforcement learning (RL) [1, 2, 3, 4, 5, 6, 7, 8, 9] and have been successfully applied to practical decision-making applications [5, 7, 9]. Unlike classical approaches [10], DPS characterizes a policy as a parametric model and searches for parameters that maximize the expected reward within the given model space. Hence, if one employs a model with a reasonable number of degrees of freedom (DoF), DPS can find a reasonable policy efficiently even when the target decision-making task itself has a huge number of DoF. The development of an efficient model selection methodology for the policy is therefore crucial in RL research.

In this paper, we propose weighted likelihood policy search (WLPS): an efficient iterative policy search algorithm that allows us to select an appropriate model automatically from a set of candidate models. To this end, we first introduce a log-likelihood function weighted by the discounted sum of future rewards as the cost function for DPS. In WLPS, the policy parameters are updated by iteratively maximizing the weighted log-likelihood of the obtained sample sequence. A key property of WLPS is that maximizing the weighted log-likelihood corresponds to maximizing a lower bound of the expected reward; WLPS is thus guaranteed to increase the expected reward monotonically at each iteration, and can be shown to converge to the same solution as expectation-maximization policy search (EMPS) [1, 4, 9].
In this way, our framework gives a natural connection between DPS and the statistical inference problem of maximum likelihood estimation. One benefit of this approach is that we can directly apply the information criterion scheme [11, 12], a familiar tool in statistics, to the weighted log-likelihood. This enables us to construct a model selection strategy for the policy by comparing information criteria based on the weighted log-likelihood.

The contribution of this paper is summarized as follows:

1. We prove that each update of the policy parameters resulting from maximization of the weighted log-likelihood has consistency and asymptotic normality, properties which had not yet been elucidated in DPS, and converges to the same solution as EMPS.

2. We introduce a prior distribution on the policy parameter and analyze the asymptotic behavior of the marginal weighted likelihood obtained by marginalizing out the policy parameter. We then propose a measure of the goodness of the policy model based on the posterior probability of the model, in a similar way to the Bayesian information criterion [12].

The rest of the paper is organized as follows. We first give a formulation of the DPS problem in RL, and a short overview of EMPS (Section 2). Next, we present our new DPS framework, WLPS, and investigate its theoretical aspects (Section 3). In addition, we construct the model selection strategy for the policy (Section 4). Finally, we present a statistical interpretation of WLPS and discuss future directions of study (Section 5).

* https://sites.google.com/site/tsuyoshiueno/

Related Works Several approaches have been proposed for the model selection problem in the estimation of a state-action value function [13, 14]. [14] derived PAC-Bayesian bounds for estimating state-action value functions.
[13] developed a complexity-regularization-based model selection algorithm from the viewpoint of minimizing the Bellman error, and investigated its theoretical aspects. Although these studies allow us to select a good model for a state-value function with theoretical support, they cannot be applied directly to model selection for DPS. [15] developed a model selection strategy for DPS that reuses past observed sample sequences through importance-weighted cross-validation (IWCV). However, IWCV incurs heavy computational costs and suffers from numerical instability when estimating the importance sampling weights.

Recently, a number of studies have reformulated stochastic optimal control and RL as the minimization of a Kullback–Leibler (KL) divergence [16, 17, 18]. Our approach is closely related to these works; in fact, WLPS can also be interpreted as minimizing the reverse form of the KL divergence used in [16, 17, 18].

2 Preliminary: EMPS

We consider discrete-time infinite-horizon Markov decision processes (MDPs), defined as the quadruple (X, U, p, r): X ⊆ R^{d_x} is a state space; U ⊆ R^{d_u} is an action space; p(x′|x, u) is a stationary transition distribution to the next state x′ when taking action u at state x; and r : X × U → R₊ is a positive reward received with the state transition. Let π_θ(u|x) := p(u|x; θ) be the stochastic parametrized policy followed by the agent, where the m-dimensional vector θ ∈ Θ, Θ ⊆ R^m, is an adjustable parameter. Given initial state x₁ and parameter vector θ, the joint distribution of the sample sequence {x_{2:n}, u_{1:n}} of the MDP is described as

p_θ(x_{2:n}, u_{1:n} | x₁) = π_θ(u₁|x₁) ∏_{i=2}^n p(x_i | x_{i−1}, u_{i−1}) π_θ(u_i | x_i).   (1)

We further impose the following assumptions on MDPs.
Assumption 1. For any θ ∈ Θ, the MDP given by Eq. (1) is aperiodic and Harris recurrent. Hence, MDP (1) is ergodic and has a unique invariant stationary distribution μ_θ(x) for any θ ∈ Θ [19].

Assumption 2. For any x ∈ X and u ∈ U, the reward r(x, u) is uniformly bounded.

Assumption 3. The policy π_θ(u|x) is three times continuously differentiable with respect to the parameter θ.

The general goal of DPS is to find an optimal policy parameter θ* ∈ Θ that maximizes the expected reward defined by

η(θ) := lim_{n→∞} η_n(θ), where η_n(θ) := ∫∫ p_θ(x_{2:n}, u_{1:n} | x₁) R_n dx_{2:n} du_{1:n},   (2)

and R_n := R_n(x_{1:n}, u_{1:n}) = (1/n) ∑_{i=1}^n r(x_i, u_i). In general, direct maximization of objective function (2) forces us to solve a highly nonlinear, non-convex optimization problem. Thus, instead of maximizing Eq. (2), many DPS methods maximize a lower bound on Eq. (2), which may be much more tractable than the original objective function. Lemma 1 shows that there is a tight lower bound on objective function (2).

Lemma 1 [1, 4, 9]. The following inequality holds for any distribution q(x_{2:n}, u_{1:n} | x₁) and all n:

ln η_n(θ) ≥ F_n(q, θ) := ∫∫ q(x_{2:n}, u_{1:n} | x₁) ln [ p_θ(x_{2:n}, u_{1:n} | x₁) R_n / q(x_{2:n}, u_{1:n} | x₁) ] dx_{2:n} du_{1:n}.   (3)

The equality holds if q(x_{2:n}, u_{1:n} | x₁) is a maximizer of F_n(q, θ) for some θ, i.e., q* = argmax_q F_n(q, θ), which is satisfied when q*(x_{2:n}, u_{1:n} | x₁) ∝ p_θ(x_{2:n}, u_{1:n} | x₁) R_n.

The proof is given in Section 1 of the supporting material. Lemma 1 leads to an effective iterative algorithm, the so-called EMPS, which breaks the potentially difficult maximization problem for the expected reward into two stages: 1) evaluation of the path distribution q*_{θ′}(x_{2:n}, u_{1:n} | x₁) ∝ p_{θ′}(x_{2:n}, u_{1:n} | x₁) R_n at the current policy parameter θ′, and 2) maximization of F_n(q*_{θ′}, θ) with respect to the parameter θ. This EMPS cycle is guaranteed to increase the value of the expected reward unless the parameters already correspond to a local maximum [1, 4, 9].

Taking the partial derivative of the policy with respect to the parameter θ, a new parameter vector θ̃ that maximizes F_n(q*_{θ′}, θ) is found by solving the following equation:

∫∫ p_{θ′}(x_{2:n}, u_{1:n} | x₁) ( ∑_{i=1}^n ψ_{θ̃}(x_i, u_i) ) R_n dx_{2:n} du_{1:n} = 0,   (4)

where ψ : X × U × Θ → R^m denotes the partial derivative of the logarithm of the policy with respect to the parameter θ, i.e., ψ_θ(x, u) := (∂/∂θ) ln π_θ(u|x).

Note that if the state transition distribution p(x′|x, u) is known, we can easily derive the solution of Eq. (4) analytically or numerically. However, p(x′|x, u) is generally unknown, and identifying this distribution is a non-trivial problem in large-scale applications. Thus, it is desirable to solve Eq. (4) in a model-free way, i.e., to estimate the parameter from the sample sequences alone, instead of using p(x′|x, u). Although many variants of model-free EMPS algorithms [4, 6, 9, 15] have hitherto been developed, their fundamental theoretical properties, such as consistency and asymptotic normality at each iteration, have not yet been elucidated.
Moreover, it is not even obvious whether they have such desirable statistical properties.

3 Proposed framework: WLPS

In this section, we newly introduce a weighted likelihood as the objective function for DPS (Definition 1), and derive the WLPS algorithm, executed by iterating two steps: evaluation and maximization of the weighted log-likelihood function (Algorithm 1). Then, in Section 3.2, we show that WLPS is guaranteed to increase the expected reward at each iteration, and to converge to the same solution as EMPS (Theorem 1).

3.1 Overview of WLPS

In this study, we consider the following weighted likelihood function.

Definition 1. Suppose that, given initial state x₁, a random sequence {x_{2:n}, u_{1:n}} is generated from the model p_{θ′}(x_{2:n}, u_{1:n} | x₁) of the MDP. Then, we define a weighted likelihood function p̂_{θ′,θ}(x_{2:n}, u_{1:n} | x₁) and a weighted log-likelihood function L^{θ′}_n(θ), respectively, as

p̂_{θ′,θ}(x_{2:n}, u_{1:n} | x₁) := π_θ(u₁|x₁)^{Q^β_1} ∏_{i=2}^n π_θ(u_i|x_i)^{Q^β_i} p(x_i | x_{i−1}, u_{i−1}),   (5)

L^{θ′}_n(θ) := ln p̂_{θ′,θ}(x_{2:n}, u_{1:n} | x₁) = ∑_{i=1}^n Q^β_i ln π_θ(u_i|x_i) + ∑_{i=2}^n ln p(x_i | x_{i−1}, u_{i−1}),   (6)

where Q^β_i is the discounted sum of the future rewards, given by Q^β_i := ∑_{j=i}^n β^{j−i} r(x_j, u_j), and β is a discount factor such that β ∈ [0, 1).

Now, let us consider the maximization of weighted log-likelihood function (6).
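Before deriving the estimator formally, the objects in Definition 1 can be made concrete in a few lines. The sketch below is illustrative, not the authors' implementation: it assumes a Gaussian policy π_θ(u|x) = N(u | θ⊤φ(x), σ²) with fixed σ, for which maximizing ∑_i Q^β_i ln π_θ(u_i|x_i) over θ reduces to a weighted least-squares problem.

```python
import numpy as np

def discounted_weights(rewards, beta):
    """Q^beta_i = sum_{j>=i} beta^(j-i) r_j, computed by a backward recursion."""
    q = np.zeros_like(rewards, dtype=float)
    acc = 0.0
    for i in range(len(rewards) - 1, -1, -1):
        acc = rewards[i] + beta * acc
        q[i] = acc
    return q

def wlps_step(phi, u, rewards, beta):
    """One maximization of the weighted log-likelihood for a Gaussian policy
    u ~ N(theta^T phi(x), sigma^2) with fixed sigma.

    Maximizing sum_i Q_i * log pi_theta(u_i | x_i) in theta is then a
    weighted least-squares problem (sigma^2 drops out of the argmax).
    phi: (n, m) feature matrix, u: (n,) actions, rewards: (n,) rewards.
    """
    q = discounted_weights(np.asarray(rewards, dtype=float), beta)
    w = np.sqrt(np.maximum(q, 0.0))  # the paper assumes r > 0, so Q_i >= 0
    theta, *_ = np.linalg.lstsq(phi * w[:, None], u * w, rcond=None)
    return theta
```

Iterating `wlps_step` on fresh trajectories drawn from the updated policy is precisely the loop formalized as Algorithm 1 below.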
Taking the partial derivative of weighted log-likelihood (6) with respect to the parameter θ, we obtain the maximum weighted log-likelihood estimator θ̂_n := θ̂(x_{1:n}, u_{1:n}) as a solution of the following estimating equation:

G^{θ′}_n(θ̂_n) := ∑_{i=1}^n ψ_{θ̂_n}(x_i, u_i) Q^β_i = ∑_{i=1}^n ∑_{j=i}^n β^{j−i} ψ_{θ̂_n}(x_i, u_i) r(x_j, u_j) = 0.   (7)

Note that when the policy π_θ is modeled by an exponential family, estimating equation (7) can easily be solved analytically or numerically using convex optimization techniques. In WLPS, the update of the policy parameter is performed by evaluating estimating equation (7) and iteratively finding the estimator θ̂_n from this equation. Algorithm 1 gives an outline of the WLPS procedure.

Algorithm 1 (WLPS).

1. Generate a sample sequence {x_{1:n}, u_{1:n}} by employing the current policy parameter θ, and evaluate estimating equation (7).

2. Find a new estimator by solving estimating equation (7) and check for convergence. If convergence is not satisfied, return to step 1.

It should be noted that WLPS is guaranteed to monotonically increase the expected reward η(θ) and, under certain conditions, to converge asymptotically to the same solution as EMPS, given by Eq. (4). In the next subsection, we discuss why WLPS satisfies such desirable statistical properties.

3.2 Convergence of WLPS

To begin with, we show consistency and asymptotic normality of the estimator θ̂_n given by Eq. (7) when β is any constant between 0 and 1. To this end, we first introduce the notion of uniform mixing, which plays an important role when discussing statistical properties of stochastic processes [19]. The definition of uniform mixing is given below.

Definition 2.
Let {Y_i : i = ..., −1, 0, 1, ...} be a strictly stationary process on a probability space (Ω, F, P), and let F^m_k be the σ-algebra generated by {Y_k, ..., Y_m}. Then the process {Y_i} is said to be uniformly mixing (φ-mixing) if φ(s) → 0 as s → ∞, where

φ(s) := sup_{A ∈ F^k_{−∞}, B ∈ F^∞_{k+s}, P(A) ≠ 0} |P(B|A) − P(B)|.

The function φ(s) is called the mixing coefficient, and if the mixing coefficient converges to zero exponentially fast, i.e., there exist constants D > 0 and ρ ∈ [0, 1) such that φ(s) < Dρ^s, then the stochastic process is called geometrically uniformly mixing. Note that if a stochastic process is a strictly stationary, ergodic, finite-state Markov process, it satisfies the geometric uniform mixing condition [19].

Now, we impose certain conditions for proving the consistency and asymptotic normality of the estimator θ̂_n, summarized as follows.

Assumption 4. For any θ ∈ Θ, the MDP p_θ(x_{2:n}, u_{1:n} | x₁) is geometrically uniformly mixing.

Assumption 5. For any x ∈ X, u ∈ U, and θ ∈ Θ, the function ψ_θ(x, u) is uniformly bounded.

Assumption 6. For any θ ∈ Θ, there exists a parameter value θ̄ ∈ Θ such that

E^{π_θ}_{x₁∼μ_θ}[ ψ_{θ̄}(x₁, u₁) ∑_{j=1}^∞ β^{j−1} r(x_j, u_j) ] = 0,

where E^{π_θ}_{x₁∼μ_θ}[·] denotes the expectation over {x_{2:∞}, u_{1:∞}} with respect to the distribution lim_{n→∞} μ_θ(x₁) π_θ(u₁|x₁) ∏_{i=2}^n p(x_i | x_{i−1}, u_{i−1}) π_θ(u_i | x_i).

Assumption 7. For any θ ∈ Θ and ϵ > 0,

sup_{θ′ : |θ′ − θ̄| > ϵ} | E^{π_θ}_{x₁∼μ_θ}[ ψ_{θ′}(x₁, u₁) ∑_{j=1}^∞ β^{j−1} r(x_j, u_j) ] | > 0.

Assumption 8.
For any θ ∈ Θ, the matrix A := A(θ̄) = E^{π_θ}_{x₁∼μ_θ}[ K_{θ̄}(x₁, u₁) ∑_{j=1}^∞ β^{j−1} r(x_j, u_j) ] is invertible, where K_θ(x, u) := ∂_θ ψ_θ(x, u) = (∂²/∂θ∂θ⊤) ln π_θ(u|x).

Under the conditions given in Assumptions 1-7, the estimator θ̂_n converges to θ̄ in probability, as shown in the following lemma.

Lemma 2. Suppose that, given initial state x₁, a random sequence {x_{2:n}, u_{1:n}} is generated from the model {p_θ(x_{2:n}, u_{1:n} | x₁) | θ} of the MDP. If Assumptions 1-7 are satisfied, then the estimator θ̂_n given by estimating equation (7) is consistent, i.e., θ̂_n converges to the parameter θ̄ in probability.

The proof is given in Section 2 of the supporting material. Note that if the policy is characterized as an exponential family, we can replace Assumption 7 with Assumption 8 to prove the result in Lemma 3. Next, we show the asymptotic convergence rate of the estimator given a consistent estimator. Lemma 3 shows that the estimator converges at the rate O_p(n^{−1/2}).

Lemma 3. Suppose that, given initial state x₁, a random sequence {x_{2:n}, u_{1:n}} is generated from the model p_{θ′}(x_{2:n}, u_{1:n} | x₁), and that Assumptions 1-6 and 8 are satisfied. If the estimator θ̂_n given by estimating equation (7) converges to θ̄ in probability, then we have

√n (θ̂_n − θ̄) = −(1/√n) A^{−1} ∑_{i=1}^n ∑_{j=i}^n β^{j−i} ψ_{θ̄}(x_i, u_i) r(x_j, u_j) + o_p(1).   (9)

Furthermore, the right-hand side of Eq. (9) converges to a Gaussian distribution whose mean and covariance are, respectively, zero and A^{−1} Σ (A^{−1})⊤, where Σ := Σ(θ̄) = Γ₁(θ̄) + ∑_{i=2}^∞ Γ_i(θ̄) + ∑_{j=2}^∞ Γ_j(θ̄)⊤. Here, Γ_i(θ̄) := E^{π_{θ′}}_{x₁∼μ_{θ′}}[ ( ∑_{j=1}^∞ β^{j−1} r(x_j, u_j) ) ( ∑_{j′=1+i}^∞ β^{j′−1} r(x_{j′}, u_{j′}) ) ψ_{θ̄}(x₁, u₁) ψ_{θ̄}(x_i, u_i)⊤ ].

The proof is given in Section 3 of the supporting material.

Now we consider the relation between WLPS and EMPS. The following theorem shows that the estimator θ̂_n given by Eq. (7) converges asymptotically to the same solution as that of EMPS when taking the limit of β to 1.

Theorem 1. Suppose that Assumptions 1-7 are satisfied. If β approaches 1 from below, WLPS leads to the same solution as EMPS, given by Eq. (4), as n → ∞.¹

Proof. We introduce the following supporting lemma.

Lemma 4. Suppose that Assumptions 1-6 are satisfied. Then the partial derivative of the lower bound with q*_{θ′} satisfies

lim_{n→∞} (∂/∂θ) F_n(q*_{θ′}, θ) = lim_{β→1⁻} E^{π_{θ′}}_{x₁∼μ_{θ′}}[ ψ_θ(x₁, u₁) ∑_{j=1}^∞ β^{j−1} r(x_j, u_j) ],

where β → 1⁻ denotes that β converges to 1 from below.

The proof is given in Section 4 of the supporting material. From the results in Lemmas 2 and 4, it is obvious that the estimator θ̂_n given by Eq. (7) converges to the same solution as that of EMPS as β → 1 from below.

Theorem 1 implies that WLPS monotonically increases the expected reward. It should be emphasized that WLPS provides us with an important insight into DPS: the parameter update of EMPS can be interpreted as a well-studied maximum (weighted) likelihood estimation problem. This allows us to naturally apply various sophisticated techniques for model selection, which are well established in statistics, to DPS. In the next section, we discuss model selection for the policy π_θ(u|x).

4 Model selection with WLPS

Common model selection strategies are carried out by comparing candidate models, which are specified in advance, using a criterion that evaluates the goodness of fit of the model estimated from the obtained samples. Since the motivation of RL is to maximize the expected reward given in (2), it would be natural to seek an appropriate model for the policy by computing some reasonable measure of the expected reward from the sample sequences. However, since different policy models give different generative models for the sample sequences, we would need to obtain new sample sequences to evaluate such a measure each time the model is changed. Therefore, a model selection strategy based directly on the expected reward would be hopelessly inefficient.

¹ In practice, the constant β is set to an arbitrary value close to one.
If we can analyze the finite-sample behavior of the expected reward with the WLPS estimator, we may obtain a better estimator by finding an optimal β in the sense of maximizing the expected reward. Several recent studies have tackled finite-sample analysis for RL based on statistical learning theory [20, 21]; these works might provide insights into the finite-sample analysis of WLPS.

Instead, to develop a tractable model selection, we focus on the weighted likelihood given by Eq. (5). As mentioned before, the policy with the maximum weighted log-likelihood asymptotically attains the maximum of the lower bound of the expected reward. Moreover, since the weighted likelihood is defined under a certain fixed generative process for the sample sequences, unlike the expected reward it can be evaluated on a single set of sample sequences even when the model is changed. These observations lead to the fact that if we can choose a good model from the candidates in the sense of the weighted likelihood at each iteration of WLPS, we can realize an efficient DPS algorithm with model selection that achieves a monotonic increase in the expected reward.

In this study, we develop a criterion for choosing a suitable model by following the analogy of the Bayesian information criterion (BIC) [12], which is designed through asymptotic analysis of the posterior probability of the models given the data. Let M₁, M₂, ..., M_k be k candidate policy models, and assume that each model M_j is characterized by a parametric policy π_{θ_j}(u|x) and a prior distribution p(θ_j|M_j) of the policy parameter.
Also, define the marginal weighted likelihood of the j-th candidate model, p̂_{θ′,j}(x_{2:n}, u_{1:n} | x₁), as

p̂_{θ′,j}(x_{2:n}, u_{1:n} | x₁) := ∫ π_{θ_j}(u₁|x₁)^{Q^β_1} ∏_{i=2}^n π_{θ_j}(u_i|x_i)^{Q^β_i} p(x_i | x_{i−1}, u_{i−1}) p(θ_j|M_j) dθ_j.   (10)

In a similar manner to the BIC, we now consider the posterior probability of the j-th model given the sample sequence by introducing the prior probability p(M_j) of the j-th model. From the generalized Bayes' rule, the posterior distribution of the j-th model is given by

p(M_j | x_{1:n}, u_{1:n}) := p̂_{θ′,j}(x_{2:n}, u_{1:n} | x₁) p(M_j) / ∑_{j′=1}^k p̂_{θ′,j′}(x_{2:n}, u_{1:n} | x₁) p(M_{j′}),   (11)

and in our model selection strategy we adopt the model with the largest posterior probability.

For notational simplicity, in the following discussion we omit the subscript indexing the models. Assuming that the prior probability is uniform over all models, the model with the maximum posterior probability corresponds to that with the maximum marginal weighted likelihood. The behavior of the marginal weighted likelihood can be evaluated when the integrand of marginal weighted likelihood (10) is concentrated in a neighborhood of the weighted log-likelihood estimator given by estimating equation (7), as described in the following theorem.

Theorem 2. Suppose that, given an initial state x₁, a random sequence {x_{2:n}, u_{1:n}} is generated from the model p_{θ′}(x_{2:n}, u_{1:n} | x₁) of the MDP, and that Assumptions 1-3 and 5 are satisfied. If the following conditions

(a) The estimator θ̂_n given by Eq.
(7) converges to θ̄ at the rate O_p(n^{−1/2}).

(b) The prior distribution p(θ|M) satisfies p(θ̂_n|M) = O_p(1).

(c) The matrix A(θ̄) := E^{π_{θ′}}_{x₁∼μ_{θ′}}[ K_{θ̄}(x₁, u₁) ∑_{j=1}^∞ β^{j−1} r(x_j, u_j) ] is invertible.

(d) For any x ∈ X, u ∈ U, and θ ∈ Θ, K_θ(x, u) is uniformly bounded.

are satisfied, then the log marginal weighted likelihood can be calculated as

ln p̂_{θ′}(x_{2:n}, u_{1:n} | x₁) = L^{θ′}_n(θ̂_n) − (1/2) m ln n + O_p(1),

where, recall, m denotes the dimension of the model (policy parameter).

The proof is given in Section 5 of the supporting material. Note that the term ∑_{i=2}^n ln p(x_i | x_{i−1}, u_{i−1}) in L^{θ′}_n(θ̂_n) does not depend on the model. Therefore, when evaluating the posterior probability of the model, it is sufficient to compute the following model selection criterion:

IC = ∑_{i=1}^n Q^β_i ln π_{θ̂_n}(u_i|x_i) − (1/2) m ln n.   (12)

As can be seen, this model selection criterion consists of two terms: the first is the weighted log-likelihood of the policy, and the second is a penalty term that penalizes highly complex models. Also, since the first term grows faster than the second, this criterion asymptotically selects the model with the maximum weighted log-likelihood. Algorithm 2 describes the algorithm flow of WLPS including the model selection strategy.

Algorithm 2 (WLPS with model selection).

1. Generate a sample sequence {x_{1:n}, u_{1:n}} by employing the current policy parameter θ.

2. For all models, find the estimator θ̂_n by solving estimating equation (7) and evaluate model selection criterion (12).

3.
Choose the best model based on model selection criterion (12) and check for convergence. If convergence is not satisfied, return to step 1.

Empirical Example We evaluated the performance of the proposed model selection method using a simple one-dimensional linear quadratic Gaussian (LQG) problem. This problem is known to be sufficiently difficult for an empirical evaluation, while remaining analytically solvable. In this problem, we characterized the state transition distribution p(x_i | x_{i−1}, u_{i−1}) as a Gaussian distribution N(x_i | x̄_i, σ) with mean x̄_i = x_{i−1} + u_{i−1} and variance σ = 0.5². The reward function was set to the quadratic function r(x_i, u_i) = −x_i² − u_i² + c, where c is a positive scalar value that prevents the reward r(x, u) from being negative. The control signal u_i was generated from a Gaussian distribution N(u_i | ū_i, σ′) with mean ū_i and variance σ′ = 0.5. We used a linear model with polynomial basis functions, defined as ū_i = ∑_{j=1}^k θ_j x_i^j + θ₀, where k is the order of the polynomial. Note that, in this LQG setting, the optimal controller can be represented as a linear model, i.e., the optimal policy can be obtained when the order of the polynomial is selected as k = 1.

Figure 1: Distribution of the order k selected by our model selection criterion (left bar) and the weighted likelihood (right bar).

In this experiment, we validated whether the proposed model selection method can detect the true order of the polynomial. To illustrate how our proposed model selection criterion works, we compared the performance of the proposed model selection method with a naive method based on the weighted log-likelihood (6).
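To show how criterion (12) operates on this setup, the sketch below simulates the LQG system just described and scores polynomial policies of order k = 1, ..., 5. It is a simplified, hypothetical stand-in for the actual experimental code: the offset c = 10, the fixed data-collecting policy ū = −0.5x, and solving Eq. (7) by weighted least squares (valid for this fixed-variance Gaussian policy) are all assumptions of the sketch, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, n=200, sigma_x=0.5, sigma_u=0.5, c=10.0):
    """Simulate the LQG system: x' = x + u + noise, r = -x^2 - u^2 + c.

    c = 10 is an illustrative choice keeping the reward positive in most draws.
    theta parametrizes the polynomial mean action via features [1, x, x^2, ...].
    """
    xs, us, rs = [], [], []
    x = 0.0
    feats = lambda x: np.array([x ** j for j in range(len(theta))])
    for _ in range(n):
        u = rng.normal(feats(x) @ theta, sigma_u)
        xs.append(x); us.append(u); rs.append(-x ** 2 - u ** 2 + c)
        x = rng.normal(x + u, sigma_x)
    return np.array(xs), np.array(us), np.array(rs)

def criterion(xs, us, rs, k, beta=0.99, sigma_u=0.5):
    """IC of Eq. (12) for a polynomial policy of order k (m = k + 1 parameters)."""
    n = len(xs)
    q = np.zeros(n); acc = 0.0
    for i in range(n - 1, -1, -1):            # Q_i = sum_{j>=i} beta^(j-i) r_j
        acc = rs[i] + beta * acc
        q[i] = acc
    phi = np.vander(xs, k + 1, increasing=True)   # columns 1, x, ..., x^k
    w = np.sqrt(np.maximum(q, 0.0))
    theta, *_ = np.linalg.lstsq(phi * w[:, None], us * w, rcond=None)
    loglik = (-0.5 * ((us - phi @ theta) / sigma_u) ** 2
              - np.log(sigma_u) - 0.5 * np.log(2 * np.pi))
    return np.sum(q * loglik) - 0.5 * (k + 1) * np.log(n)

xs, us, rs = rollout(np.array([0.0, -0.5]))   # data from a stabilizing linear policy
scores = {k: criterion(xs, us, rs, k) for k in (1, 2, 3, 4, 5)}
```

Selecting argmax_k of `scores` then mirrors step 3 of Algorithm 2; on a single short trajectory the winner can vary, which is why the experiment reports a distribution over many trials.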
The weighted-log-likelihood-based selection, similarly to the proposed method, was performed by computing the weighted log-likelihood score (6) for all candidate models and selecting the model with the maximum score among the candidates.

Figure 1 shows the distributions of the polynomial orders k, from first to fifth order, selected in the learned policies by the weighted log-likelihood and by our model selection criterion. The distributions were obtained over 1000 random trials. Each learning process consisted of 200 iterations of WLPS, each of which used 200 samples generated by the current policy. The discount factor β was set to 0.99. As shown in Figure 1, with the proposed method the peak of the selected orders was located at the true order k = 1. With the weighted log-likelihood method, on the other hand, the distribution of the orders had two peaks, at k = 1 and k = 4. This result indicates that the penalty term in our model selection criterion worked well.

5 Discussion

In this study, we have discussed the DPS problem in the framework of weighted likelihood estimation. We introduced a weighted likelihood function as the objective function of DPS, and proposed an incremental algorithm, WLPS, based on iterated maximum weighted log-likelihood estimation. WLPS has desirable theoretical properties, namely, consistency, asymptotic normality, and a monotonic increase in the expected reward at each iteration. Furthermore, we have constructed a model selection strategy based on the posterior probability of the model given a sample sequence, through asymptotic analysis of the marginal weighted likelihood.

The WLPS framework has the potential to bring new theoretical insight to DPS and to yield more efficient algorithms based on such theoretical considerations.
In the rest of this paper, we summarize some key issues that need to be addressed in future research.

5.1 Statistical interpretation of model-free and model-based WLPS

One of the important open issues in RL is how to combine model-free and model-based approaches with theoretical support. To this end, it is necessary to clarify the difference between model-based and model-free approaches in a theoretical sense. WLPS provides us with an interesting insight into the relation between model-free and model-based DPS from the viewpoint of statistics.

We begin by introducing the model-based WLPS method. Let us specify the state transition distribution p(x′|x, u) as a parametric model p_κ(x′|x, u) := p(x′|x, u; κ), where κ is an m′-dimensional parameter vector. Assuming p_κ(x′|x, u) and taking the partial derivative of the log weighted likelihood (6), we obtain the estimating equation for the parameter κ:

∑_{i=2}^n ξ_{κ̂_n}(x_{i−1}, u_{i−1}, x_i) = 0,   (13)

where ξ_κ(x, u, x′) is the partial derivative of the logarithm of the state transition model p_κ(x′|x, u) with respect to κ. As can be seen, estimating equation (13) corresponds to the likelihood equation, i.e., the estimator κ̂_n = κ̂_n(x_{1:n}, u_{1:n−1}) given by (13) is the maximum likelihood estimator. This observation indicates that the weighted likelihood integrates two different objective functions: one for learning the policy π_θ(u|x), and the other for the state predictor, p_κ(x′|x, u).
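As a concrete instance of estimating equation (13), consider a (hypothetical) linear-Gaussian transition model p_κ(x′|x, u) = N(x′ | a·x + b·u, s²) with κ = (a, b, s²); its likelihood equation is solved in closed form by least squares plus the residual variance. The sketch below is illustrative and not from the paper:

```python
import numpy as np

def fit_transition_mle(xs, us):
    """Maximum likelihood fit of p_kappa(x'|x,u) = N(x' | a*x + b*u, s2).

    Solving the likelihood equation (the analogue of Eq. (13)) for this
    Gaussian family reduces to least squares plus a residual variance.
    xs: states x_1..x_n, us: actions u_1..u_{n-1} (one fewer than states).
    """
    X = np.column_stack([xs[:-1], us])   # regressors (x_{i-1}, u_{i-1})
    y = xs[1:]                           # targets x_i
    (a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.mean((y - X @ np.array([a, b])) ** 2)
    return a, b, s2
```

When the model family contains the true transition dynamics (the well-specified case discussed below), this is the usual consistent maximum likelihood estimator.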
Having obtained the estimator $\hat{\kappa}_n$ from estimating equation (13), model-based WLPS estimates the policy parameter by finding the solution $\check{\theta}_n := \check{\theta}(x_{1:n}, u_{1:n})$ of the following estimating equation:

$$\int\!\!\int p_{\theta',\hat{\kappa}_n}(x_{2:n}, u_{1:n} \mid x_1) \left\{ \sum_{i=1}^{n} \sum_{j=i}^{n} \beta^{\,j-i}\, \partial_\theta \log \pi_\theta(u_i \mid x_i)\big|_{\theta=\check{\theta}_n}\, r(x_j, u_j) \right\} dx_{2:n}\, du_{1:n} = 0. \qquad (14)$$

Note that estimating equation (14) is derived by taking the integral of Eq. (7) over the sample sequence $\{x_{2:n}, u_{1:n}\}$ under the currently estimated model $p_{\theta',\hat{\kappa}_n}(x_{2:n}, u_{1:n} \mid x_1)$. Thus, model-based WLPS converges to the same parameter as model-free WLPS if the model $p_\kappa(x'|x,u)$ is well specified².
We now consider a general treatment of model-free and model-based WLPS from a statistical viewpoint. Model-based WLPS fully specifies the weighted likelihood by using parametric policy and parametric state transition models, and estimates all the parameters that appear in the parametric weighted likelihood. Hence, model-based WLPS can be framed as a parametric statistical inference problem. Meanwhile, model-free WLPS only partially specifies the weighted likelihood, through the parametric policy model alone. This can be seen as a semiparametric statistical model [22, 23], which includes not only the parameters of interest but also additional nuisance parameters with possibly infinite DoF: only the policy is modeled parametrically, and the unspecified remainder corresponds to the nuisance parameters. Therefore, model-free WLPS can be framed as a semiparametric statistical inference problem, and the difference between model-based and model-free WLPS can be interpreted as the difference between parametric and semiparametric statistical inference.
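The model-free counterpart of (14) sets the empirical reward-weighted policy score to zero on the observed samples rather than integrating over a fitted model. A minimal sketch, assuming a scalar linear-Gaussian policy $\pi_\theta(u|x) = \mathcal{N}(\theta x, \sigma^2)$ (an illustrative parameterization, not the paper's), shows that the root of this estimating equation is a weighted least-squares estimator, at which the weighted score vanishes exactly:

```python
import numpy as np

def policy_score(theta, x, u, sigma2=1.0):
    """Score d/dtheta log N(u | theta*x, sigma2) for a scalar
    linear-Gaussian policy (illustrative choice)."""
    return (u - theta * x) * x / sigma2

def weighted_score_sum(theta, xs, us, w):
    """Empirical estimating function: sum_i w_i * score_theta(x_i, u_i),
    where w_i stands for the discounted sum of future rewards."""
    return np.sum(w * policy_score(theta, xs, us))

def weighted_mle(xs, us, w):
    """Root of the estimating equation in closed form
    (weighted least squares for the linear-Gaussian policy)."""
    return np.sum(w * xs * us) / np.sum(w * xs ** 2)
```

This makes the statistical reading concrete: the model-free update solves the estimating equation using only the policy model, leaving the transition distribution unspecified, exactly as in semiparametric inference.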
The theoretical aspects of both parametric and semiparametric inference have been actively investigated, and several approaches for combining their estimators have been proposed [23, 24, 25]. By following these works, one could develop a novel hybrid DPS algorithm that combines model-free and model-based WLPS with desirable statistical properties.
5.2 Variance reduction technique for WLPS
In order to learn the policy quickly, it is necessary to find estimators that reduce the estimation variance of the policy parameters in DPS. Although variance reduction techniques have been proposed for DPS [26, 27, 28], these employ indirect approaches: instead of considering the estimation variance of the policy parameters themselves, they reduce the estimation variance of the moments needed to learn the policy parameters. Unfortunately, such indirect techniques do not guarantee a decrease in the estimation variance of the policy parameters, so it is desirable to develop a direct approach that can evaluate and reduce this variance.
As stated above, model-free WLPS can be interpreted as a semiparametric statistical inference problem. This interpretation allows us to apply the estimating function method [22, 23], which is well established in semiparametric statistics, directly to WLPS. The estimating function method is a powerful tool for designing consistent estimators and for evaluating the estimation variance of parameters in a semiparametric inference problem.
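As a concrete instance of the indirect approaches cited above (not the estimating-function construction advocated here), the standard baseline trick from the policy-gradient literature subtracts a constant $b$ from the rewards before weighting the score; the least-squares-optimal $b$ provably cannot increase the second moment of the weighted score terms. A minimal sketch, with hypothetical function names:

```python
import numpy as np

def optimal_baseline(scores, rewards):
    """b* minimizing sum_i (score_i * (r_i - b))**2 over b; this is the
    least-squares projection of score*r onto score. Subtracting a constant
    baseline leaves the estimator unbiased when E[score] = 0."""
    s2 = scores ** 2
    return np.sum(s2 * rewards) / np.sum(s2)

def second_moment(scores, rewards, b=0.0):
    """Second moment of the baseline-adjusted weighted score terms."""
    return np.sum((scores * (rewards - b)) ** 2)
```

By construction this controls only the spread of the gradient/moment estimates, not the estimation variance of the policy parameters themselves, which is precisely the gap the estimating-function approach is meant to close.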
The advantages of considering estimating functions are the ability 1) to characterize the entire set of consistent estimators, and 2) to find, within that set, the optimal estimator with the minimum parameter estimation variance [23, 29]. Therefore, by applying this to WLPS, we could characterize the entire set of estimators that maximize the expected reward without identifying the state transition distribution, and find the optimal estimator with the minimum estimation variance.

² In the following discussion, in order to clarify the difference between the model-free and model-based treatments, we refer to the original WLPS as model-free WLPS.

References
[1] P. Dayan and G. Hinton, "Using expectation-maximization for reinforcement learning," Neural Computation, vol. 9, no. 2, pp. 271–278, 1997.
[2] J. Baxter and P. L. Bartlett, "Infinite-horizon policy-gradient estimation," Journal of Artificial Intelligence Research, vol. 15, no. 4, pp. 319–350, 2001.
[3] V. R. Konda and J. N. Tsitsiklis, "On actor-critic algorithms," SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2003.
[4] J. Peters and S. Schaal, "Reinforcement learning by reward-weighted regression for operational space control," in Proceedings of the 24th International Conference on Machine Learning, 2007.
[5] J. Peters and S. Schaal, "Natural actor-critic," Neurocomputing, vol. 71, no. 7-9, pp. 1180–1190, 2008.
[6] N. Vlassis, M. Toussaint, G. Kontes, and S. Piperidis, "Learning model-free robot control by a Monte Carlo EM algorithm," Autonomous Robots, vol. 27, no. 2, pp. 123–130, 2009.
[7] E. Theodorou, J. Buchli, and S. Schaal, "A generalized path integral control approach to reinforcement learning," Journal of Machine Learning Research, vol. 11, pp. 3137–3181, 2010.
[8] J. Peters, K. Mülling, and Y.
Altün, "Relative entropy policy search," in Proceedings of the 24th National Conference on Artificial Intelligence, 2010.
[9] J. Kober and J. Peters, "Policy search for motor primitives in robotics," Machine Learning, vol. 84, no. 1-2, pp. 171–203, 2011.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[11] H. Akaike, "A new look at the statistical model identification," IEEE Transactions on Automatic Control, vol. 19, no. 6, pp. 716–723, 1974.
[12] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[13] A. Farahmand and C. Szepesvári, "Model selection in reinforcement learning," Machine Learning, pp. 1–34, 2011.
[14] M. M. Fard and J. Pineau, "PAC-Bayesian model selection for reinforcement learning," in Advances in Neural Information Processing Systems 22, 2010.
[15] H. Hachiya, J. Peters, and M. Sugiyama, "Reward-weighted regression with sample reuse for direct policy search in reinforcement learning," Neural Computation, vol. 23, no. 11, pp. 2798–2832, 2011.
[16] M. G. Azar and H. J. Kappen, "Dynamic policy programming," Tech. Rep. arXiv:1004.202, 2010.
[17] H. Kappen, V. Gómez, and M. Opper, "Optimal control as a graphical model inference problem," Machine Learning, pp. 1–24, 2012.
[18] K. Rawlik, M. Toussaint, and S. Vijayakumar, "On stochastic optimal control and reinforcement learning by approximate inference," in International Conference on Robotics Science and Systems, 2012.
[19] R. C. Bradley, "Basic properties of strong mixing conditions. A survey and some open questions," Probability Surveys, vol. 2, pp. 107–144, 2005.
[20] R. Munos and C.
Szepesvári, "Finite-time bounds for fitted value iteration," Journal of Machine Learning Research, vol. 9, pp. 815–857, 2008.
[21] A. Lazaric, M. Ghavamzadeh, and R. Munos, "Finite-sample analysis of least-squares policy iteration," Journal of Machine Learning Research, vol. 13, pp. 3041–3074, 2012.
[22] V. P. Godambe, Ed., Estimating Functions. Oxford University Press, 1991.
[23] S. Amari and M. Kawanabe, "Information geometry of estimating functions in semi-parametric statistical models," Bernoulli, vol. 3, no. 1, pp. 29–54, 1997.
[24] P. J. Bickel, C. A. Klaassen, Y. Ritov, and J. A. Wellner, Efficient and Adaptive Estimation for Semiparametric Models. Springer, 1998.
[25] G. Bouchard and B. Triggs, "The tradeoff between generative and discriminative classifiers," in Proceedings of the 16th IASC International Symposium on Computational Statistics, 2004, pp. 721–728.
[26] E. Greensmith, P. L. Bartlett, and J. Baxter, "Variance reduction techniques for gradient estimates in reinforcement learning," Journal of Machine Learning Research, vol. 5, pp. 1471–1530, 2004.
[27] R. Munos, "Geometric variance reduction in Markov chains: application to value function and gradient estimation," Journal of Machine Learning Research, vol. 7, pp. 413–427, 2006.
[28] T. Zhao, H. Hachiya, G. Niu, and M. Sugiyama, "Analysis and improvement of policy gradient estimation," Neural Networks, 2011.
[29] T. Ueno, S. Maeda, M. Kawanabe, and S. Ishii, "Generalized TD learning," Journal of Machine Learning Research, vol. 12, pp.
1977–2020, 2011.