{"title": "Difference of Convex Functions Programming for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2519, "page_last": 2527, "abstract": "Large Markov Decision Processes (MDPs) are usually solved using Approximate Dynamic Programming (ADP) methods such as Approximate Value Iteration (AVI) or Approximate Policy Iteration (API). The main contribution of this paper is to show that, alternatively, the optimal state-action value function can be estimated using Difference of Convex functions (DC) Programming. To do so, we study the minimization of a norm of the Optimal Bellman Residual (OBR) $T^*Q-Q$, where $T^*$ is the so-called optimal Bellman operator. Controlling this residual allows controlling the distance to the optimal action-value function, and we show that minimizing an empirical norm of the OBR is consistant in the Vapnik sense. Finally, we frame this optimization problem as a DC program. That allows envisioning using the large related literature on DC Programming to address the Reinforcement Leaning (RL) problem.", "full_text": "Di\ufb00erence of Convex Functions Programming\n\nfor Reinforcement Learning\n\nBilal Piot1,2, Matthieu Geist1, Olivier Pietquin2,3\n\n1MaLIS research group (SUPELEC) - UMI 2958 (GeorgiaTech-CNRS), France\n\n2LIFL (UMR 8022 CNRS/Lille 1) - SequeL team, Lille, France\n\n3 University Lille 1 - IUF (Institut Universitaire de France), France\n\nbilal.piot@lifl.fr, matthieu.geist@supelec.fr, olivier.pietquin@univ-lille1.fr\n\nAbstract\n\nLarge Markov Decision Processes are usually solved using Approximate Dy-\nnamic Programming methods such as Approximate Value Iteration or Ap-\nproximate Policy Iteration. The main contribution of this paper is to show\nthat, alternatively, the optimal state-action value function can be estimated\nusing Di\ufb00erence of Convex functions (DC) Programming. 
To do so, we study the minimization of a norm of the Optimal Bellman Residual (OBR) $T^*Q - Q$, where $T^*$ is the so-called optimal Bellman operator. Controlling this residual allows controlling the distance to the optimal action-value function, and we show that minimizing an empirical norm of the OBR is consistent in the Vapnik sense. Finally, we frame this optimization problem as a DC program. That allows envisioning the use of the large related literature on DC Programming to address the Reinforcement Learning problem.

1 Introduction

This paper addresses the problem of solving large state-space Markov Decision Processes (MDPs) [16] in an infinite time horizon and discounted reward setting. The classical methods to tackle this problem, such as Approximate Value Iteration (AVI) or Approximate Policy Iteration (API) [6, 16] (footnote 1), are derived from Dynamic Programming (DP). Here, we propose an alternative path. The idea is to search directly for a function $Q$ for which $T^*Q \approx Q$, where $T^*$ is the optimal Bellman operator, by minimizing a norm of the Optimal Bellman Residual (OBR) $T^*Q - Q$. First, in Sec. 2.2, we show that OBR Minimization (OBRM) is interesting, as it can serve as a proxy for the estimation of the optimal action-value function. Then, in Sec. 3, we prove that minimizing an empirical norm of the OBR is consistent in the Vapnik sense (this justifies working with sampled transitions). However, this empirical norm of the OBR is not convex. We hypothesize that this is why this approach is not studied in the literature (as far as we know), a notable exception being the work of Baird [5]. Therefore, our main contribution, presented in Sec. 4, is to show that this minimization can be framed as a minimization of a Difference of Convex functions (DC) [11].
Thus, a large literature on Difference of Convex functions Algorithms (DCA) [19, 20] (a rather standard approach to non-convex programming) is available to solve our problem. Finally, in Sec. 5, we conduct a generic experiment that compares a naive implementation of our approach to API and AVI methods, showing that it is competitive.

Footnote 1: Other methods, such as Approximate Linear Programming (ALP) [7, 8] or Dynamic Policy Programming (DPP) [4], address the same problem. Yet, they also rely on DP.

2 Background

2.1 MDP and ADP

Before describing the framework of MDPs in the infinite-time horizon and discounted reward setting, we give some general notations. Let $(\mathbb{R}, |.|)$ be the real space with its canonical norm and $X$ a finite set; $\mathbb{R}^X$ is the set of functions from $X$ to $\mathbb{R}$. The set of probability distributions over $X$ is noted $\Delta_X$. Let $Y$ be a finite set; $\Delta_X^Y$ is the set of functions from $Y$ to $\Delta_X$. Let $\alpha \in \mathbb{R}^X$, $p \ge 1$ and $\nu \in \Delta_X$; we define the $L_{p,\nu}$-semi-norm of $\alpha$, noted $\|\alpha\|_{p,\nu}$, by $\|\alpha\|_{p,\nu} = (\sum_{x \in X} \nu(x)|\alpha(x)|^p)^{1/p}$. In addition, the infinite norm is noted $\|\alpha\|_\infty$ and defined as $\|\alpha\|_\infty = \max_{x \in X} |\alpha(x)|$. Let $v$ be a random variable taking its values in $X$; $v \sim \nu$ means that the probability that $v = x$ is $\nu(x)$.

Now, we provide a brief summary of some of the concepts from the theory of MDPs and ADP [16]. Here, the agent is supposed to act in a finite MDP (footnote 2) represented by a tuple $M = \{S, A, R, P, \gamma\}$, where $S = \{s_i\}_{1\le i\le N_S}$ is the state space, $A = \{a_i\}_{1\le i\le N_A}$ is the action space, $R \in \mathbb{R}^{S\times A}$ is the reward function, $\gamma \in ]0,1[$ is a discount factor and $P \in \Delta_S^{S\times A}$ is the Markovian dynamics, which gives the probability $P(s'|s,a)$ to reach $s'$ by choosing action $a$ in state $s$.
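For concreteness, such a finite MDP tuple $M = \{S, A, R, P, \gamma\}$ can be stored with plain Python containers. This is a minimal sketch with a toy two-state MDP of our own; the names (`is_valid_dynamics`, etc.) are illustrative and not from the paper.

```python
# Toy finite MDP: R[s][a] is the reward, P[s][a][s2] the probability of
# reaching s2 when taking action a in state s, gamma the discount factor.
n_states, n_actions = 2, 2
gamma = 0.9
R = [[1.0, 0.0],
     [0.0, 0.5]]
P = [[[1.0, 0.0], [0.0, 1.0]],   # dynamics from state 0
     [[0.5, 0.5], [0.0, 1.0]]]   # dynamics from state 1

def is_valid_dynamics(P, tol=1e-12):
    """Check that P(.|s, a) is a probability distribution for every (s, a)."""
    return all(abs(sum(row) - 1.0) <= tol and min(row) >= 0.0
               for per_state in P for row in per_state)
```

Each row `P[s][a]` must sum to one, which `is_valid_dynamics` verifies; this toy MDP is reused in the sketches below in spirit only, each of which is self-contained.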
A policy $\pi$ is an element of $A^S$ and defines the behavior of an agent. The quality of a policy $\pi$ is defined by the action-value function. For a given policy $\pi$, the action-value function $Q^\pi \in \mathbb{R}^{S\times A}$ is defined as $Q^\pi(s,a) = E_\pi[\sum_{t=0}^{+\infty} \gamma^t R(s_t, a_t)]$, where $E_\pi$ is the expectation over the distribution of the admissible trajectories $(s_0, a_0, s_1, \pi(s_1), \dots)$ obtained by executing the policy $\pi$ starting from $s_0 = s$ and $a_0 = a$. Moreover, the function $Q^* \in \mathbb{R}^{S\times A}$ defined as $Q^* = \max_{\pi \in A^S} Q^\pi$ is called the optimal action-value function. A policy $\pi$ is optimal if $\forall s \in S$, $Q^\pi(s, \pi(s)) = Q^*(s, \pi(s))$. A policy $\pi$ is said to be greedy with respect to a function $Q$ if $\forall s \in S$, $\pi(s) \in \operatorname{argmax}_{a\in A} Q(s,a)$. Greedy policies are important because a policy $\pi$ greedy with respect to $Q^*$ is optimal. In addition, as we work in the finite MDP setting, we define, for each policy $\pi$, the matrix $P_\pi$ of size $N_S N_A \times N_S N_A$ with elements $P_\pi((s,a),(s',a')) = P(s'|s,a) 1_{\{\pi(s')=a'\}}$. Let $\nu \in \Delta_{S\times A}$; we note $\nu P_\pi \in \Delta_{S\times A}$ the distribution such that $(\nu P_\pi)(s,a) = \sum_{(s_0,a_0)\in S\times A} \nu(s_0, a_0) P_\pi((s_0, a_0), (s, a))$.
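The matrix $P_\pi$ and the shifted distribution $\nu P_\pi$ defined above can be sketched as follows; the indexing convention (a state-action pair $(s,a)$ flattened to index `s * nA + a`) and all names are our own choices, not the paper's.

```python
def P_pi_matrix(P, pi, nA):
    """Stochastic matrix with entries P_pi((s,a),(s2,a2)) = P(s2|s,a) * 1{pi(s2)=a2}."""
    nS = len(P)
    n = nS * nA
    M = [[0.0] * n for _ in range(n)]
    for s in range(nS):
        for a in range(nA):
            for s2 in range(nS):
                # all the mass of P(s2|s,a) goes to the action pi picks in s2
                M[s * nA + a][s2 * nA + pi[s2]] = P[s][a][s2]
    return M

def push_forward(nu, M):
    """(nu P_pi)(s,a) = sum over (s0,a0) of nu(s0,a0) * M[(s0,a0)][(s,a)]."""
    n = len(nu)
    return [sum(nu[i] * M[i][j] for i in range(n)) for j in range(n)]

# Toy example: 2 states, 2 actions.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.0, 1.0]]]
pi = [0, 1]                      # policy: action 0 in state 0, action 1 in state 1
nu = [0.25, 0.25, 0.25, 0.25]    # uniform distribution over state-action pairs
nu_next = push_forward(nu, P_pi_matrix(P, pi, 2))
```

Since each row of $P_\pi$ is a probability distribution, $\nu P_\pi$ is again a distribution over state-action pairs, which is what the concentrability coefficients below rely on.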
Finally, $Q^\pi$ and $Q^*$ are known to be fixed points of the contracting operators $T^\pi$ and $T^*$ respectively:

$\forall Q \in \mathbb{R}^{S\times A}, \forall (s,a) \in S\times A, \quad T^\pi Q(s,a) = R(s,a) + \gamma \sum_{s'\in S} P(s'|s,a)\, Q(s', \pi(s')),$

$\forall Q \in \mathbb{R}^{S\times A}, \forall (s,a) \in S\times A, \quad T^* Q(s,a) = R(s,a) + \gamma \sum_{s'\in S} P(s'|s,a) \max_{b\in A} Q(s', b).$

When the state space becomes large, two important problems arise to solve large MDPs. The first one, called the representation problem, is that an exact representation of the values of the action-value functions is impossible, so these functions need to be represented with a moderate number of coefficients. The second problem, called the sample problem, is that there is no direct access to the Bellman operators but only samples from them. One solution for the representation problem is to linearly parameterize the action-value functions thanks to a basis of $d \in \mathbb{N}^*$ functions $\phi = (\phi_i)_{i=1}^d$, where $\phi_i \in \mathbb{R}^{S\times A}$. In addition, we define for each state-action couple $(s,a)$ the vector $\phi(s,a) \in \mathbb{R}^d$ such that $\phi(s,a) = (\phi_i(s,a))_{i=1}^d$. Thus, the action-value functions are characterized by a vector $\theta \in \mathbb{R}^d$ and noted $Q_\theta$:

$\forall \theta \in \mathbb{R}^d, \forall (s,a) \in S\times A, \quad Q_\theta(s,a) = \sum_{i=1}^d \theta_i \phi_i(s,a) = \langle \theta, \phi(s,a)\rangle,$

where $\langle .,. \rangle$ is the canonical dot product of $\mathbb{R}^d$. The usual frameworks to solve large MDPs are for instance AVI and API.
AVI consists in processing a sequence $(Q^{AVI}_{\theta_n})_{n\in\mathbb{N}}$ where $\theta_0 \in \mathbb{R}^d$ and $\forall n \in \mathbb{N}$, $Q^{AVI}_{\theta_{n+1}} \approx T^* Q^{AVI}_{\theta_n}$. API consists in processing two sequences $(Q^{API}_{\theta_n})_{n\in\mathbb{N}}$ and $(\pi^{API}_n)_{n\in\mathbb{N}}$ where $\pi^{API}_0 \in A^S$, $\forall n \in \mathbb{N}$, $Q^{API}_{\theta_n} \approx Q^{\pi_n}$ and $\pi^{API}_{n+1}$ is greedy with respect to $Q^{API}_{\theta_n}$. The approximation steps in AVI and API generate the sequences of errors $(\epsilon^{AVI}_n = T^* Q^{AVI}_{\theta_n} - Q^{AVI}_{\theta_{n+1}})_{n\in\mathbb{N}}$ and $(\epsilon^{API}_n = T^{\pi_n} Q^{API}_{\theta_n} - Q^{API}_{\theta_n})_{n\in\mathbb{N}}$ respectively. Those approximation errors are due to both the representation and the sample problems and can be made explicit for specific implementations of those methods [14, 1]. These ADP methods are legitimated by the following bound [15, 9]:

$\limsup_{n\to\infty} \|Q^* - Q^{\pi^{API\backslash AVI}_n}\|_{p,\nu} \le \frac{2\gamma}{(1-\gamma)^2} C_2(\nu,\mu)^{\frac{1}{p}} \epsilon^{API\backslash AVI}, \qquad (1)$

where $\pi^{API\backslash AVI}_n$ is greedy with respect to $Q^{API\backslash AVI}_{\theta_n}$, $\epsilon^{API\backslash AVI} = \sup_{n\in\mathbb{N}} \|\epsilon^{API\backslash AVI}_n\|_{p,\mu}$, and $C_2(\nu,\mu)$ is a second-order concentrability coefficient, $C_2(\nu,\mu) = (1-\gamma)\sum_{m\ge 1} m\gamma^{m-1} c(m)$, where $c(m) = \max_{\pi_1,\dots,\pi_m,(s,a)\in S\times A} \frac{(\nu P_{\pi_1} P_{\pi_2}\dots P_{\pi_m})(s,a)}{\mu(s,a)}$.

Footnote 2: This work could be easily extended to measurable state spaces as in [9]; we choose the finite case for ease and clarity of exposition.

In the next section, we compare the bound Eq.
(1) with a similar bound derived from the OBR minimization approach in order to justify it.

2.2 Why minimizing the OBR?

The aim of Dynamic Programming (DP) is, given an MDP $M$, to find $Q^*$, which is equivalent to minimizing a certain norm of the OBR, $J_{p,\mu}(Q) = \|T^*Q - Q\|_{p,\mu}$, where $\mu \in \Delta_{S\times A}$ is such that $\forall (s,a) \in S\times A$, $\mu(s,a) > 0$ and $p \ge 1$. Indeed, it is trivial to verify that the only minimizer of $J_{p,\mu}$ is $Q^*$. Moreover, we have the following bound, given by Th. 1.

Theorem 1. Let $\nu \in \Delta_{S\times A}$, $\mu \in \Delta_{S\times A}$, $\hat\pi \in A^S$ and $C_1(\nu,\mu,\hat\pi) \in [1,+\infty[\,\cup\,\{+\infty\}$ the smallest constant verifying $(1-\gamma)\,\nu \sum_{t\ge 0} \gamma^t P^t_{\hat\pi} \le C_1(\nu,\mu,\hat\pi)\,\mu$; then:

$\forall Q \in \mathbb{R}^{S\times A}, \quad \|Q^* - Q^\pi\|_{p,\nu} \le \frac{2}{1-\gamma}\left(\frac{C_1(\nu,\mu,\pi) + C_1(\nu,\mu,\pi^*)}{2}\right)^{\frac{1}{p}} \|T^*Q - Q\|_{p,\mu}, \qquad (2)$

where $\pi$ is greedy with respect to $Q$ and $\pi^*$ is any optimal policy.

Proof. A proof is given in the supplementary file. Similar results exist [15].

In Reinforcement Learning (RL), because of the representation and the sample problems, minimizing $\|T^*Q - Q\|_{p,\mu}$ over $\mathbb{R}^{S\times A}$ is not possible (see Sec. 3 for details), but we can consider that our approach provides us a function $Q$ such that $T^*Q \approx Q$ and define the error $\epsilon_{OBRM} = \|T^*Q - Q\|_{p,\mu}$. Thus, via Eq.
(2), we have:

$\|Q^* - Q^\pi\|_{p,\nu} \le \frac{2}{1-\gamma}\left(\frac{C_1(\nu,\mu,\pi) + C_1(\nu,\mu,\pi^*)}{2}\right)^{\frac{1}{p}} \epsilon_{OBRM}, \qquad (3)$

where $\pi$ is greedy with respect to $Q$. This bound has the same form as the one of API and AVI described in Eq. (1), and Tab. 1 allows comparing them.

Algorithms | Horizon term | Concentrability term | Error term
API\AVI | $\frac{2\gamma}{(1-\gamma)^2}$ | $C_2(\nu,\mu)$ | $\epsilon^{API\backslash AVI}$
OBRM | $\frac{2}{1-\gamma}$ | $\frac{C_1(\nu,\mu,\pi)+C_1(\nu,\mu,\pi^*)}{2}$ | $\epsilon_{OBRM}$

Table 1: Bounds comparison.

This bound has two advantages over API\AVI. First, the horizon term $\frac{2}{1-\gamma}$ is better than the horizon term $\frac{2\gamma}{(1-\gamma)^2}$ as long as $\gamma > 0.5$, which is the usual case. Second, the concentrability term $\frac{C_1(\nu,\mu,\pi)+C_1(\nu,\mu,\pi^*)}{2}$ is considered better than $C_2(\nu,\mu)$, mainly because if $C_2(\nu,\mu) < +\infty$ then $\frac{C_1(\nu,\mu,\pi)+C_1(\nu,\mu,\pi^*)}{2} < +\infty$, the contrary being not true (see [17] for a discussion about the comparison of these concentrability coefficients). Thus, the bound Eq. (3) justifies the minimization of a norm of the OBR, as long as we are able to control the error term $\epsilon_{OBRM}$.

3 Vapnik-Consistency of the empirical norm of the OBR

When the state space is too large, it is not possible to minimize directly $\|T^*Q - Q\|_{p,\mu}$, as we need to compute $T^*Q(s,a)$ for each couple $(s,a)$ (sample problem).
However, we can consider the case where we choose $N$ samples represented by $N$ independent and identically distributed random variables $(S_i, A_i)_{1\le i\le N}$ such that $(S_i,A_i) \sim \mu$, and minimize $\|T^*Q - Q\|_{p,\mu_N}$, where $\mu_N$ is the empirical distribution $\mu_N(s,a) = \frac{1}{N}\sum_{i=1}^N 1_{\{(S_i,A_i)=(s,a)\}}$. An important question (answered below) is whether controlling the empirical norm allows controlling the true norm of interest (consistency in the Vapnik sense [22]), and at what rate convergence occurs.

Computing $\|T^*Q - Q\|_{p,\mu_N} = \left(\frac{1}{N}\sum_{i=1}^N |T^*Q(S_i,A_i) - Q(S_i,A_i)|^p\right)^{1/p}$ is tractable if we consider that we can compute $T^*Q(S_i,A_i)$, which means that we have a perfect knowledge of the dynamics $P$ and that the number of next states for the state-action couple $(S_i,A_i)$ is not too large. In Sec. 4.3, we propose different solutions to evaluate $T^*Q(S_i,A_i)$ when the number of next states is too large or when the dynamics is not provided. Now, the natural question is to what extent minimizing $\|T^*Q - Q\|_{p,\mu_N}$ corresponds to minimizing $\|T^*Q - Q\|_{p,\mu}$. In addition, we cannot minimize $\|T^*Q - Q\|_{p,\mu_N}$ over $\mathbb{R}^{S\times A}$, as this space is too large (representation problem), but over the space $\{Q_\theta \in \mathbb{R}^{S\times A}, \theta \in \mathbb{R}^d\}$. Moreover, as we are looking for a function such that $Q_\theta = Q^*$, we can limit our search to the functions satisfying $\|Q_\theta\|_\infty \le \frac{\|R\|_\infty}{1-\gamma}$. Thus, we search for a function $Q$ in the hypothesis space $\mathcal{Q} = \{Q_\theta \in \mathbb{R}^{S\times A}, \theta \in \mathbb{R}^d, \|Q_\theta\|_\infty \le \frac{\|R\|_\infty}{1-\gamma}\}$, in order to minimize $\|T^*Q - Q\|_{p,\mu_N}$. Let $Q_N \in \operatorname{argmin}_{Q\in\mathcal{Q}} \|T^*Q - Q\|_{p,\mu_N}$ be a minimizer of the empirical norm of the OBR; we want to know to what extent the empirical error $\|T^*Q_N - Q_N\|_{p,\mu_N}$ is related to the real error $\epsilon_{OBRM} = \|T^*Q_N - Q_N\|_{p,\mu}$.
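The empirical OBR norm just defined can be sketched for a linearly parameterized $Q_\theta$, assuming the dynamics $P$ is known so that $T^*$ is computed exactly; the function and variable names, the toy MDP and the tabular features are our own illustration.

```python
def empirical_obr_norm(theta, phi, R, P, gamma, samples, p=2):
    """||T* Q_theta - Q_theta||_{p, mu_N} over sampled pairs (S_i, A_i).

    phi(s, a) returns the feature vector of (s, a) as a list;
    R[s][a] and P[s][a][s2] describe the (known) finite MDP.
    """
    nS, nA = len(R), len(R[0])

    def Q(s, a):
        return sum(t * f for t, f in zip(theta, phi(s, a)))

    total = 0.0
    for s, a in samples:
        # exact optimal Bellman operator: R + gamma * E[max_b Q(s', b)]
        TQ = R[s][a] + gamma * sum(P[s][a][s2] * max(Q(s2, b) for b in range(nA))
                                   for s2 in range(nS))
        total += abs(TQ - Q(s, a)) ** p
    return (total / len(samples)) ** (1.0 / p)

# Tabular features on a 2-state/2-action toy MDP; theta = 0 gives Q_theta = 0,
# so the residual reduces to the empirical mean of |R|^p over the samples.
phi = lambda s, a: [1.0 if (s, a) == (s2, a2) else 0.0
                    for s2 in range(2) for a2 in range(2)]
R = [[1.0, 0.0], [0.0, 0.5]]
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.0, 1.0]]]
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
norm_p1 = empirical_obr_norm([0.0] * 4, phi, R, P, 0.9, samples, p=1)
```

With $\theta = 0$ and $p = 1$ on these samples the norm is simply the average $|R|$, which makes the sketch easy to sanity-check by hand.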
The answer for deterministic finite MDPs lies in Th. 2 (the continuous-stochastic MDP case being discussed shortly after).

Theorem 2. Let $\eta \in ]0,1[$ and $M$ be a finite deterministic MDP. With probability at least $1-\eta$, we have:

$\forall Q \in \mathcal{Q}, \quad \|T^*Q - Q\|^p_{p,\mu} \le \|T^*Q - Q\|^p_{p,\mu_N} + \frac{2\|R\|_\infty}{1-\gamma}\sqrt{\varepsilon(N)},$

where $\varepsilon(N) = \frac{h(\ln(\frac{2N}{h})+1)+\ln(\frac{4}{\eta})}{N}$ and $h = 2N_A(d+1)$. With probability at least $1-2\eta$:

$\epsilon_{OBRM} = \|T^*Q_N - Q_N\|_{p,\mu} \le \left(\epsilon_B + \frac{2\|R\|_\infty}{1-\gamma}\left(\sqrt{\varepsilon(N)} + \sqrt{\frac{\ln(1/\eta)}{2N}}\right)\right)^{\frac{1}{p}},$

where $\epsilon_B = \min_{Q\in\mathcal{Q}} \|T^*Q - Q\|^p_{p,\mu}$ is the error due to the choice of features.

Proof. The complete proof is provided in the supplementary file. It mainly consists in computing the Vapnik-Chervonenkis dimension of the residual.

Thus, if we were able to compute a function such as $Q_N$, we would have, thanks to Eq. (2) and Th. 2:

$\|Q^* - Q^{\pi_N}\|_{p,\nu} \le \frac{2}{1-\gamma}\left(\frac{C_1(\nu,\mu,\pi_N)+C_1(\nu,\mu,\pi^*)}{2}\right)^{\frac{1}{p}}\left(\epsilon_B + \frac{2\|R\|_\infty}{1-\gamma}\left(\sqrt{\varepsilon(N)} + \sqrt{\frac{\ln(1/\eta)}{2N}}\right)\right)^{\frac{1}{p}},$

where $\pi_N$ is greedy with respect to $Q_N$. The error term $\epsilon_{OBRM}$ is explicitly controlled by two terms: $\epsilon_B$, a term of bias, and $\frac{2\|R\|_\infty}{1-\gamma}\left(\sqrt{\varepsilon(N)}+\sqrt{\frac{\ln(1/\eta)}{2N}}\right)$, a term of variance. The term $\epsilon_B = \min_{Q\in\mathcal{Q}} \|T^*Q - Q\|^p_{p,\mu}$ is relative to the representation problem and is fixed by the choice of features.
The term of variance decreases at speed $\sqrt{1/N}$.

A similar bound can be obtained for non-deterministic continuous-state MDPs with a finite number of actions, where the state space is a compact set in a metric space, the features $(\phi_i)_{i=1}^d$ are Lipschitz, and for each state-action couple the next states belong to a ball of fixed radius. The proof is a simple extension of the one given in the supplementary material. Those continuous MDPs are representative of real dynamical systems. Now that we know that minimizing $\|T^*Q - Q\|_{p,\mu_N}$ allows controlling $\|Q^* - Q^{\pi_N}\|_{p,\nu}$, the question is how we frame this optimization problem. Indeed, $\|T^*Q - Q\|_{p,\mu_N}$ is a non-convex and non-differentiable function with respect to $Q$, thus a direct minimization could lead us to bad solutions. In the next section, we propose a method to alleviate those difficulties.

4 Reduction to a DC problem

Here, we frame the minimization of the empirical norm of the OBR as a DC problem and instantiate a general algorithm, DCA [20], that tries to solve it. First, we provide a short introduction to differences of convex functions.

4.1 DC background

Let $E$ be a finite-dimensional Hilbert space and $\langle .,.\rangle_E$, $\|.\|_E$ its dot product and norm respectively. We say that a function $f \in \mathbb{R}^E$ is DC if there exist $g, h \in \mathbb{R}^E$, convex and lower semi-continuous, such that $f = g - h$. The set of DC functions is noted $DC(E)$ and is stable under most of the operations that can be encountered in optimization, contrary to the set of convex functions. Indeed, let $(f_i)_{i=1}^K$ be a sequence of $K \in \mathbb{N}^*$ DC functions and $(\alpha_i)_{i=1}^K \in \mathbb{R}^K$; then $\sum_{i=1}^K \alpha_i f_i$, $\prod_{i=1}^K f_i$, $\min_{1\le i\le K} f_i$, $\max_{1\le i\le K} f_i$ and $|f_i|$ are DC functions [11]. In order to minimize a DC function $f = g - h$, we need to define a notion of differentiability for convex and lower semi-continuous functions.
Let $g$ be such a function and $e \in E$; we define the sub-gradient $\partial_e g$ of $g$ at $e$ as:

$\partial_e g = \{\delta \in E, \forall e' \in E, g(e') \ge g(e) + \langle e' - e, \delta\rangle_E\}.$

For a convex and lower semi-continuous $g \in \mathbb{R}^E$, the sub-gradient $\partial_e g$ is non-empty for all $e \in E$ [11]. This observation leads to a minimization method for functions $f \in DC(E)$, called the Difference of Convex functions Algorithm (DCA). Indeed, as $f$ is DC, we have:

$\forall (e,e') \in E^2, \quad f(e') = g(e') - h(e') \le_{(a)} g(e') - h(e) - \langle e' - e, \delta\rangle_E,$

where $\delta \in \partial_e h$ and inequality (a) is true by definition of the sub-gradient. Thus, for all $e \in E$, the function $f$ is upper-bounded by a function $f_e \in \mathbb{R}^E$ defined for all $e' \in E$ by $f_e(e') = g(e') - h(e) - \langle e' - e, \delta\rangle_E$. The function $f_e$ is convex and lower semi-continuous, as it is the sum of two convex and lower semi-continuous functions, namely $g$ and the linear function $e' \mapsto \langle e - e', \delta\rangle_E - h(e)$. In addition, those functions have the particular property that $\forall e \in E$, $f(e) = f_e(e)$. The set of convex functions $(f_e)_{e\in E}$ that upper-bound the function $f$ plays a key role in DCA.

The algorithm DCA [20] consists in constructing a sequence $(e_n)_{n\in\mathbb{N}}$ such that the sequence $(f(e_n))_{n\in\mathbb{N}}$ decreases. The first step is to choose a starting point $e_0 \in E$; then we minimize the convex function $f_{e_0}$ that upper-bounds the function $f$. We note $e_1$ a minimizer of $f_{e_0}$, $e_1 \in \operatorname{argmin}_{e\in E} f_{e_0}(e)$. This minimization can be realized by any convex optimization solver. As $f(e_0) = f_{e_0}(e_0) \ge f_{e_0}(e_1)$ and $f_{e_0}(e_1) \ge f(e_1)$, then $f(e_0) \ge f(e_1)$. Thus, if we construct the sequence $(e_n)_{n\in\mathbb{N}}$ such that $\forall n \in \mathbb{N}$, $e_{n+1} \in \operatorname{argmin}_{e\in E} f_{e_n}(e)$ and $e_0 \in E$, then we obtain a decreasing sequence $(f(e_n))_{n\in\mathbb{N}}$.
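The DCA scheme above can be sketched generically: linearize $h$ at the current point with a subgradient, then minimize the resulting convex surrogate exactly. The toy objective $f(x) = x^4 - 2x^2$ (with $g(x) = x^4$, $h(x) = 2x^2$) and all names are our own illustration, not the paper's instantiation.

```python
def dca(argmin_surrogate, subgrad_h, e0, n_iters=60):
    """Minimize f = g - h: e_{n+1} in argmin_e g(e) - <e, subgrad_h(e_n)>."""
    e = e0
    for _ in range(n_iters):
        delta = subgrad_h(e)         # delta in the subgradient of h at e_n
        e = argmin_surrogate(delta)  # exact minimization of the convex surrogate
    return e

# g(x) = x^4, h(x) = 2x^2; argmin_x x^4 - delta*x solves 4x^3 = delta.
def cbrt(t):
    return (abs(t) ** (1.0 / 3.0)) * (1.0 if t >= 0 else -1.0)

f = lambda x: x ** 4 - 2 * x ** 2
x_star = dca(lambda delta: cbrt(delta / 4.0), lambda x: 4.0 * x, e0=0.5)
```

Each iteration cannot increase $f$, and on this particular toy function the sequence happens to reach the global minimizer $x = 1$; in general DCA only guarantees a decreasing objective, matching the discussion above.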
Therefore, the algorithm DCA solves a sequence of convex optimization problems in order to solve a DC optimization problem. Three important choices can radically change the DCA performance: the first one is the explicit choice of the decomposition of $f$, the second one is the choice of the starting point $e_0$, and the third one is the choice of the intermediate convex solver. The DCA algorithm hardly guarantees convergence to the global optimum, but it usually provides good solutions. Moreover, it has some nice properties when one of the functions $g$ or $h$ is polyhedral. A function $g$ is said to be polyhedral when $\forall e \in E$, $g(e) = \max_{1\le i\le K}[\langle \alpha_i, e\rangle_E + \beta_i]$, where $(\alpha_i)_{i=1}^K \in E^K$ and $(\beta_i)_{i=1}^K \in \mathbb{R}^K$. If one of the functions $g, h$ is polyhedral, $f$ is bounded below and the DCA sequence $(e_n)_{n\in\mathbb{N}}$ is bounded, then the DCA algorithm converges in finite time to a local minimum. The finite-time aspect is quite interesting in terms of implementation. More details about DC programming and DCA, and even conditions for convergence to the global optimum, are given in [20].

4.2 The OBR minimization framed as a DC problem

A first important result is that for any choice of $p \ge 1$, the OBRM is actually a DC problem.

Theorem 3. Let $J^p_{p,\mu_N}$ be a function from $\mathbb{R}^d$ to the reals, $J^p_{p,\mu_N}(\theta) = \|T^*Q_\theta - Q_\theta\|^p_{p,\mu_N}$; $J^p_{p,\mu_N}(\theta)$ is a DC function when $p \in \mathbb{N}^*$.

Proof. Let us write $J^p_{p,\mu_N}$ as:

$J^p_{p,\mu_N}(\theta) = \frac{1}{N}\sum_{i=1}^N \left|\langle\phi(S_i,A_i),\theta\rangle - R(S_i,A_i) - \gamma\sum_{s'\in S} P(s'|S_i,A_i)\max_{a\in A}\langle\phi(s',a),\theta\rangle\right|^p.$

First, as for each $(S_i,A_i)$ the linear function $\langle\phi(S_i,A_i),.\rangle$ is convex and continuous, the affine function $g_i = \langle\phi(S_i,A_i),.\rangle - R(S_i,A_i)$ is convex and continuous.
Therefore, the function $\max_{a\in A}\langle\phi(s',a),.\rangle$ is also convex and continuous as a finite maximum of convex and continuous functions. In addition, the function $h_i = \gamma\sum_{s'\in S} P(s'|S_i,A_i)\max_{a\in A}\langle\phi(s',a),.\rangle$ is convex and continuous as a positively weighted finite sum of convex and continuous functions. Thus, the function $f_i = g_i - h_i$ is a DC function. As an absolute value of a DC function is DC, a finite product of DC functions is DC and a weighted sum of DC functions is DC, then $J^p_{p,\mu_N} = \frac{1}{N}\sum_{i=1}^N |f_i|^p$ is a DC function.

However, knowing that $J^p_{p,\mu_N}$ is DC is not sufficient in order to use the DCA algorithm. Indeed, we need an explicit decomposition of $J^p_{p,\mu_N}$ as a difference of two convex functions. We present two polyhedral explicit decompositions of $J^p_{p,\mu_N}$, for $p = 1$ and $p = 2$.

Theorem 4. There exist explicit polyhedral decompositions of $J^p_{p,\mu_N}$ when $p = 1$ and when $p = 2$.

For $p = 1$: $J_{1,\mu_N} = G_{1,\mu_N} - H_{1,\mu_N}$, where $G_{1,\mu_N} = \frac{1}{N}\sum_{i=1}^N 2\max(g_i,h_i)$ and $H_{1,\mu_N} = \frac{1}{N}\sum_{i=1}^N (g_i + h_i)$, with $g_i = \langle\phi(S_i,A_i),.\rangle - R(S_i,A_i)$ and $h_i = \gamma\sum_{s'\in S} P(s'|S_i,A_i)\max_{a\in A}\langle\phi(s',a),.\rangle$.

For $p = 2$: $J^2_{2,\mu_N} = G_{2,\mu_N} - H_{2,\mu_N}$, where $G_{2,\mu_N} = \frac{1}{N}\sum_{i=1}^N 2[\bar g_i^2 + \bar h_i^2]$ and $H_{2,\mu_N} = \frac{1}{N}\sum_{i=1}^N [\bar g_i + \bar h_i]^2$, with:

$\bar g_i = \max(g_i,h_i) + g_i - \left(\langle\phi(S_i,A_i) + \gamma\sum_{s'\in S} P(s'|S_i,A_i)\phi(s',a_1), .\rangle - R(S_i,A_i)\right),$

$\bar h_i = \max(g_i,h_i) + h_i - \left(\langle\phi(S_i,A_i) + \gamma\sum_{s'\in S} P(s'|S_i,A_i)\phi(s',a_1), .\rangle - R(S_i,A_i)\right).$

Proof. The proof is provided in the supplementary material.

Unfortunately, there is currently no guarantee that DCA applied to $J^p_{p,\mu_N} = G_{p,\mu_N} - H_{p,\mu_N}$ outputs $Q_N \in \operatorname{argmin}_{Q\in\mathcal{Q}} \|T^*Q - Q\|_{p,\mu_N}$.
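The $p = 1$ decomposition of Theorem 4 rests on the pointwise identity $|g - h| = 2\max(g, h) - (g + h)$; a quick numerical check of that identity (the values standing in for $g_i(\theta)$ and $h_i(\theta)$ are arbitrary numbers of our own choosing):

```python
def dc_decomposition_p1(g_vals, h_vals):
    """Return (G1, H1) built from the convex pieces 2*max(g_i, h_i) and
    g_i + h_i; their difference equals (1/N) * sum_i |g_i - h_i|."""
    N = len(g_vals)
    G1 = sum(2.0 * max(g, h) for g, h in zip(g_vals, h_vals)) / N
    H1 = sum(g + h for g, h in zip(g_vals, h_vals)) / N
    return G1, H1

# Arbitrary stand-ins for g_i(theta), h_i(theta) at a fixed theta.
g_vals = [0.3, -1.2, 2.0, 0.0]
h_vals = [1.0, -0.5, 2.0, -4.0]
G1, H1 = dc_decomposition_p1(g_vals, h_vals)
J1 = sum(abs(g - h) for g, h in zip(g_vals, h_vals)) / len(g_vals)
```

Since each $g_i$ is affine and each $h_i$ is a weighted maximum of affine functions, both $G_{1,\mu_N}$ and $H_{1,\mu_N}$ are polyhedral convex, which is what the finite-time convergence remark above exploits.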
The error between the output $\hat Q_N$ of DCA and $Q_N$ is not studied here, but it is a nice theoretical perspective for future works.

4.3 The batch scenario

Previously, we assumed that it was possible to calculate $T^*Q(s,a) = R(s,a) + \gamma\sum_{s'\in S} P(s'|s,a)\max_{b\in A} Q(s',b)$. However, if the number of next states $s'$ for a given couple $(s,a)$ is too large or if $T^*$ is unknown, this can be intractable. A solution, when we have a simulator, is to generate for each couple $(S_i,A_i)$ a set of $N'$ samples $(S'_{i,j})_{j=1}^{N'}$ and provide an unbiased estimate of $T^*Q(S_i,A_i)$: $\hat T^*Q(S_i,A_i) = R(S_i,A_i) + \gamma\frac{1}{N'}\sum_{j=1}^{N'}\max_{a\in A} Q(S'_{i,j},a)$. Even if $|\hat T^*Q(S_i,A_i) - Q(S_i,A_i)|^p$ is a biased estimator of $|T^*Q(S_i,A_i) - Q(S_i,A_i)|^p$, this bias can be controlled by the number of samples $N'$.

In the case where we do not have such a simulator, but only sampled transitions $(S_i,A_i,S'_i)_{i=1}^N$ (the batch scenario), it is possible to provide an unbiased estimate of $T^*Q(S_i,A_i)$ via $\hat T^*Q(S_i,A_i) = R(S_i,A_i) + \gamma\max_{b\in A} Q(S'_i,b)$. However, in that case, $|\hat T^*Q(S_i,A_i) - Q(S_i,A_i)|^p$ is a biased estimator of $|T^*Q(S_i,A_i) - Q(S_i,A_i)|^p$ and the bias is uncontrolled [2]. In order to alleviate this typical problem of the batch scenario, several techniques have been proposed in the literature to provide a better estimator of $|T^*Q(S_i,A_i) - Q(S_i,A_i)|^p$, such as embeddings in Reproducing Kernel Hilbert Spaces (RKHS) [13] or locally weighted averagers such as Nadaraya-Watson estimators [21]. In both cases, the estimate of $T^*Q(S_i,A_i)$ takes the form $\hat T^*Q(S_i,A_i) = R(S_i,A_i) + \gamma\sum_{j=1}^N \beta_i(S'_j)\max_{a\in A} Q(S'_j,a)$, where $\beta_i(S'_j)$ represents the weight of the sample $S'_j$ in the estimation of $T^*Q(S_i,A_i)$.
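The one-transition batch estimate $\hat T^*Q(S_i,A_i) = R(S_i,A_i) + \gamma\max_b Q(S'_i,b)$ can be sketched as follows; here the reward is stored with each transition for convenience, and all names and the toy data are our own.

```python
def batch_T_star_estimates(Q, transitions, gamma):
    """One-sample estimates of T*Q: r + gamma * max_b Q[s_next][b].

    Q: nested list Q[s][a]; transitions: list of (s, a, r, s_next) tuples.
    """
    return [r + gamma * max(Q[s_next]) for (_s, _a, r, s_next) in transitions]

# Toy tabular Q on 2 states / 2 actions and two observed transitions.
Q = [[1.0, 2.0], [3.0, 0.0]]
estimates = batch_T_star_estimates(Q, [(0, 1, 0.5, 1), (1, 0, -1.0, 0)], 0.9)
```

Each estimate is unbiased for $T^*Q(S_i,A_i)$ itself, but, as the text notes, plugging it into $|\hat T^*Q - Q|^p$ yields an uncontrolled bias when the dynamics are stochastic.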
To obtain an explicit DC decomposition of $\hat J^p_{p,\mu_N}(\theta) = \frac{1}{N}\sum_{i=1}^N |\hat T^*Q_\theta(S_i,A_i) - Q_\theta(S_i,A_i)|^p$ when $p = 1$ or $p = 2$, it is sufficient to replace $\sum_{s'\in S} P(s'|S_i,A_i)\max_{a\in A}\langle\phi(s',a),\theta\rangle$ by $\sum_{j=1}^N \beta_i(S'_j)\max_{a\in A}\langle\phi(S'_j,a),\theta\rangle$ (or by $\frac{1}{N'}\sum_{j=1}^{N'}\max_{a\in A}\langle\phi(S'_{i,j},a),\theta\rangle$ if we have a simulator) in the DC decomposition of $J^p_{p,\mu_N}$.

5 Illustration

This experiment focuses on stationary Garnet problems, which are a class of randomly constructed finite MDPs representative of the kind of finite MDPs that might be encountered in practice [3]. A stationary Garnet problem is characterized by 3 parameters: Garnet($N_S$, $N_A$, $N_B$). The parameters $N_S$ and $N_A$ are the numbers of states and actions respectively, and $N_B$ is a branching factor specifying the number of next states for each state-action pair. Here, we choose a particular type of Garnet which presents a topological structure relative to real dynamical systems and aims at simulating the behavior of smooth continuous-state MDPs (as described in Sec. 3). Those systems are generally MDPs with multi-dimensional state spaces where an action leads to different next states close to each other. The fact that an action leads to close next states can model the noise in a real system, for instance. Thus, problems such as the highway simulator [12], the mountain car or the inverted pendulum (possibly discretized) are particular cases of this type of Garnet. For those particular Garnets, the state space is composed of $d$ dimensions ($d = 2$ in this particular experiment) and each dimension $i$ has a finite number of elements $x_i$ ($x_i = 10$).
So, a state $s = [s_1, s_2, \dots, s_i, \dots, s_d]$ is a $d$-tuple where each component $s_i$ can take a finite value between 1 and $x_i$. In addition, the distance between two states $s, s'$ is $\|s - s'\|^2 = \sum_{i=1}^d (s_i - s'_i)^2$. Thus, we obtain MDPs with a state space of size $\prod_{i=1}^d x_i$. The number of actions is $N_A = 5$. For each state-action couple $(s,a)$, we choose randomly $N_B$ next states ($N_B = 5$) via a $d$-dimensional Gaussian distribution centered in $s$, where the covariance matrix is the identity matrix of size $d$, $I_d$, multiplied by a term $\sigma$ (here $\sigma = 1$). This allows handling the smoothness of the MDP: if $\sigma$ is small, the next states $s'$ are close to $s$, and if $\sigma$ is large, the next states $s'$ can be very far from each other and also from $s$. The probability of going to each next state $s'$ is generated by partitioning the unit interval at $N_B - 1$ cut points selected randomly. For each couple $(s,a)$, the reward $R(s,a)$ is drawn uniformly between $-1$ and $1$. For each Garnet problem, it is possible to compute an optimal policy $\pi^*$ thanks to the policy iteration algorithm.

In this experiment, we construct 50 Garnets $\{G_p\}_{1\le p\le 50}$ as explained before. For each Garnet $G_p$, we build 10 data sets $\{D_{p,q}\}_{1\le q\le 10}$ composed of $N$ sampled transitions $(s_i, a_i, s'_i)_{i=1}^N$ drawn uniformly and independently. Thus, we are in the batch scenario. The minimizations of $J_{1,\mu_N}$ and $J_{2,\mu_N}$ via the DCA algorithm, where the estimation of $T^*Q(s_i,a_i)$ is done via $R(s_i,a_i) + \gamma\max_{b\in A} Q(s'_i,b)$ (so with uncontrolled bias), are called DCA1 and DCA2 respectively. The initialisation of DCA is $\theta_0 = 0$ and the intermediary convex optimization problems are solved by a sub-gradient descent [18]. Those two algorithms are compared with state-of-the-art Reinforcement Learning algorithms, namely LSPI (an API implementation) and Fitted-Q (an AVI implementation). The four algorithms use the tabular basis. Each algorithm outputs a function $Q^{p,q}_A \in \mathbb{R}^{S\times A}$, and the policy associated to $Q^{p,q}_A$ is $\pi^{p,q}_A(s) = \operatorname{argmax}_{a\in A} Q^{p,q}_A(s,a)$.
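A simplified Garnet($N_S$, $N_A$, $N_B$) generator in the spirit of the construction above can be sketched as follows. The Gaussian/topological structure of the paper's variant is omitted here (next states are drawn uniformly), and all names are our own.

```python
import random

def make_garnet(NS, NA, NB, seed=0):
    """Random finite MDP: for each (s, a), NB next states whose probabilities
    come from splitting [0, 1] at NB-1 random cut points; rewards ~ U[-1, 1]."""
    rng = random.Random(seed)
    R = [[rng.uniform(-1.0, 1.0) for _ in range(NA)] for _ in range(NS)]
    P = [[[0.0] * NS for _ in range(NA)] for _ in range(NS)]
    for s in range(NS):
        for a in range(NA):
            next_states = rng.sample(range(NS), NB)
            cuts = sorted(rng.random() for _ in range(NB - 1))
            # interval lengths between consecutive cut points sum to 1
            probs = [hi - lo for lo, hi in zip([0.0] + cuts, cuts + [1.0])]
            for s2, p in zip(next_states, probs):
                P[s][a][s2] = p
    return R, P
```

This yields a valid finite MDP (each `P[s][a]` sums to one with at most $N_B$ non-zero entries), enough to reproduce the flavor of the experimental protocol on a small scale.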
In order to quantify the performance of a given algorithm $A$, we calculate the criterion $T^{p,q}_A = \frac{E_\rho[V^{\pi^*} - V^{\pi^{p,q}_A}]}{E_\rho[|V^{\pi^*}|]}$, where $V^{\pi^{p,q}_A}$ is computed via the policy evaluation algorithm. The mean performance criterion is $T_A = \frac{1}{500}\sum_{p=1}^{50}\sum_{q=1}^{10} T^{p,q}_A$. We also calculate, for each algorithm, the variance criterion $\mathrm{std}^p_A = \sqrt{\frac{1}{10}\sum_{q=1}^{10}(T^{p,q}_A)^2 - \left(\frac{1}{10}\sum_{q=1}^{10} T^{p,q}_A\right)^2}$, and the resulting mean variance criterion is $\mathrm{std}_A = \frac{1}{50}\sum_{p=1}^{50}\mathrm{std}^p_A$. In Fig. 1(a), we plot the performance versus the number of samples. We observe that the 4 algorithms have similar performances, which shows that our alternative approach is competitive. In Fig. 1(b), we plot the standard deviation versus the number of samples. Here, we observe that the DCA algorithms have less variance, which is an advantage. This experiment shows us that DC programming is relevant for RL but still has to prove its efficiency on real problems.

Figure 1: Garnet experiment. (a) Performance. (b) Standard deviation.

6 Conclusion and Perspectives

In this paper, we presented an alternative approach to tackle the problem of solving large MDPs by estimating the optimal action-value function via DC Programming. To do so, we first showed that minimizing a norm of the OBR is interesting. Then, we proved that the empirical norm of the OBR is consistent in the Vapnik sense (strict consistency). Finally, we framed the minimization of the empirical norm as a DC minimization, which allows us to rely on the literature on DCA. We conducted a generic experiment with a basic setting for DCA, as we chose a canonical explicit decomposition of our DC criterion and a sub-gradient descent in order to minimize the intermediary convex minimization problems.
We obtain results similar to those of AVI and API. Thus, an interesting perspective would be to use a less naive setting for DCA, by choosing different explicit decompositions and finding a better convex solver for the intermediary convex minimization problems. Another interesting perspective is that our approach can be made non-parametric. Indeed, as pointed out in [10], a convex minimization problem can be solved via boosting techniques, which avoids the choice of features. Therefore, each intermediary convex problem of DCA could be solved via a boosting technique, making DCA non-parametric. Thus, seeing the RL problem as a DC problem provides some interesting perspectives for future works.

Acknowledgements

The research leading to these results has received partial funding from the European Union Seventh Framework Program (FP7/2007-2013) under grant agreement number 270780 and the ANR ContInt program (MaRDi project, number ANR-12-CORD-021 01). We also would like to thank professors Le Thi Hoai An and Pham Dinh Tao for helpful discussions about DC programming.

References

[1] A. Antos, R. Munos, and C. Szepesvári. Fitted-Q iteration in continuous action-space MDPs. In Proc. of NIPS, 2007.
[2] A. Antos, C. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2008.
[3] T. Archibald, K. McKinnon, and L. Thomas. On the generation of Markov decision processes. Journal of the Operational Research Society, 1995.
[4] M.G. Azar, V. Gómez, and H.J. Kappen. Dynamic policy programming. The Journal of Machine Learning Research, 13(1), 2012.
[5] L. Baird.
Residual algorithms: reinforcement learning with function approximation. In Proc. of ICML, 1995.
[6] D.P. Bertsekas. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 1995.
[7] D.P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51, 2003.
[8] V. Desai, V. Farias, and C.C. Moallemi. A smoothed approximate linear program. In Proc. of NIPS, pages 459-467, 2009.
[9] A. Farahmand, R. Munos, and C. Szepesvári. Error propagation for approximate policy and value iteration. In Proc. of NIPS, 2010.
[10] A. Grubb and J.A. Bagnell. Generalized boosting algorithms for convex optimization. In Proc. of ICML, 2011.
[11] J.B. Hiriart-Urruty. Generalized differentiability, duality and optimization for problems dealing with differences of convex functions. In Convexity and duality in optimization. Springer, 1985.
[12] E. Klein, M. Geist, B. Piot, and O. Pietquin. Inverse reinforcement learning through structured classification. In Proc. of NIPS, 2012.
[13] G. Lever, L. Baldassarre, A. Gretton, M. Pontil, and S. Grünewälder. Modelling transition dynamics in MDPs with RKHS embeddings. In Proc. of ICML, 2012.
[14] O. Maillard, R. Munos, A. Lazaric, and M. Ghavamzadeh. Finite-sample analysis of Bellman residual minimization. In Proc. of ACML, 2010.
[15] R. Munos. Performance bounds in Lp-norm for approximate value iteration. SIAM Journal on Control and Optimization, 2007.
[16] M.L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.
[17] B. Scherrer. Approximate policy iteration schemes: a comparison. In Proc. of ICML, 2014.
[18] N.Z. Shor, K.C. Kiwiel, and A. Ruszczyński. Minimization methods for non-differentiable functions. Springer-Verlag, 1985.
[19] P.D. Tao and L.T.H. An.
Convex analysis approach to DC programming: theory, algorithms and applications. Acta Mathematica Vietnamica, 22:289-355, 1997.
[20] P.D. Tao and L.T.H. An. The DC programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133:23-46, 2005.
[21] G. Taylor and R. Parr. Value function approximation in noisy environments using locally smoothed regularized approximate linear programs. In Proc. of UAI, 2012.
[22] V. Vapnik. Statistical learning theory. Wiley, 1998.