{"title": "Planning in entropy-regularized Markov decision processes and games", "book": "Advances in Neural Information Processing Systems", "page_first": 12404, "page_last": 12413, "abstract": "We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment. SmoothCruiser makes use of the smoothness of the Bellman operator promoted by the regularization to achieve problem-independent sample complexity of order $\tilde{\mathcal{O}}(1/\epsilon^4)$ for a desired accuracy $\epsilon$, whereas for non-regularized settings there are no known algorithms with guaranteed polynomial sample complexity in the worst case.", "full_text": "Planning in entropy-regularized\n\nMarkov decision processes and games\n\nJean-Bastien Grill\u2217\nDeepMind Paris\n\njbgrill@google.com\n\nOmar D. Domingues\u2217\nSequeL team, Inria Lille\n\nomar.darwiche-domingues@inria.fr\n\nPierre M\u00e9nard\n\nSequeL team, Inria Lille\n\npierre.menard@inria.fr\n\nR\u00e9mi Munos\nDeepMind Paris\n\nmunos@google.com\n\nMichal Valko\nDeepMind Paris\n\nvalkom@deepmind.com\n\nAbstract\n\nWe propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment. SmoothCruiser makes use of the smoothness of the Bellman operator promoted by the regularization to achieve problem-independent sample complexity of order $\tilde{\mathcal{O}}(1/\epsilon^4)$ for a desired accuracy $\epsilon$, whereas for non-regularized settings there are no known algorithms with guaranteed polynomial sample complexity in the worst case.\n\n1\n\nIntroduction\n\nPlanning with a generative model is thinking before acting. An agent thinks using a world model that it has built from prior experience [Sutton, 1991, Sutton and Barto, 2018]. 
In the present paper, we study planning in two types of environments, Markov decision processes (MDPs) and two-player turn-based zero-sum games. In both settings, agents interact with an environment by taking actions and receiving rewards. Each action changes the state of the environment and the agent aims to choose actions to maximize the sum of rewards. We assume that we are given a generative model of the environment, which takes as input a state and an action and returns a reward and a next state as output. Such generative models, called oracles, are typically built from known data and involve simulations, for example, a physics simulation. In many cases, simulations are costly. For example, simulations may require the computation of approximate solutions of differential equations or the discretization of continuous state spaces. Therefore, a smart algorithm makes only a small number of oracle calls to estimate the value of a state. The total number of oracle calls made by an algorithm is referred to as sample complexity.\nThe value of a state $s$, denoted by $V(s)$, is the maximum of the sum of discounted rewards that can be obtained from that state. We want an algorithm that returns an estimate of $V(s)$ with precision $\epsilon$ for any fixed $s$ and has a low sample complexity, which should naturally be a function of $\epsilon$. An agent can then use this algorithm to predict the value of the possible actions at any given state and choose the best one. The main advantage of estimating the value of a single given state $s$ at a time instead of the complete value function2 $s \mapsto V(s)$ is that we can have algorithms whose sample complexity does not depend on the size of the state space, which is important when our state space is very large or continuous. 
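For concreteness, such an oracle can be sketched as the following interface. The class, the toy dynamics, and the call counter are our own illustration, not part of the paper; the counter simply tracks the sample complexity defined above.

```python
import random
from typing import Hashable, Tuple

class Oracle:
    """Hypothetical generative model: maps (state, action) to (reward, next state).

    A toy two-state MDP is hard-coded purely for illustration; in practice this
    would wrap a simulator, e.g., a physics engine.
    """

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.n_calls = 0  # total oracle calls = sample complexity

    def __call__(self, state: Hashable, action: int) -> Tuple[float, Hashable]:
        self.n_calls += 1
        reward = 1.0 if (state == "good" and action == 0) else 0.0
        next_state = "good" if self.rng.random() < 0.5 else "bad"
        return reward, next_state

oracle = Oracle()
r, z = oracle("good", 0)  # one sample; oracle.n_calls is now 1
```

An algorithm's efficiency is then measured by how small `oracle.n_calls` is when its estimate of $V(s)$ reaches the desired precision.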
On the other hand, the disadvantage is that the algorithm must be run each time a new state is encountered.\n\u2217equal contribution\n2as done by approximate dynamic programming\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOur main contribution is an algorithm that estimates the value function in a given state in planning problems that satisfy specific smoothness conditions, which is the case when the rewards are regularized by adding an entropy term. We exploit this smoothness property to obtain a polynomial sample complexity of order $\tilde{\mathcal{O}}(1/\epsilon^4)$ that is problem independent.\n\nRelated work Kearns et al. [1999] came up with a sparse sampling algorithm (SSA) for planning in MDPs with finite actions and arbitrary state spaces. SSA estimates the value of a state $s$ by building a sparse look-ahead tree starting from $s$. However, SSA achieves a sample complexity of $\mathcal{O}\big((1/\epsilon)^{\log(1/\epsilon)}\big)$, which is non-polynomial in $1/\epsilon$. SSA is slow since its search is uniform, i.e., it does not select actions adaptively. Walsh et al. [2010] gave an improved version of SSA with adaptive action selection, but its sample complexity is still non-polynomial. The UCT algorithm [Kocsis and Szepesv\u00e1ri, 2006], used for planning in MDPs and games, selects actions based on optimistic estimates of their values and has good empirical performance in several applications. However, the sample complexity of UCT can be worse than exponential in $1/\epsilon$ for some environments, which is mainly due to exploration issues [Coquelin and Munos, 2007]. 
Algorithms with sample complexities of order $\mathcal{O}(1/\epsilon^d)$, where $d$ is a problem-dependent quantity, have been proposed for deterministic dynamics [Hren and Munos, 2008], for an open-loop3 setting [Bubeck and Munos, 2010, Leurent and Maillard, 2019, Bartlett et al., 2019], for a bounded number of next states when a full MDP model is known [Bu\u015foniu and Munos, 2012], for a bounded number of next states in a finite-horizon setting [Feldman and Domshlak, 2014], for a bounded number of next states [Sz\u00f6r\u00e9nyi et al., 2014], and for general MDPs [Grill et al., 2016]. In general, when the state space is infinite and the transitions are stochastic, the problem-dependent quantity $d$ can make the sample complexity guarantees exponential. For a related setting, where rewards are only obtained in the leaves of a fixed tree, Kaufmann and Koolen [2017] and Huang et al. [2017] present algorithms to identify the optimal action in a game based on best-arm identification tools.\nEntropy regularization in MDPs and reinforcement learning has been employed in several commonly used algorithms. In the context of policy gradient algorithms, common examples are the TRPO algorithm [Schulman et al., 2015], which uses the Kullback-Leibler divergence between the current and the updated policy to constrain the gradient step sizes, the A3C algorithm [Mnih et al., 2016], which penalizes policies with low entropy to improve exploration, and the work of Neu et al. [2017], which presents a theoretical framework for entropy regularization using the joint state-action distribution. Formulations with entropy-augmented rewards, which is the case in our work, have been used to learn multi-modal policies to improve exploration and robustness [Haarnoja et al., 2017, 2018] and can also be related to policy gradient methods [Schulman et al., 2017]. Furthermore, Geist et al. [2019] propose a theory of regularized MDPs which includes entropy as a special case. 
Summing up, reinforcement learning practice already knows how to employ entropy regularization; in this work, we aim to provide insight into why it helps.\n\n2 Setting and motivation\nBoth MDPs and two-player games can be formalized as a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P \triangleq \{P(\cdot|s,a)\}_{s,a \in \mathcal{S} \times \mathcal{A}}$ is a set of probability distributions over $\mathcal{S}$, $R : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is a (possibly random) reward function and $\gamma \in [0, 1)$ is the known discount factor. In the MDP case, at each round $t$, an agent is at state $s$, chooses action $a$ and observes a reward $R(s, a)$ and a transition to a next state $z \sim P(\cdot|s, a)$. In the case of turn-based two-player games, there are two agents and, at each round $t$, an agent chooses an action, observes a reward and a transition; at round $t + 1$ it is the other player's turn. This is equivalent to an MDP with an augmented state space $\mathcal{S}^+ \triangleq \mathcal{S} \times \{1, 2\}$ and transition probabilities such that $P((z, j)|(s, i), a) = 0$ if $i = j$. We assume that the action space $\mathcal{A}$ is finite with cardinality $K$ and the state space $\mathcal{S}$ has arbitrary (possibly infinite) cardinality.\nOur objective is to find an algorithm that outputs a good estimate of the value $V(s)$ for any given state $s$ as quickly as possible. An agent can then use this algorithm to choose the best action in an MDP or a game. More precisely, for a state $s \in \mathcal{S}$ and given $\epsilon > 0$ and $\delta > 0$, our goal is to compute an estimate $\hat{V}(s)$ of $V(s)$ such that $\mathbb{P}\big[|\hat{V}(s) - V(s)| > \epsilon\big] \le \delta$ with a small number of oracle calls\n\n3This means that the policy is seen as a function of time, not the states. 
The open-loop setting is particularly adapted to environments with deterministic transitions.\n\nrequired to compute this estimate. In our setting, we consider the case of entropy-regularized MDPs and games, where the objective is augmented with an entropy term.\n\n2.1 Value functions\nMarkov decision process The policy $\pi$ of an agent is a function from $\mathcal{S}$ to $\mathcal{P}(\mathcal{A})$, the set of probability distributions over $\mathcal{A}$. We denote by $\pi(a|s)$ the probability of the agent choosing action $a$ at state $s$. In MDPs, the value function at a state $s$, $V(s)$, is defined as the supremum over all possible policies of the expected sum of discounted rewards obtained starting from $s$, which satisfies the Bellman equations [Puterman, 1994],\n\n$\forall s \in \mathcal{S}, \quad V(s) = \max_{\pi(\cdot|s) \in \mathcal{P}(\mathcal{A})} \mathbb{E}[R(s, a) + \gamma V(z)], \quad a \sim \pi(\cdot|s), \ z \sim P(\cdot|s, a).$ (1)\n\nTwo-player turn-based zero-sum games In this case, there are two agents (1 and 2), each one with its own policy and different goals. If the policy of Agent 2 is fixed, Agent 1 aims to find a policy that maximizes the sum of discounted rewards. Conversely, if the policy of Agent 1 is fixed, Agent 2 aims to find a policy that minimizes this sum. Optimal strategies for both agents can be shown to exist and for any $(s, i) \in \mathcal{S}^+ \triangleq \mathcal{S} \times \{1, 2\}$, the value function is defined as [Hansen et al., 2013]\n\n$V(s, i) \triangleq \begin{cases} \max_{\pi(\cdot|s) \in \mathcal{P}(\mathcal{A})} \mathbb{E}[R((s, i), a) + \gamma V(z, j)], & \text{if } i = 1, \\ \min_{\pi(\cdot|s) \in \mathcal{P}(\mathcal{A})} \mathbb{E}[R((s, i), a) + \gamma V(z, j)], & \text{if } i = 2, \end{cases}$ (2)\n\nwith $a \sim \pi(\cdot|s)$ and $(z, j) \sim P(\cdot|(s, i), a)$. In this case, the function $s \mapsto V(s, i)$ is the optimal value function for Agent $i$ when the other agent follows its optimal strategy.\n\nEntropy-regularized value functions Consider a regularization factor $\lambda > 0$. In the case of MDPs, when rewards are augmented by an entropy term, the value function at state $s$ is given by [Haarnoja et al., 2017, Dai et al., 2018, Geist et al., 2019]\n\n$V(s) \triangleq \max_{\pi(\cdot|s) \in \mathcal{P}(\mathcal{A})} \big\{ \mathbb{E}[R(s, a) + \gamma V(z)] + \lambda \mathcal{H}(\pi(\cdot|s)) \big\} = \lambda \log \sum_{a \in \mathcal{A}} \exp\!\big( \tfrac{1}{\lambda} \mathbb{E}[R(s, a) + \gamma V(z)] \big), \quad a \sim \pi(\cdot|s), \ z \sim P(\cdot|s, a),$ (3)\n\nwhere $\mathcal{H}(\pi(\cdot|s))$ is the entropy of the probability distribution $\pi(\cdot|s) \in \mathcal{P}(\mathcal{A})$.\n\nThe function $\mathrm{LogSumExp}_\lambda : \mathbb{R}^K \to \mathbb{R}$, defined as $\mathrm{LogSumExp}_\lambda(x) \triangleq \lambda \log \sum_{i=1}^K \exp(x_i/\lambda)$, is a smooth approximation of the max function, since $\|\max - \mathrm{LogSumExp}_\lambda\|_\infty \le \lambda \log K$. Similarly, the function $-\mathrm{LogSumExp}_{-\lambda}$ is a smooth approximation of the min function. This allows us to define the regularized version of the value function for turn-based two-player games, in which both players have regularized rewards, by replacing the max and the min in Equation 2 by their smooth approximations.\nFor any state $s$, let $F_s \triangleq \mathrm{LogSumExp}_\lambda$ or $F_s \triangleq -\mathrm{LogSumExp}_{-\lambda}$ depending on $s$. Both for MDPs and games, we can write the entropy-regularized value functions as\n\n$V(s) = F_s(Q_s), \quad \text{with } Q_s(a) \triangleq \mathbb{E}[R(s, a) + \gamma V(z)], \ z \sim P(\cdot|s, a),$ (4)\n\nwhere $Q_s \triangleq (Q_s(a))_{a \in \mathcal{A}}$, the Q function at state $s$, is a vector in $\mathbb{R}^K$ representing the value of each action. The function $F_s$ is the Bellman operator at state $s$, which becomes smooth due to the entropy regularization.\n\nUseful properties Our algorithm exploits the smoothness property of $F_s$ defined above. 
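As a quick numerical sanity check (our own illustration, not from the paper), the max-approximation property of $\mathrm{LogSumExp}_\lambda$ stated above, and the fact that its gradient is a probability vector, can be verified directly:

```python
import math

def log_sum_exp(x, lam):
    """LogSumExp_lambda(x) = lam * log(sum_i exp(x_i / lam)), computed stably."""
    m = max(x)
    return m + lam * math.log(sum(math.exp((xi - m) / lam) for xi in x))

lam, K = 0.5, 4          # arbitrary temperature and number of actions
x = [0.3, 1.2, 0.9, 1.1]  # arbitrary Q-vector

smooth_max = log_sum_exp(x, lam)
gap = smooth_max - max(x)  # satisfies 0 <= gap <= lam * log(K)

# The gradient is the Boltzmann (softmax) distribution with temperature lam,
# so its entries are positive and sum to one:
grad = [math.exp((xi - smooth_max) / lam) for xi in x]
```

An analogous expression, `-log_sum_exp([-xi for xi in x], lam)`, yields the corresponding smooth minimum used for the second player.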
In particular, these functions are $L$-smooth, that is, for any $Q, Q' \in \mathbb{R}^K$, we have\n\n$|F_s(Q) - F_s(Q') - (Q - Q')^\top \nabla F_s(Q')| \le L \|Q - Q'\|_2^2, \quad \text{with } L = 1/\lambda.$ (5)\n\nFurthermore, the functions $F_s$ have two important properties: $\nabla F_s(Q)$4 $\succeq 0$ and $\|\nabla F_s(Q)\|_1 = 1$ for all $Q \in \mathbb{R}^K$. This implies that the gradient $\nabla F_s(Q)$ defines a probability distribution.5\n\n4$\nabla F_s(Q)$ is the gradient of $F_s(Q)$ with respect to $Q$.\n5It is a Boltzmann distribution with temperature $\lambda$.\n\nAssumptions We assume that $\mathcal{S}$, $\mathcal{A}$, $\lambda$, and $\gamma$ are given to the learner. Moreover, we assume that we can access a generative model, the oracle, from which we can get reward and transition samples from arbitrary state-action pairs. Formally, when called with parameter $(s, a) \in \mathcal{S} \times \mathcal{A}$, the oracle outputs a new random variable $(R, Z)$, independent from any other outputs received from the generative model so far, such that $Z \sim P(\cdot|s, a)$ and $R$ has the same distribution as $R(s, a)$. We denote a call to the oracle as $R, Z \leftarrow \mathrm{oracle}(s, a)$.\n\n2.2 Using regularization for the polynomial sample complexity\n\nTo pave the road for SmoothCruiser, we consider two extreme cases, based on the strength of the regularization:\n\n1. Strong regularization In this case, $\lambda \to \infty$ and $L = 0$, that is, $F_s$ is linear for all $s$: $F_s(x) = w_s^\top x$, with $\|w_s\|_1 = 1$, $w_s \in \mathbb{R}^K$ and $w_s \succeq 0$.\n2. No regularization In this case, $\lambda = 0$ and $L \to \infty$, that is, $F_s$ cannot be well approximated by a linear function.6\n\nIn the strongly regularized case, we can approximate the value $V(s)$ with $\tilde{\mathcal{O}}(1/\epsilon^2)$ oracle calls. 
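The Monte-Carlo scheme behind this claim, justified in the next paragraph, can be sketched as follows; the toy MDP, the weight vectors $w_s$, and all constants here are placeholders of our own, not the paper's.

```python
import random

rng = random.Random(0)
GAMMA = 0.9

# Toy stand-ins: two states, two actions, fixed policy weights w_s (||w_s||_1 = 1).
W = {"s0": [0.7, 0.3], "s1": [0.5, 0.5]}

def oracle(state, action):
    """Hypothetical generative model returning (reward, next_state)."""
    reward = 1.0 if (state == "s0" and action == 0) else 0.0
    return reward, rng.choice(["s0", "s1"])

def rollout_value(state, horizon=50):
    """One discounted rollout following the fixed Boltzmann policy w_s."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = rng.choices([0, 1], weights=W[state])[0]
        reward, state = oracle(state, action)
        total += discount * reward
        discount *= GAMMA
    return total

def mc_value(state, n=1000):
    """Monte-Carlo estimate of V(state); precision eps needs n = O(1/eps^2) rollouts."""
    return sum(rollout_value(state) for _ in range(n)) / n

v = mc_value("s0")
```

Because $F_s$ is linear, averaging plain rollouts is unbiased, and by Hoeffding's inequality $n = \mathcal{O}(1/\epsilon^2)$ rollouts suffice for precision $\epsilon$.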
This is due to the linearity of $F_s$, since the value function can be written as $V(s) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t R(S_t, A_t) \mid S_0 = s\big]$, where $A_t$ is distributed according to the probability vector $w_{S_t}$. As a result, $V(s)$ can be estimated by Monte-Carlo sampling of trajectories.\nWith no regularization, we can apply a simple adaptation of the sparse sampling algorithm of Kearns et al. [1999] that we briefly describe. Assume that we have a subroutine that provides an approximation of the value function with precision $\epsilon/\sqrt{\gamma}$, denoted by $\hat{V}_{\epsilon/\sqrt{\gamma}}(s)$, for any $s$. We can call this subroutine several times as well as the oracle to get an improved estimate $\hat{V}$ defined as\n\n$\hat{V}(s) = F_s\big(\hat{Q}_s\big), \quad \text{with} \quad \hat{Q}_s(a) \leftarrow \frac{1}{N} \sum_{i=1}^{N} \big[ r_i(s, a) + \gamma \hat{V}_{\epsilon/\sqrt{\gamma}}(z_i) \big],$\n\nwhere $r_i(s, a)$ and $z_i$ are rewards and next states sampled by calling the oracle with parameters $(s, a)$. By Hoeffding's inequality, we can choose $N = \mathcal{O}(1/\epsilon^2)$ such that $\hat{V}(s)$ is an approximation of $V(s)$ with precision $\epsilon$ with high probability. By applying this idea recursively, we start with $\hat{V} = 0$, which is an approximation of the value function with precision $1/(1 - \gamma)$, and progressively improve the estimates towards a desired precision $\epsilon$, which can be reached at a recursion depth of $H = \mathcal{O}(\log(1/\epsilon))$. Following the same reasoning as Kearns et al. 
[1999], this approach has a sample complexity of $\mathcal{O}\big((1/\epsilon)^{\log(1/\epsilon)}\big)$: to estimate the value at a given recursion depth, we make $\mathcal{O}(1/\epsilon^2)$ recursive calls and stop once we reach the maximum depth, resulting in a sample complexity of\n\n$\underbrace{\frac{1}{\epsilon^2} \times \cdots \times \frac{1}{\epsilon^2}}_{\mathcal{O}(\log(1/\epsilon)) \text{ times}} = \Big(\frac{1}{\epsilon}\Big)^{\mathcal{O}(\log(1/\epsilon))}.$\n\nIn the next section, we provide SmoothCruiser (Algorithm 1), that uses the assumption that the functions $F_s$ are $L$-smooth with $0 < L < \infty$ to interpolate between the two cases above and obtain a sample complexity of $\tilde{\mathcal{O}}(1/\epsilon^4)$.\n\n3 SmoothCruiser\n\nWe now describe our planning algorithm. Its building blocks are two procedures, sampleV (Algorithm 2) and estimateQ (Algorithm 3), that recursively call each other. The procedure sampleV returns a noisy estimate of $V(s)$ with a bias bounded by $\epsilon$. The procedure estimateQ averages the outputs of several calls to sampleV to obtain an estimate $\hat{Q}_s$ that is an approximation of $Q_s$ with precision $\epsilon$ with high probability. Finally, SmoothCruiser calls estimateQ$(s, \epsilon)$ to obtain $\hat{Q}_s$ and outputs $\hat{V}(s) = F_s(\hat{Q}_s)$. Using the assumption that $F_s$ is 1-Lipschitz, we can show that $\hat{V}(s)$ is an approximation of $V(s)$ with precision $\epsilon$. Figure 1 illustrates a call to SmoothCruiser.\n\n6This is the case of the max and min functions.\n\nFigure 1: Visualization of a call to SmoothCruiser$(s_0, \epsilon_0, \delta')$.\n\n3.1 Smooth sailing\n\nThe most important part of the algorithm is the procedure sampleV, that returns a low-bias estimate of the value function. Having the estimate of the value function, the procedure estimateQ averages the outputs of sampleV to obtain a good estimate of the Q function with high probability. 
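A minimal sketch of this mutual recursion is given below. It is our own illustrative reading of Algorithms 1-3, not the paper's implementation: the oracle is a toy stand-in, $N(\epsilon)$ is simplified and capped, and the constants are chosen only so that the toy recursion stays shallow.

```python
import math
import random

rng = random.Random(0)
GAMMA, LAM, K = 0.36, 1.0, 2           # small gamma keeps this toy recursion shallow
L = 1.0 / LAM                          # smoothness constant of F_s
M = LAM * math.log(K)                  # M_lambda = sup_s |F_s(0)|
KAPPA = (1.0 - math.sqrt(GAMMA)) / (K * L)

def f(q):
    """Regularized Bellman operator F_s = LogSumExp_lambda (stable form)."""
    m = max(q)
    return m + LAM * math.log(sum(math.exp((x - m) / LAM) for x in q))

def grad_f(q):
    """Gradient of f: Boltzmann (softmax) distribution with temperature lambda."""
    v = f(q)
    return [math.exp((x - v) / LAM) for x in q]

def oracle(state, action):
    """Hypothetical generative model (toy dynamics, for illustration only)."""
    return (1.0 if action == 0 else 0.5), rng.choice([0, 1])

def estimate_q(state, eps):
    """Average O(1/eps^2) noisy evaluations per action (N(eps) simplified, capped)."""
    n = max(1, min(8, int(1.0 / eps ** 2)))
    q = []
    for a in range(K):
        total = 0.0
        for _ in range(n):
            r, z = oracle(state, a)
            total += r + GAMMA * sample_v(z, eps / math.sqrt(GAMMA))
        q.append(min(max(total / n, 0.0), (1 + M) / (1 - GAMMA)))  # clip
    return q

def sample_v(state, eps):
    if eps >= (1 + M) / (1 - GAMMA):     # values are bounded, so 0 is valid
        return 0.0
    if eps >= KAPPA:                      # sparse-sampling regime
        return f(estimate_q(state, eps))
    # smooth regime: coarse estimate at precision sqrt(kappa * eps), then a
    # first-order correction around it using one sampled action A ~ grad F
    q = estimate_q(state, math.sqrt(KAPPA * eps))
    g = grad_f(q)
    a = rng.choices(range(K), weights=g)[0]
    r, z = oracle(state, a)
    v = sample_v(z, eps / math.sqrt(GAMMA))
    return f(q) - sum(qi * gi for qi, gi in zip(q, g)) + (r + GAMMA * v)

def smooth_cruiser(state, eps):
    return f(estimate_q(state, eps))

v_hat = smooth_cruiser(0, 0.5)
```

Every recursive call increases the requested precision by the factor $1/\sqrt{\gamma} > 1$, which is what guarantees termination in the first branch.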
The main idea of sampleV is to first compute an estimate $\hat{Q}_s$ of precision $\mathcal{O}(\sqrt{\epsilon})$ of the value of each action, $\{\hat{Q}_s(a)\}_{a \in \mathcal{A}}$, in order to linearly approximate the function $F_s$ around $\hat{Q}_s$. The local approximation of $F_s$ around $\hat{Q}_s$ is subsequently used to estimate the value of $s$ with a better precision, of order $\mathcal{O}(\epsilon)$, which is possible due to the smoothness of $F_s$.\n\nAlgorithm 1 SmoothCruiser\nInput: $(s, \epsilon, \delta') \in \mathcal{S} \times \mathbb{R}^+ \times \mathbb{R}^+$\n$M_\lambda \leftarrow \sup_{s \in \mathcal{S}} |F_s(0)| = \lambda \log K$\n$\kappa \leftarrow (1 - \sqrt{\gamma})/(KL)$\nSet $\delta'$, $\kappa$, and $M_\lambda$ as global parameters\n$\hat{Q}_s \leftarrow$ estimateQ$(s, \epsilon)$\nOutput: $F_s\big(\hat{Q}_s\big)$\n\nAlgorithm 2 sampleV\n1: Input: $(s, \epsilon) \in \mathcal{S} \times \mathbb{R}^+$\n2: if $\epsilon \ge (1 + M_\lambda)/(1 - \gamma)$ then\n3: Output: 0\n4: else if $\epsilon \ge \kappa$ then\n5: $\hat{Q}_s \leftarrow$ estimateQ$(s, \epsilon)$\n6: Output: $F_s\big(\hat{Q}_s\big)$\n7: else if $\epsilon < \kappa$ then\n8: $\hat{Q}_s \leftarrow$ estimateQ$(s, \sqrt{\kappa\epsilon})$\n9: $A \leftarrow$ action drawn from $\nabla F_s\big(\hat{Q}_s\big)$\n10: $(R, Z) \leftarrow$ oracle$(s, A)$\n11: $\hat{V} \leftarrow$ sampleV$(Z, \epsilon/\sqrt{\gamma})$\n12: Output: $F_s\big(\hat{Q}_s\big) - \hat{Q}_s^\top \nabla F_s\big(\hat{Q}_s\big) + (R + \gamma \hat{V})$\n13: end if\n\nAlgorithm 3 estimateQ\n1: Input: $(s, \epsilon)$\n2: $N(\epsilon) \leftarrow \Big\lceil \frac{18(1+M_\lambda)^2}{(1-\gamma)^4(1-\sqrt{\gamma})^2} \frac{\log(2K/\delta')}{\epsilon^2} \Big\rceil$\n3: for $a \in \mathcal{A}$ do\n4: $q_i \leftarrow 0$ for $i \in 1, \ldots, N(\epsilon)$\n5: for $i \in 1, \ldots, N(\epsilon)$ do\n6: $(R, Z) \leftarrow$ oracle$(s, a)$\n7: $\hat{V} \leftarrow$ sampleV$\big(Z, \epsilon/\sqrt{\gamma}\big)$\n8: $q_i \leftarrow R + \gamma \hat{V}$\n9: end for\n10: $\hat{Q}_s(a) \leftarrow$ mean$(q_1, \ldots, q_{N(\epsilon)})$\n11: // clip $\hat{Q}_s(a)$ to $[0, (1 + M_\lambda)/(1 - \gamma)]$\n12: $\hat{Q}_s(a) \leftarrow \max\big(0, \hat{Q}_s(a)\big)$\n13: $\hat{Q}_s(a) \leftarrow \min\big((1 + M_\lambda)/(1 - \gamma), \hat{Q}_s(a)\big)$\n14: end for\n15: Output: $\hat{Q}_s$\n\nFor a target accuracy $\epsilon$ at state $s$, sampleV distinguishes three cases, based on a reference threshold $\kappa \triangleq (1 - \sqrt{\gamma})/(KL)$, which is the maximum value of $\epsilon$ for which we can compute a good estimate of the value function using linear approximations of $F_s$.\n\n\u2022 First, if $\epsilon \ge (1 + \lambda \log K)/(1 - \gamma)$, then 0 is a valid output, since $V(s)$ is bounded by $(1 + \lambda \log K)/(1 - \gamma)$. This case furthermore ensures that our algorithm terminates, since the recursive calls are made with increasing values of $\epsilon$.\n\n\u2022 Second, if $\kappa \le \epsilon \le (1 + \lambda \log K)/(1 - \gamma)$, we run $F_s($estimateQ$(s, \epsilon))$, in which, for each action, both the oracle and sampleV are called $\mathcal{O}(1/\epsilon^2)$ times in order to return $\hat{V}(s)$, which is with high probability an $\epsilon$-approximation of $V(s)$.\n\n\u2022 Finally, if $\epsilon < \kappa$, we take advantage of the smoothness of $F_s$ to compute an $\epsilon$-approximation of $V(s)$ in a more efficient way than calling the oracle and sampleV $\mathcal{O}(1/\epsilon^2)$ times. We achieve it by calling estimateQ with a precision $\sqrt{\kappa\epsilon}$ instead of $\epsilon$, which requires $\mathcal{O}(1/\epsilon)$ calls instead.\n\n3.2 Smoothness guarantees an improved sample complexity\n\nIn this part, we describe the key ideas that allow us to exploit the smoothness of the Bellman operator to obtain a better sample complexity. Notice that when $\epsilon < \kappa$, the procedure estimateQ is called to 
Notice that when \u03b5 < \u03ba, the procedure estimateQ is called to\n\nobtain an estimate (cid:98)Qs such that\nThe procedure sampleV then continues with computing a linear approximation of Fs(Qs) around (cid:98)Qs.\n\n(cid:107)(cid:98)Qs \u2212 Qs(cid:107)2 = O(cid:16)(cid:112)\u03b5/L\n\n(cid:17)\n\n.\n\nUsing the L-smoothness of Fs, we guarantee the \u03b5-approximation,\n\nWe wish to output this linear approximation, but we need to handle the fact that the vector Qs (the true\n\nQ-function at s) is unknown. Notice that the vector \u2207Fs((cid:98)Qs) represents a probability distribution.\n\n(cid:111)| \u2264 L(cid:107)(cid:98)Qs \u2212 Qs(cid:107)2\n\nFs((cid:98)Qs) + (Qs \u2212 (cid:98)Qs)T\u2207Fs((cid:98)Qs)\n\n|Fs(Qs) \u2212(cid:110)\n2 = O(\u03b5).\ns\u2207Fs((cid:98)Qs) in the linear approximation of Fs(Qs) above can be expressed as\n(cid:12)(cid:12)(cid:12)(cid:98)Qs\n(cid:105)\ns\u2207Fs((cid:98)Qs) = E(cid:104)\n, with A \u223c \u2207Fs((cid:98)Qs).\ns\u2207Fs((cid:98)Qs) from estimating only Qs(A):\n\u2022 sample action A \u223c \u2207Fs((cid:98)Qs)\n\u2022 obtain an O(\u03b5)-approximation of Qs(A): (cid:101)Q(A) = Rs,A + \u03b3sampleV(cid:0)Zs,A, \u03b5/\n\u2022 output (cid:98)V (s) = Fs((cid:98)Qs) \u2212 (cid:98)QT\n\ns\u2207Fs((cid:98)Qs) + (cid:101)Q(A)\n\nQs(A)\n\n\u2022 call the generative model to sample a reward and a next state Rs,A, Zs,A \u2190 oracle(s, A)\n\nTherefore, we can build a low-bias estimate of QT\n\nWe show that (cid:98)V (s) is an \u03b5-approximation of the true value function V (s). The bene\ufb01t of such\n\n\u03b5) instead of O(\u03b5), which thanks to the\napproach is that we can call estimateQ with a precision O(\nsmoothness of Fs, reduces the sample complexity. 
In particular, one call to sampleV$(s, \epsilon)$ will make $\mathcal{O}(1/\epsilon)$ recursive calls to sampleV$(s, \mathcal{O}(\sqrt{\epsilon}))$, and the total number of calls to sampleV behaves as\n\n$\frac{1}{\epsilon} \times \frac{1}{\epsilon^{1/2}} \times \frac{1}{\epsilon^{1/4}} \times \cdots \le \frac{1}{\epsilon^2},$\n\nsince the exponents $1 + 1/2 + 1/4 + \cdots$ sum to 2. Therefore, the number of sampleV calls made by SmoothCruiser is of order $\mathcal{O}(1/\epsilon^2)$, which implies that the total sample complexity is of $\mathcal{O}(1/\epsilon^4)$.\n\n3.3 Comparison to Monte-Carlo tree search\n\nSeveral planning algorithms are based on Monte-Carlo tree search (MCTS, Coulom, 2007, Kocsis and Szepesv\u00e1ri, 2006). Algorithm 4 gives a template for MCTS, which uses the procedure search that calls selectAction and evaluateLeaf. Algorithm 5, search, returns an estimate of the value function; selectAction selects the action to be executed (also called tree policy); and evaluateLeaf returns an estimate of the value of a leaf.\n\nAlgorithm 4 genericMCTS\nInput: state $s$\nrepeat search$(s, 0)$\nuntil timeout\nOutput: estimate of best action or value.\n\nWe now provide the analogies that make it possible to see SmoothCruiser as an MCTS algorithm:\n\n\u2022 sampleV corresponds to the function search\n\n
Indeed, since our algorithm is agnostic\nto the output of the oracle, it performs the same num-\nber of oracle calls for any given \u03b5 and \u03b4(cid:48), regardless\nof the random outcome of these calls.\nTheorem 1. Let n(\u03b5, \u03b4(cid:48)) be the number of oracle\ncalls before SmoothCruiser terminates. For any state s \u2208 S and \u03b5, \u03b4(cid:48) > 0,\n\nOutput: evaluateLeaf(s)\n\nend if\na \u2190 selectAction(s, d)\nR, Z \u2190 oracle(s, a)\nOutput: R + \u03b3search(Z, d + 1)\n\n(cid:16) c4\n\n(cid:17)(cid:105)log2(c5(log( c2\n\n\u03b4(cid:48) )))\n\n= (cid:101)O\n\n(cid:19)\n\n,\n\n(cid:18) 1\n\n\u03b54\n\n(cid:16) c2\n\n(cid:17)(cid:104)\n\n\u03b4(cid:48)\n\nn(\u03b5, \u03b4(cid:48)) \u2264 c1\n\n\u03b54 log\n\nc3 log\n\n\u03b5\n\nwhere c1, c2, c3, c4, and c5 are constants that depend only on K, L, and \u03b3.\n\nThe proof of Theorem 1 with the exact constants is in the appendix. In Theorem 2, we provide our\nconsistency result, stating that the output of SmoothCruiser applied to a state s \u2208 S is a good\napproximation of V (s) with high probability.\nTheorem 2. 
For any $s \in \mathcal{S}$, $\epsilon > 0$, and $\delta > 0$, there exists a $\delta'$ that depends on $\epsilon$ and $\delta$ such that the output $\hat{V}(s)$ of SmoothCruiser$(s, \epsilon, \delta')$ satisfies\n\n$\mathbb{P}\big[ |\hat{V}(s) - V(s)| > \epsilon \big] \le \delta,$\n\nand such that $n(\epsilon, \delta') = \mathcal{O}(1/\epsilon^{4+c})$ for any $c > 0$.\n\nMore precisely, in the proof of Theorem 2, we establish that\n\n$\mathbb{P}\big[ |\hat{V}(s) - V(s)| > \epsilon \big] \le \delta' n(\epsilon, \delta').$\n\nTherefore, for any parameter $\delta'$ satisfying $\delta' n(\epsilon, \delta') \le \delta$, SmoothCruiser with parameters $\epsilon$ and $\delta'$ provides an approximation of $V(s)$ which is $(\epsilon, \delta)$ correct.\n\nImpact of regularization constant For a regularization constant $\lambda$, the smoothness constant is $L = 1/\lambda$. In Theorem 1 we did not make the dependence on $L$ explicit to preserve simplicity. However, it is easy to analyze the sample complexity in the two limits:\n\nstrong regularization: $L \to 0$ and $F_s$ is linear;\nno regularization: $L \to \infty$ and $F_s$ is not smooth.\n\nAs $L \to 0$, the condition $\kappa \le \epsilon \le (1 + \lambda \log K)/(1 - \gamma)$ will be met less often and eventually the algorithm will sample $N = \mathcal{O}(1/\epsilon^2)$ trajectories, which implies a sample complexity of order $\mathcal{O}(1/\epsilon^2)$. On the other hand, as $L$ goes to $\infty$, the condition $\epsilon < \kappa$ will be met less often and the algorithm eventually runs the uniform sampling strategy of Kearns et al. [1999], which results in a sample complexity of order $\mathcal{O}\big((1/\epsilon)^{\log(1/\epsilon)}\big)$, which is non-polynomial in $1/\epsilon$.\n\nLet $V_\lambda(s)$ be the entropy-regularized value function and $V_0(s)$ be its non-regularized version. 
Since $F_s$ is 1-Lipschitz and $\|\mathrm{LogSumExp}_\lambda - \max\|_\infty \le \lambda \log K$, we can prove that $\sup_s |V_\lambda(s) - V_0(s)| \le \lambda \log K / (1 - \gamma)$. Thus, we can interpret $V_\lambda(s)$ as an approximate value function which we can estimate faster.\n\nComparison to lower bound For non-regularized problems, Kearns et al. [1999] prove a sample complexity lower bound of $\Omega\big((1/\epsilon)^{1/\log(1/\gamma)}\big)$, which is polynomial in $1/\epsilon$, but its exponent grows as $\gamma$ approaches 1. For regularized problems, Theorem 1 shows that the sample complexity is polynomial with an exponent that is independent of $\gamma$. Hence, when $\gamma$ is close to 1, regularization gives us a better asymptotic behavior with respect to $1/\epsilon$ than the lower bound for the non-regularized case, although we are not estimating the same value.\n\n5 Generalization of SmoothCruiser\n\nConsider the general definition of value functions in Equation 4. Although we focused on the case where $F_s$ is the LogSumExp function, which arises as a consequence of entropy regularization, our theoretical results hold for any set of functions $\{F_s\}_{s \in \mathcal{S}}$ that, for any $s$, satisfy the following conditions:\n\n1. $F_s$ is differentiable;\n2. $\forall Q \in \mathbb{R}^K$, $0 < \|\nabla F_s(Q)\|_1 \le 1$;\n3. (nonnegative gradient) $\forall Q \in \mathbb{R}^K$, $\nabla F_s(Q) \succeq 0$;\n4. ($L$-smooth) there exists $L \ge 0$ such that for any $Q, Q' \in \mathbb{R}^K$,\n$|F_s(Q) - F_s(Q') - (Q - Q')^\top \nabla F_s(Q')| \le L \|Q - Q'\|_2^2.$\n\nFor the more general definition above, we need to make two simple modifications of the procedure sampleV. When $\epsilon < \kappa$, the action $A$ in sampleV is sampled according to 
When \u03b5 < \u03ba, the action A in sampleV is sampled according to\n\nand its output is modi\ufb01ed to\n\nFs((cid:98)Qs) \u2212 (cid:98)QT\n\nA \u223c \u2207Fs((cid:98)Qs)\n(cid:107)\u2207Fs((cid:98)Qs)(cid:107)1\ns\u2207Fs((cid:98)Qs) + (R + \u03b3(cid:98)v)(cid:107)\u2207Fs((cid:98)Q)(cid:107)1.\n\nIn particular, SmoothCruiser can be used for more general regularization schemes, as long as the\nBellman operators satisfy the assumptions above. One such example is presented in Appendix E.\n\n6 Conclusion\n\nWe provided SmoothCruiser, an algorithm that estimates the value function of MDPs and discounted\ngames de\ufb01ned through smooth approximations of the optimal Bellman operator, which is the case\nin entropy-regularized value functions. More generally, our algorithm can also be used when value\nfunctions are de\ufb01ned through any smooth Bellman operator with nonnegative gradients. We showed\n\nthat our algorithm has a polynomial sample complexity of (cid:101)O(1/\u03b54), where \u03b5 is the desired precision.\n\nThis guarantee is problem independent and holds for state spaces of arbitrary cardinality.\nOne interesting interpretation of our results is that computing entropy-regularized value functions,\nwhich are commonly employed for reinforcement learning, can be seen as a smooth relaxation of a\nplanning problem for which we can obtain a much better sample complexity in terms of the required\nprecision \u03b5. Unsurprisingly, when the regularization tends to zero, we recover the well-known\n\nnon-polynomial bound O(cid:0)(1/\u03b5)log(1/\u03b5)(cid:1) of Kearns et al. [1999]. Hence, an interesting direction for\n\nfuture work is to study adaptive regularization schemes in order to accelerate planning algorithms.\nAlthough SmoothCruiser makes large amount of recursive calls, which makes it impractical in most\nsituations, we believe it might help us to understand how regularization speeds planning and inspire\nmore practical algorithms. 
This might be possible by exploiting its similarities to Monte-Carlo tree search that we have outlined above.

Acknowledgments  The research presented was supported by European CHIST-ERA project DELTA, French Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council, Inria and Otto-von-Guericke-Universität Magdeburg associated-team north-European project Allocate, and French National Research Agency project BoB (grant n.ANR-16-CE23-0003), FMJH Program PGMO with the support of this program from Criteo.

References

Peter L. Bartlett, Victor Gabillon, Jennifer Healey, and Michal Valko. Scale-free adaptive planning for deterministic dynamics & discounted rewards. In International Conference on Machine Learning, 2019.

Sébastien Bubeck and Rémi Munos. Open-loop optimistic planning. In Conference on Learning Theory, 2010.

Lucian Buşoniu and Rémi Munos. Optimistic planning for Markov decision processes. In International Conference on Artificial Intelligence and Statistics, 2012.

Pierre-Arnaud Coquelin and Rémi Munos. Bandit algorithms for tree search. In Uncertainty in Artificial Intelligence, 2007.

Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. Computers and Games, 4630:72–83, 2007.

Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, 2018.

Zohar Feldman and Carmel Domshlak. Simple regret optimization in online planning for Markov decision processes. Journal of Artificial Intelligence Research, 2014.

Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes.
In International Conference on Machine Learning, 2019.

Jean-Bastien Grill, Michal Valko, and Rémi Munos. Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning. In Neural Information Processing Systems, 2016.

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, 2017.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.

Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM, 60, 2013.

Jean-François Hren and Rémi Munos. Optimistic planning of deterministic systems. In European Workshop on Reinforcement Learning, 2008.

Ruitong Huang, Mohammad M. Ajallooeian, Csaba Szepesvári, and Martin Müller. Structured best-arm identification with fixed confidence. In Algorithmic Learning Theory, 2017.

Emilie Kaufmann and Wouter M. Koolen. Monte-Carlo tree search by best-arm identification. In Neural Information Processing Systems, 2017.

Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. In International Conference on Artificial Intelligence and Statistics, 1999.

Levente Kocsis and Csaba Szepesvári. Bandit-based Monte-Carlo planning. In European Conference on Machine Learning, 2006.

Edouard Leurent and Odalric-Ambrym Maillard.
Practical open-loop optimistic planning. In European Conference on Machine Learning, 2019.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016.

Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized Markov decision processes. arXiv preprint arXiv:1705.07798, 2017.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, NY, 1994.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, 2015.

John Schulman, Xi Chen, and Pieter Abbeel. Equivalence between policy gradients and soft Q-learning. arXiv preprint arXiv:1704.06440, 2017.

Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bulletin, 2(4):160–163, 1991.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.

Balázs Szörényi, Gunnar Kedenburg, and Rémi Munos. Optimistic planning in Markov decision processes using a generative model. In Neural Information Processing Systems, 2014.

Thomas J. Walsh, Sergiu Goschin, and Michael L. Littman. Integrating sample-based planning and model-based reinforcement learning.
In AAAI Conference on Artificial Intelligence, 2010.