{"title": "Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 833, "page_last": 840, "abstract": null, "full_text": "Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods\n\nAlessandro Lazaric Marcello Restelli Andrea Bonarini\n\nDepartment of Electronics and Information\n\nPolitecnico di Milano\n\npiazza Leonardo da Vinci 32, I-20133 Milan, Italy\n\n{bonarini,lazaric,restelli}@elet.polimi.it\n\nAbstract\n\nLearning in real-world domains often requires dealing with continuous state and\naction spaces. Although many solutions have been proposed to apply Reinforcement\nLearning algorithms to continuous state problems, the same techniques can\nhardly be extended to continuous action spaces, where, besides the computation of\na good approximation of the value function, a fast method for the identification of\nthe highest-valued action is needed. In this paper, we propose a novel actor-critic\napproach in which the policy of the actor is estimated through sequential Monte\nCarlo methods. The importance sampling step is performed on the basis of the\nvalues learned by the critic, while the resampling step modifies the actor\u2019s policy.\nThe proposed approach has been empirically compared to other learning algorithms\nin several domains; in this paper, we report results obtained in a control\nproblem consisting of steering a boat across a river.\n\n1 Introduction\n\nMost of the research on Reinforcement Learning (RL) [13] has studied solutions to finite Markov\nDecision Processes (MDPs). On the other hand, learning in real-world environments requires\ndealing with continuous state and action spaces. While several studies have focused on problems with\ncontinuous states, little attention has been devoted to tasks involving continuous actions. 
Although\nseveral tasks may be (suboptimally) solved by coarsely discretizing the action variables (for instance,\nusing the tile coding approach [11, 12]), a different approach is required for problems in\nwhich high-precision control is needed and actions slightly different from the optimal one lead to\nvery low utility values. In fact, since RL algorithms need to experience each available action several\ntimes to estimate its utility, using very fine discretizations may be too expensive for the learning\nprocess. Some approaches, although using a finite set of target actions, deal with this problem by\nselecting real-valued actions obtained by interpolation of the available discrete actions on the basis\nof their utility values [9, 14]. Despite this capability, the learning performance of these algorithms\nrelies on strong assumptions about the shape of the value function that are not always satisfied in\nhighly non-linear control problems. The wire fitting algorithm [2] (later adopted also in [4]) tries to\nsolve this problem by implementing an adaptive interpolation scheme in which a finite set of pairs\n\u27e8action, value\u27e9 is modified in order to better approximate the action value function.\nBesides having the capability of selecting any real-valued action, RL algorithms for continuous action\nproblems should be able to efficiently find the greedy action, i.e., the action associated with the\nhighest estimated value. Unlike the finite MDP case, a full search in a continuous action\nspace to find the optimal action is often infeasible. To overcome this problem, several approaches\nlimit their search to a finite number of points. To keep this number low, many algorithms\n(e.g., tile coding and interpolation-based) need to make (often implicit) assumptions about the shape\nof the value function. 
To overcome these difficulties, several approaches have adopted the actor-critic\narchitecture [7, 10]. The key idea of actor-critic methods is to explicitly represent the policy\n(stored by the actor) with a memory structure independent of the one used for the value function\n(stored by the critic). In a given state, the policy followed by the agent is a probability distribution\nover the action space, usually represented by parametric functions (e.g., Gaussians [6], neural networks\n[14], fuzzy systems [5]). The role of the critic is, on the basis of the estimated value function,\nto criticize the actions taken by the actor, which consequently modifies its policy through a stochastic\ngradient on its parameter space. In this way, starting from a fully exploratory policy, the actor\nprogressively changes its policy so that actions that yield higher utility values are more frequently\nselected, until the learning process converges to the optimal policy. By explicitly representing the\npolicy, actor-critic approaches can efficiently implement the action selection step even in problems\nwith continuous action spaces.\n\nIn this paper, we propose to use a Sequential Monte Carlo (SMC) method [8] to approximate the\nsequence of probability distributions implemented by the actor, thus obtaining a novel actor-critic\nalgorithm called SMC-learning. Instead of a parametric function, the actor represents its stochastic\npolicy by means of a finite set of random samples (i.e., actions) that, using simple resampling and\nmoving mechanisms, is evolved over time according to the values stored by the critic. Actions are\ninitially drawn from a prior distribution, and then they are resampled according to an importance\nsampling estimate which depends on the utility values learned by the critic. 
By means of the resampling\nand moving steps, the set of available actions becomes denser and denser around actions with\nlarger utilities, thus encouraging a detailed exploration of the most promising action-space regions,\nand allowing SMC-learning to find real continuous actions. It is worth pointing out that the main\ngoal here is not an accurate approximation of the action-value function on the whole action space,\nbut to provide an efficient way to converge to the continuous optimal policy. The main characteristics\nof the proposed approach are: the agent may learn to execute any continuous action, the action\nselection phase and the search for the action with the best estimated value are computationally efficient,\nno assumption on the shape of the value function is required, the algorithm is model-free, and\nit may also learn to follow stochastic policies (needed in multi-agent problems).\n\nIn the next section, we introduce basic RL notation and briefly discuss issues about learning with\ncontinuous actions. Section 3 details the proposed learning approach (SMC-learning), explaining\nhow SMC methods can be used to learn in continuous action spaces. Experimental results are\ndiscussed in Section 4, and Section 5 draws conclusions and contains directions for future research.\n\n2 Reinforcement Learning\n\nIn reinforcement learning problems, an agent interacts with an unknown environment. At each\ntime step, the agent observes the state, takes an action, and receives a reward. The goal of the\nagent is to learn a policy (i.e., a mapping from states to actions) that maximizes the long-term\nreturn. 
An RL problem can be modeled as a Markov Decision Process (MDP) defined by a tuple\n\u27e8S, A, T, R, \u03b3\u27e9, where S is the set of states, A(s) is the set of actions available in state s,\nT : S \u00d7 A \u00d7 S \u2192 [0, 1] is a transition distribution that specifies the probability of observing a certain\nstate after taking a given action in a given state, R : S \u00d7 A \u2192 \u211c is a reward function that specifies the\ninstantaneous reward when taking a given action in a given state, and \u03b3 \u2208 [0, 1) is a discount factor.\nThe policy of an agent is characterized by a probability distribution \u03c0(a|s) that specifies the probability\nof taking action a in state s. The utility of taking action a in state s and following a policy \u03c0 thereafter\nis formalized by the action-value function\n\nQ^\u03c0(s, a) = E[\u2211_{t=1}^{\u221e} \u03b3^{t\u22121} r_t | s = s_1, a = a_1, \u03c0],\n\nwhere r_1 = R(s, a). RL approaches aim at learning the policy that maximizes the action-value function\nin each state. The optimal action-value function can be computed by solving the Bellman equation:\n\nQ\u2217(s, a) = R(s, a) + \u03b3 \u2211_{s\u2032} T(s, a, s\u2032) max_{a\u2032} Q\u2217(s\u2032, a\u2032).\n\nThe optimal policy can be defined as\nthe greedy action in each state: \u03c0\u2217(a|s) is equal to 1/|arg max_a Q\u2217(s, a)| if a \u2208 arg max_a Q\u2217(s, a),\nand 0 otherwise.\n\nTemporal Difference (TD) algorithms [13] allow the computation of Q\u2217(s, a) by direct interaction\nwith the environment. 
Given the tuple \u27e8s_t, a_t, r_t, s_{t+1}, a_{t+1}\u27e9 (i.e., the experience gathered by the\nagent), at each step, action values may be estimated by online algorithms, such as SARSA, whose\nupdate rule is:\n\nQ(s_t, a_t) \u2190 (1 \u2212 \u03b1) Q(s_t, a_t) + \u03b1 u(r_t, a_{t+1}, s_{t+1}),    (1)\n\nwhere \u03b1 \u2208 [0, 1] is a learning rate and u(r_t, a_{t+1}, s_{t+1}) = r_t + \u03b3 Q(s_{t+1}, a_{t+1}) is the target utility.\n\nAlthough value-function approaches have theoretical guarantees about convergence to the optimal\npolicy and have proved to be effective in many applications, they have several limitations:\nalgorithms that maximize the value function cannot solve problems whose solutions are stochastic\npolicies (e.g., multi-agent learning problems); small errors in the estimated value of an action\nmay lead to discontinuous changes in the policy [3], thus leading to convergence problems when\nfunction approximators are considered. These problems may be overcome by adopting actor-critic\nmethods [7] in which the action-value function and the policy are stored in two distinct representations.\nThe actor typically represents the distribution density over the action space through a\nfunction \u03c0(a|s, \u03b8), whose parameters \u03b8 are updated in the direction of performance improvement,\nas established by the critic on the basis of its approximation of the value function, which is usually\ncomputed through an on-policy TD algorithm.\n\n3 SMC-Learning for Continuous Action Spaces\n\nSMC-learning is based on an actor-critic architecture, in which the actor stores and updates, for\neach state s, a density distribution \u03c0_t(a|s) that specifies the agent\u2019s policy at time instant t. At\nthe beginning of the learning process, without any prior information about the problem, the actor\nusually considers a uniform distribution over the action space, thus implementing a fully exploratory\npolicy. 
As the learning process progresses, the critic collects data for the estimation of the value\nfunction (in this paper, the critic estimates the action-value function), and provides the actor with\ninformation about which actions are the most promising. On the other hand, the actor changes its\npolicy to improve its performance and to progressively reduce the exploration in order to converge\nto the optimal deterministic policy.\nInstead of using parametric functions, in SMC-learning the\nactor represents its evolving stochastic policy by means of Monte Carlo sampling. The idea is the\nfollowing: for each state s, the set of available actions A(s) is initialized with N samples drawn\nfrom a proposal distribution \u03c0_0(a|s):\n\nA(s) = {a_1, a_2, \u00b7\u00b7\u00b7, a_N},    a_i \u223c \u03c0_0(a|s).\n\nEach sampled action a_i is associated with an importance weight w_i \u2208 W(s) whose value is initialized\nto 1/N, so that the prior density can be approximated as\n\n\u03c0_0(a|s) \u2243 \u2211_{i=1}^{N} w_i \u00b7 \u03b4(a \u2212 a_i),\n\nwhere a_i \u2208 A(s), w_i \u2208 W(s), and \u03b4 is the Dirac delta measure. As the number of samples goes\nto infinity, this representation becomes equivalent to the functional description of the original probability\ndensity function. This means that the actor can approximately follow the policy specified by the\ndensity \u03c0_0(a|s), by simply choosing actions at random from A(s), where the (normalized) weights\nare the selection probabilities. Given the continuous action-value function estimated by the critic\nand a suitable exploration strategy (e.g., the Boltzmann exploration), it is possible to define\nthe desired probability distribution over the continuous action space, usually referred to as the target\ndistribution. 
As the learning process goes on, the action values estimated by the critic\nbecome more and more reliable, and the policy followed by the agent should change in order to\nchoose actions with higher utilities more frequently. This means that, in each state, the target distribution\nchanges according to the information collected during the learning process, and the actor\nmust consequently adapt its approximation.\n\nIn general, when no information is available about the shape of the target distribution, SMC methods\ncan be effectively employed to approximate sequences of probability distributions by means of\nrandom samples, which are evolved over time exploiting importance sampling and resampling techniques.\nThe idea behind importance sampling is to modify the weights of the samples to account\nfor the differences between the target distribution p(x) and the proposal distribution \u03c0(x) used to\ngenerate the samples. By setting each weight w_i proportional to the ratio p(x_i)/\u03c0(x_i), the discrete\nweighted distribution \u2211_{i=1}^{N} w_i \u00b7 \u03b4(x \u2212 x_i) better approximates the target distribution. In our context,\nthe importance sampling step is performed by the actor, which modifies the weights of the actions\naccording to their utility values estimated by the critic. When some samples have very small or very\nlarge normalized weights, it follows that the target density significantly differs from the proposal\ndensity used to draw the samples. From a learning perspective, this means that the set of available\nactions contains a number of samples whose estimated utility is very low. To avoid this, the actor\nhas to modify the set of available actions by resampling new actions from the current weighted\napproximation of the target distribution.\n\nIn SMC-learning, SMC methods are embedded in a learning algorithm that iterates through three\nmain steps (see Algorithm 1): the action selection performed by the actor, the update of the action-value\nfunction managed by the critic, and finally the update of the policy of the actor.\n\nAlgorithm 1 SMC-learning algorithm\n\nfor all s \u2208 S do\n    Initialize A(s) by drawing N samples from \u03c0_0(a|s)\n    Initialize W(s) with uniform values: w_i = 1/N\nend for\nfor each time step t do\n    Action Selection\n    Given the current state s_t, the actor selects action a_t from A(s_t) according to \u03c0_t(a|s) = \u2211_{i=1}^{N} w_i \u00b7 \u03b4(a \u2212 a_i)\n    Critic Update\n    Given the reward r_t and the utility of next state s_{t+1}, the critic updates the action value Q(s_t, a_t)\n    Actor Update\n    Given the action-value function, the actor updates the importance weights\n    if the weights have a high variance then\n        the set A(s_t) is resampled\n    end if\nend for\n\n3.1 Action Selection\n\nOne of the main issues of learning in continuous action spaces is to determine which is the best action\nin the current state, given the (approximated) action-value function. Actor-critic methods effectively\nsolve this problem by explicitly storing the current policy. As previously described, in SMC-learning\nthe actor performs the action selection step by taking one action at random among those available\nin the current state. The probability of extraction of each action is equal to its normalized weight\nPr(a_i|s) = w_i. 
The time complexity of the action selection phase for SMC-learning is logarithmic\nin the number of action samples.\n\n3.2 Critic Update\n\nWhile the actor determines the policy, the critic, on the basis of the collected rewards, computes\nan approximation of the action-value function. Although several function approximation schemes\ncould be adopted for this task (e.g., neural networks, regression trees, support-vector machines), we\nuse a simple solution: the critic stores an action value, Q(s, a_i), for each action available in state s\n(as in tabular approaches) and modifies it according to TD update rules (see Equation 1). Using\non-policy algorithms, such as SARSA, the time complexity of the critic update is constant (i.e., it does\nnot depend on the number of available actions).\n\n3.3 Actor Update\n\nThe core of SMC-learning is the update of the policy distribution performed by the\nactor. Using the importance sampling principle, the actor modifies the weights w_i, thus performing\na policy improvement step based on the action values computed by the critic. In this way, actions\nwith higher estimates get more weight. Several RL schemes could be adopted to update the weights.\nIn this paper, we focus on the Boltzmann exploration strategy [13].\n\nThe Boltzmann exploration strategy privileges the execution of actions with higher estimated utility\nvalues. The probabilities computed by the Boltzmann exploration can be used as weights for the\navailable actions. At time instant t, the weight of action a_i in state s is updated as follows:\n\nw_i^{t+1} = w_i^t \u00b7 e^{\u2206Q_{t+1}(s, a_i)/\u03c4} / [\u2211_{j=1}^{N} w_j^t \u00b7 e^{\u2206Q_{t+1}(s, a_j)/\u03c4}],    (2)\n\nwhere \u2206Q_{t+1}(s, a_i) = Q_{t+1}(s, a_i) \u2212 Q_t(s, a_i), and the parameter \u03c4 (usually referred to as the temperature)\nspecifies the exploration degree: the higher \u03c4, the higher the exploration.\n\nOnce the weights have been modified, the agent\u2019s policy has changed. 
Unfortunately, it is not\npossible to optimally solve continuous action MDPs by exploring only a finite set of actions sampled\nfrom a prior distribution, since the optimal action may not be available. Since the prior distribution\nused to initialize the set of available actions significantly differs from the optimal policy distribution,\nafter a few iterations, several actions will have negligible weights: this problem is known as the\nweight degeneracy phenomenon [1]. Since the number of samples should be kept low for efficiency\nreasons, having actions associated with very small weights means wasting learning parameters on\napproximating both the policy and the value function in regions of the action space that are not\nrelevant with respect to the optimal policy. Furthermore, long learning time is spent executing\nand updating utility values of actions that are not likely to be optimal. Therefore, following the SMC\napproach, after the importance sampling phase, a resampling step may be needed in order to improve\nthe distribution of the samples on the action domain. The degeneracy phenomenon can be measured\nthrough the effective sample size [8], which, for each state s, can be estimated by\n\nN\u0302_eff(s) = 1 / \u2211_{w_i \u2208 W(s)} w_i^2,    (3)\n\nwhere w_i is the normalized weight. N\u0302_eff(s) never exceeds the number of actions contained\nin A(s) (it equals N only when the weights are uniform), and low values of N\u0302_eff(s) reveal high degeneracy. In order to avoid high degeneracy,\nthe actions are resampled whenever the ratio between the effective sample size N\u0302_eff(s) and the\nnumber of samples N falls below some given threshold \u03c3. The goal of resampling methods is to\nreplace samples with small weights with new samples close to samples with large weights, so that\nthe discrepancy between the resampled weights is reduced. 
The new set of samples is generated by\nresampling (with replacement) N times from the following discrete distribution\n\n\u03c0(a|s) = \u2211_{i=1}^{N} w_i \u00b7 \u03b4(a \u2212 a_i),    (4)\n\nso that samples with high weights are selected many times. Among the several resampling approaches\nthat have been proposed, here we consider the systematic resampling scheme, since it can\nbe easily implemented, takes O(N) time, and minimizes the Monte Carlo variance (refer to [1] for\nmore details). The new samples inherit the action values of their parents, and the sample\nweights are initialized using the Boltzmann distribution.\n\nAlthough the resampling step reduces the degeneracy, it introduces another problem known as sample\nimpoverishment. Since samples with large weights are replicated several times, after a few\nresampling steps a significant number of samples could be identical. Furthermore, we need to learn\nover a continuous space, and this cannot be carried out using a discrete set of fixed samples; in\nfact, the learning agent would not be able to achieve the optimal policy whenever the initial set of\navailable actions in state s (A(s)) does not contain the optimal action of that state. This limitation\nmay be overcome by means of a smoothing step, which consists of moving the samples according to a\ncontinuous approximation \u03c0\u2032(a|s, w_i) of the posterior distribution. The approximation is obtained\nby using a weighted mean of kernel densities:\n\n\u03c0\u2032(a|s, w_i) = (1/h) \u2211_{i=1}^{N} w_i K((a \u2212 a_i)/h),    (5)\n\nwhere h > 0 is the kernel bandwidth. Typical choices for the kernel densities are Gaussian kernels\nand Epanechnikov kernels. However, these kernels produce over-dispersed posterior distributions,\nand this negatively affects the convergence speed of the learning process, especially when few\nsamples are used. 
Here, we propose to use uniform kernels:\n\nK_i(a) = U[(a_{i\u22121} \u2212 a_i)/2 ; (a_{i+1} \u2212 a_i)/2].    (6)\n\nAs far as boundary samples are concerned (i.e., a_1 and a_N), their corresponding kernels are set\nto K_1(a) = U[(a_1 \u2212 a_2) ; (a_2 \u2212 a_1)/2] and K_N(a) = U[(a_{N\u22121} \u2212 a_N)/2 ; (a_N \u2212 a_{N\u22121})], respectively,\nthus preserving the possibility to cover the whole action domain. Using these (non-overlapping)\nkernel densities, each sample is moved locally within an interval which is determined\nby its distances from the adjacent samples, thus achieving fast convergence.\n\nFigure 1: The boat problem. (Both axes range from 0 to 200; the figure labels the rudder angle \u03c9, the boat angle \u03b4, the current, the quay, and the viability zone.)\n\nTable 1: The dynamics parameters.\n\nParameter    Value\nf_c    1.25\nI    0.1\ns_MAX    2.5\ns_D    1.75\np    0.9\nquay    (200, 110)\nZ_s width    0.2\nZ_v width    20\n\nTable 2: The learning parameters.\n\nAlg.    Param.    Value\nAll    \u03b1_0/\u03b4_\u03b1    0.5/0.01\nAll    \u03b3    0.99\nSARSA    \u03c4_0/\u03b4_\u03c4    3.0/0.0001\nSMC    \u03c3    0.95\nSMC    \u03c4_0/\u03b4_\u03c4    25.0/0.0005\nCont.-QL    \u03b5/\u03b4_\u03b5    0.4/0.005\n\nBesides reducing the dispersion of the samples, this resampling scheme implements, from the critic\nperspective, a variable resolution generalization approach. Since the resampled actions inherit the\naction value associated with their parent, the learned values are generalized over a region whose width\ndepends on the distance between samples. As a result, at the beginning of the learning process, when\nthe actions are approximately uniformly distributed, SMC-learning performs broad generalization,\nthus boosting the performance. 
On the other hand, when the learning is near convergence, the available\nactions tend to group around the optimal action, thus automatically reducing the generalization,\nwhich could otherwise prevent the learning of the optimal policy (see [12]).\n\n4 Experiments\n\nIn this section, we show experimental results with the aim of analyzing the properties of SMC-learning\nand comparing its performance with other RL approaches. Additional experiments on a\nmini-golf task and on the swing-up pendulum problem are reported in the Appendix.\n\n4.1 The Boat Problem\n\nTo illustrate how the SMC-learning algorithm works and to assess its effectiveness with respect to\napproaches based on discretization, we used a variant of the boat problem introduced in [5]. The\nproblem is to learn a controller to drive a boat from the left bank to the right bank quay of a river,\nwith a strong non-linear current (see Figure 1). The boat\u2019s bow coordinates, x and y, are defined\nin the range [0, 200] and the controller sets the desired direction U over the range [\u221290\u00b0, 90\u00b0]. 
The\ndynamics of the boat\u2019s bow coordinates is described by the following equations:\n\nx_{t+1} = min(200, max(0, x_t + s_{t+1} cos(\u03b4_{t+1})))\ny_{t+1} = min(200, max(0, y_t \u2212 s_{t\u22121} sin(\u03b4_{t+1}) \u2212 E(x_{t+1})))\n\nwhere the effect of the current is defined by E(x) = f_c (x/50 \u2212 (x/100)^2), f_c is the force of the\ncurrent, and the boat angle \u03b4_t and speed s_t are updated according to the desired direction U_{t+1} as:\n\n\u03b4_{t+1} = \u03b4_t + I \u2126_{t+1}\n\u2126_{t+1} = \u2126_t + ((\u03c9_{t+1} \u2212 \u2126_t)(s_{t+1}/s_MAX))\ns_{t+1} = s_t + (s_D \u2212 s_t) I\n\u03c9_{t+1} = min(max(p(U_{t+1} \u2212 \u03b4_t), \u221245\u00b0), 45\u00b0)\n\nwhere I is the system inertia, s_MAX is the maximum speed allowed for the boat, s_D is the speed\ngoal, \u03c9 is the rudder angle, and p is a proportional coefficient used to compute the rudder angle in\norder to reach the desired direction U_t.\nThe reward function is defined on three bank zones. The success zone Z_s corresponds to the quay,\nthe viability zone Z_v is defined around the quay, and the failure zone Z_f covers all the other bank points.\nTherefore, the reward function is defined as:\n\nR(x, y) =\n    +10        if (x, y) \u2208 Z_s\n    D(x, y)    if (x, y) \u2208 Z_v\n    \u221210        if (x, y) \u2208 Z_f\n    0          otherwise    (7)\n\nwhere D is a function that gives a reward decreasing linearly from 10 to -10 relative to the distance\nfrom the success zone. In the experiment, each state variable is discretized in 10 intervals\nand the parameters of the dynamics are those listed in Table 1. At each trial, the boat is positioned\nat random along the left bank in one of the points shown in Figure 1. In the following, we compare\nthe results obtained with four different algorithms: SARSA with Boltzmann exploration with\ndifferent discretizations of the action space, SARSA with tile coding (or CMAC) [12], Continuous\nQ-learning [9], and SMC-learning. The learning parameters of each algorithm are listed in Table 2 (see footnote 1).\n\nFigure 2: Performance comparison between SMC-learning and SARSA (left), and between SMC-learning,\ntile coding, and Continuous Q-learning (right). Both panels plot the total reward against the number of episodes (x1000). Left panel, \"Sarsa vs SMC-learning\": SMC-learning (5 and 10 samples), Sarsa (5, 10, 20, and 40 actions). Right panel, \"QL-Continuous, Tile coding, SMC-learning\": SMC-learning (10 samples), QL-Continuous (40 actions), Tile coding (80 actions).\n\nFigure 2-left compares the learning performance (in terms of total reward per episode) of SARSA\nwith 5, 10, 20, and 40 evenly distributed actions with the results obtained by SMC-learning with\n5 and 10 samples. As can be seen, the more actions available, the better\nSARSA performs. With only 5 actions (one action every 36\u00b0), the paths that the controller\ncan follow are quite limited and the quay is not reachable from any of the starting points. As a\nresult, the controller learned by SARSA achieves a very poor performance. On the other hand, a\nfiner discretization allows the boat to reach the quay more frequently, even if it takes about three\ntimes the number of episodes to converge with respect to the case with 5 actions. As can be\nseen, SMC-learning with 5 samples outperforms SARSA with 5 and 10 actions both in terms of\nperformance and in convergence time. In fact, after a few trials, SMC-learning succeeds in removing\nthe less-valued samples and in adding new samples in regions of the action space where higher rewards\ncan be obtained. 
As a result, not only can it achieve better performance than SARSA, but it also does\nnot spend time exploring useless actions, thus improving the convergence time as well. Nonetheless,\nwith only 5 samples the actor stores a very roughly approximated policy, which, as a consequence\nof resampling, may converge to actions that do not obtain a performance as good as that of SARSA\nwith 20 and 40 actions. By increasing the number of samples from 5 to 10, SMC-learning succeeds\nin realizing a better coverage of the action space, and obtains performance equivalent to that of SARSA\nwith 40 actions. At the same time, while the more actions available, the longer SARSA takes to\nconverge, the convergence time of SMC-learning, as in the case with 5 samples, benefits from the\ninitial resampling, thus taking less than one sixth of the trials needed by SARSA to converge.\n\nFigure 2-right shows the comparison of the performance of SMC-learning, SARSA with tile coding\nusing two tilings and a resolution of 2.25\u00b0 (equivalent to 80 actions), and Continuous Q-learning\nwith 40 actions. We omit the results with fewer actions because both tile coding and Continuous\nQ-learning obtain poor performance. As can be seen, SMC-learning outperforms both the\ncompared algorithms. In particular, the generalization over the action space performed by tile coding\nnegatively affects the learning performance because of the non-linearity of the dynamics of the\nsystem. In fact, when only a few actions are available, two adjacent actions may have completely different\neffects on the dynamics and, thus, receive different rewards. Generalizing over these actions\nprevents the agent from learning which is the best action among those available. 
On the other hand,\nas the samples get closer, SMC-learning dynamically reduces its generalization over the action\nspace, so that their utility can be more accurately estimated. Similarly, Continuous Q-learning\nis strictly related to the actions provided by the designer and to the implicit assumption of linearity\nof the action-value function. As a result, although it could learn any real-valued action, it does not\nsucceed in obtaining the same performance as SMC-learning even with four times as many actions. In\nfact, the capability of SMC-learning to move samples towards more rewarding regions of the action\nspace allows the agent to learn more effective policies even with a very limited number of samples.\n\nFootnote 1: \u03b4_x is the decreasing rate for parameter x, whose value after N trials is computed as x(N) = x(0)/(1 + \u03b4_x N).\n\n5 Conclusions\n\nIn this paper, we have described a novel actor-critic algorithm to solve continuous action problems.\nThe algorithm is based on a Sequential Monte Carlo approach that allows the actor to represent\nthe current policy through a finite set of available actions associated with weights, which are updated\nusing the utility values computed by the critic. Experimental results show that SMC-learning is\nable to identify the highest valued actions through a process of importance sampling and resampling.\nThis allows SMC-learning to obtain better performance than static solutions such\nas Continuous Q-learning and tile coding even with a very limited number of samples, thus improving\nalso the convergence time. Future research activity will follow two main directions: extending\nSMC-learning to problems in which no good discretization of the state space is a priori known, and\nexperimenting with continuous action multi-agent problems.\n\nReferences\n\n[1] M. Sanjeev Arulampalam, Simon Maskell, Neil Gordon, and Tim Clapp. 
A tutorial on particle\nfilters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. on Signal Processing,\n50(2):174\u2013188, 2002.\n\n[2] Leemon C. Baird and A. Harry Klopf. Reinforcement learning with high-dimensional, continuous\nactions. Technical Report WL-TR-93-117, Wright-Patterson Air Force Base Ohio:\nWright Laboratory, 1993.\n\n[3] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont,\nMA, 1996.\n\n[4] Chris Gaskett, David Wettergreen, and Alexander Zelinsky. Q-learning in continuous state and\naction spaces. In Australian Joint Conference on Artificial Intelligence, pages 417\u2013428, 2003.\n\n[5] L. Jouffe. Fuzzy inference system learning by reinforcement methods. IEEE Trans. on Systems,\nMan, and Cybernetics-PART C, 28(3):338\u2013355, 1998.\n\n[6] H. Kimura and S. Kobayashi. Reinforcement learning for continuous action using stochastic\ngradient ascent. In 5th Intl. Conf. on Intelligent Autonomous Systems, pages 288\u2013295, 1998.\n\n[7] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. SIAM Journal on Control and Optimization,\n42(4):1143\u20131166, 2003.\n\n[8] J. S. Liu and E. Chen. Sequential Monte Carlo methods for dynamical systems. Journal of\nthe American Statistical Association, 93:1032\u20131044, 1998.\n\n[9] Jose Del R. Millan, Daniele Posenato, and Eric Dedieu. Continuous-action Q-learning. Machine\nLearning, 49:247\u2013265, 2002.\n\n[10] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In Proceedings of the IEEE\nInternational Conference on Intelligent Robotics Systems (IROS), pages 2219\u20132225, 2006.\n\n[11] J. C. Santamaria, R. S. Sutton, and A. Ram. Experiments with reinforcement learning in\nproblems with continuous state and action spaces. Adaptive Behavior, 6:163\u2013217, 1998.\n\n[12] Alexander A. Sherstov and Peter Stone. Function approximation via tile coding: Automating\nparameter choice. 
In SARA 2005, LNAI, pages 194\u2013205. Springer Verlag, 2005.\n\n[13] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT\nPress, Cambridge, MA, 1998.\n\n[14] Hado van Hasselt and Marco Wiering. Reinforcement learning in continuous action spaces. In\n2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning,\npages 272\u2013279, 2007.\n", "award": [], "sourceid": 959, "authors": [{"given_name": "Alessandro", "family_name": "Lazaric", "institution": null}, {"given_name": "Marcello", "family_name": "Restelli", "institution": null}, {"given_name": "Andrea", "family_name": "Bonarini", "institution": null}]}