{"title": "Policy Search for Motor Primitives in Robotics", "book": "Advances in Neural Information Processing Systems", "page_first": 849, "page_last": 856, "abstract": "Many motor skills in humanoid robotics can be learned using parametrized motor primitives as done in imitation learning. However, most interesting motor learning problems are high-dimensional reinforcement learning problems often beyond the reach of current methods. In this paper, we extend previous work on policy learning from the immediate reward case to episodic reinforcement learning. We show that this results into a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning by assuming a form of exploration that is particularly well-suited for dynamic motor primitives. The resulting algorithm is an EM-inspired algorithm applicable in complex motor learning tasks. We compare this algorithm to alternative parametrized policy search methods and show that it outperforms previous methods. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task using a real Barrett WAM robot arm.", "full_text": "Policy Search for Motor Primitives in Robotics\n\nJens Kober, Jan Peters\n\nMax Planck Institute for Biological Cybernetics\n\nSpemannstr. 38\n\n72076 T\u00fcbingen, Germany\n\n{jens.kober,jan.peters}@tuebingen.mpg.de\n\nAbstract\n\nMany motor skills in humanoid robotics can be learned using parametrized motor\nprimitives as done in imitation learning. However, most interesting motor learn-\ning problems are high-dimensional reinforcement learning problems often beyond\nthe reach of current methods. In this paper, we extend previous work on policy\nlearning from the immediate reward case to episodic reinforcement learning. 
We\nshow that this results in a general, common framework also connected to\npolicy gradient methods and yielding a novel algorithm for policy learning that is\nparticularly well-suited for dynamic motor primitives. The resulting algorithm is\nan EM-inspired algorithm applicable to complex motor learning tasks. We\ncompare this algorithm to several well-known parametrized policy search methods and\nshow that it outperforms them. We apply it in the context of motor learning and\nshow that it can learn a complex Ball-in-a-Cup task using a real Barrett WAM\u2122\nrobot arm.\n\n1 Introduction\nPolicy search, also known as policy learning, has become an accepted alternative to value function-based\nreinforcement learning [2]. In high-dimensional domains with continuous states and actions,\nsuch as robotics, this approach has previously proven successful as it allows the use of domain-appropriate\npre-structured policies, the straightforward integration of a teacher\u2019s presentation as\nwell as fast online learning [2, 3, 10, 18, 5, 6, 4]. In this paper, we extend the previous work\nin [17, 18] from the immediate reward case to episodic reinforcement learning and show how it\nrelates to policy gradient methods [7, 8, 11, 10]. Although many real-world motor learning tasks\nare essentially episodic [14], episodic reinforcement learning [1] remains a largely underexplored topic.\nThe resulting framework allows us to derive a new algorithm called Policy Learning by Weighting\nExploration with the Returns (PoWER) which is particularly well-suited for learning trial-based\ntasks in motor control. We are especially interested in a particular kind of motor control policy also\nknown as dynamic motor primitives [22, 23]. 
In this approach, dynamical systems are used to\nencode a policy, i.e., we have a special kind of parametrized policy which is well-suited for\nrobotics problems.\nWe show that the presented algorithm works well when employed in the context of learning dynamic\nmotor primitives in four different settings, i.e., the two benchmark problems from [10], the\nUnderactuated Swing-Up [21] and the complex task of Ball-in-a-Cup [24, 20]. Both the Underactuated\nSwing-Up and the Ball-in-a-Cup are achieved on a real Barrett WAM\u2122 robot arm. Please also\nrefer to the video on the first author\u2019s website. Looking at these tasks from a human motor learning\nperspective, we have a human acting as teacher presenting an example for imitation learning and,\nsubsequently, the policy is improved by reinforcement learning. Since such tasks are inherently\nsingle-stroke movements, we focus on the special class of episodic reinforcement learning. In our\nexperiments, we show how a presented movement is recorded using kinesthetic teach-in and,\nsubsequently, how a Barrett WAM\u2122 robot arm learns the behavior by a combination of imitation and\nreinforcement learning.\n\n2 Policy Search for Parameterized Motor Primitives\nOur goal is to find reinforcement learning techniques that can be applied to a special kind of\npre-structured parametrized policies called motor primitives [22, 23], in the context of learning\nhigh-dimensional motor control tasks. In order to do so, we first discuss our problem in the general\ncontext of reinforcement learning and introduce the required notation in Section 2.1. Using a\ngeneralization of the approach in [17, 18], we derive a new EM-inspired algorithm called Policy Learning\nby Weighting Exploration with the Returns (PoWER) in Section 2.3 and show how the general\nframework is related to policy gradient methods in Section 2.2. 
The work in [12] extends the algorithm of [17] to episodic\nreinforcement learning for discrete states; we use continuous states. Subsequently, we discuss how\nwe can turn the parametrized motor primitives [22, 23] into explorative [19], stochastic policies.\n2.1 Problem Statement & Notation\nIn this paper, we treat motor primitive learning problems in the framework of reinforcement learning\nwith a strong focus on the episodic case [1]. We assume that at time t there is an actor in a state\ns_t who chooses an appropriate action a_t according to a stochastic policy \u03c0(a_t|s_t, t). Such a policy\nis a probability distribution over actions given the current state. The stochastic formulation allows\na natural incorporation of exploration and, in the case of hidden state variables, the optimal\ntime-invariant policy has been shown to be stochastic [8]. Upon the completion of the action, the actor\ntransfers to a state s_{t+1} and receives a reward r_t. As we are interested in learning complex motor\ntasks consisting of a single stroke [23], we focus on finite horizons of length T with episodic restarts\n[1] and learn the optimal parametrized, stochastic policy for such reinforcement learning problems.\nWe assume an explorative version of the dynamic motor primitives [22, 23] as parametrized policy \u03c0\nwith parameters \u03b8 \u2208 R^n. However, in this section, we will keep most derivations sufficiently general\nthat they would transfer to various other parametrized policies. The general goal in reinforcement\nlearning is to optimize the expected return of the policy \u03c0 with parameters \u03b8 defined by\n\nJ(\u03b8) = \u222b_T p(\u03c4) R(\u03c4) d\u03c4,    (1)\n\nwhere T is the set of all possible paths, rollout \u03c4 = [s_{1:T+1}, a_{1:T}] (also called episode or trial)\ndenotes a path of states s_{1:T+1} = [s_1, s_2, . . ., s_{T+1}] and actions a_{1:T} = [a_1, a_2, . . ., a_T]. The\nprobability of rollout \u03c4 is denoted by p(\u03c4) while R(\u03c4) refers to its return. Using the standard\nassumptions of Markovness and additive accumulated rewards, we can write\n\np(\u03c4) = p(s_1) \u220f_{t=1}^{T} p(s_{t+1}|s_t, a_t) \u03c0(a_t|s_t, t),    R(\u03c4) = T^{\u22121} \u2211_{t=1}^{T} r(s_t, a_t, s_{t+1}, t),    (2)\n\nwhere p(s_1) denotes the initial state distribution, p(s_{t+1}|s_t, a_t) the next state distribution\nconditioned on last state and action, and r(s_t, a_t, s_{t+1}, t) denotes the immediate reward.\nWhile episodic Reinforcement Learning (RL) problems with finite horizons are common in\nmotor control, few methods exist in the RL literature, e.g., Episodic REINFORCE [7], the Episodic\nNatural Actor Critic (eNAC) [10] and model-based methods using differential-dynamic programming\n[21]. Nevertheless, in the analytically tractable cases, this setting has been studied in depth in the optimal\ncontrol community where it is well-known that for a finite horizon problem, the optimal solution\nis non-stationary [15] and, in general, cannot be represented by a time-independent policy. The\nmotor primitives based on dynamical systems [22, 23] are a particular type of time-variant policy\nrepresentation as they have an internal phase which corresponds to a clock with additional flexibility\n(e.g., for incorporating coupling effects, perceptual influences, etc.); thus, they can represent optimal\nsolutions for finite horizons. We embed this internal clock or movement phase into our state and,\nthus, from an optimal control perspective have ensured that the optimal solution can be represented.\n2.2 Episodic Policy Learning\nIn this section, we discuss episodic reinforcement learning in policy space which we will refer to\nas Episodic Policy Learning. For doing so, we first discuss the lower bound on the expected return\nsuggested in [17] for guaranteeing that policy update steps are improvements. 
Only the immediate reward case is discussed in [17, 18]; we extend their framework to episodic reinforcement\nlearning and, subsequently, derive a general update rule which yields the policy gradient theorem\n[8], a generalization of the reward-weighted regression [18] as well as the novel Policy learning by\nWeighting Exploration with the Returns (PoWER) algorithm.\n2.2.1 Bounds on Policy Improvements\nUnlike in reinforcement learning, other machine learning branches have focused on optimizing lower\nbounds, e.g., resulting in expectation-maximization (EM) algorithms [16]. The reasons for this\npreference also apply in policy learning: if the lower bound also becomes an equality for the sampling policy,\nwe can guarantee that the policy will be improved by optimizing the lower bound. Surprisingly,\nresults from supervised learning can be transferred with ease. For doing so, we follow the scenario\nsuggested in [17], i.e., generate rollouts \u03c4 using the current policy with parameters \u03b8 which we\nweight with the returns R(\u03c4) and subsequently match it with a new policy parametrized by \u03b8'.\nThis matching of the success-weighted path distribution is equivalent to minimizing the\nKullback-Leibler divergence D(p_{\u03b8'}(\u03c4) \u2016 p_\u03b8(\u03c4) R(\u03c4)) between the new path distribution p_{\u03b8'}(\u03c4) and the\nreward-weighted previous one p_\u03b8(\u03c4) R(\u03c4). 
As shown in [17, 18], this results in a lower bound on\nthe expected return using Jensen\u2019s inequality and the concavity of the logarithm, i.e.,\n\nlog J(\u03b8') = log \u222b_T [p_\u03b8(\u03c4) / p_\u03b8(\u03c4)] p_{\u03b8'}(\u03c4) R(\u03c4) d\u03c4 \u2265 \u222b_T p_\u03b8(\u03c4) R(\u03c4) log [p_{\u03b8'}(\u03c4) / p_\u03b8(\u03c4)] d\u03c4 + const,    (3)\n\u221d \u2212D(p_\u03b8(\u03c4) R(\u03c4) \u2016 p_{\u03b8'}(\u03c4)) = L_\u03b8(\u03b8'),    (4)\n\nwhere D(p(\u03c4) \u2016 q(\u03c4)) = \u222b p(\u03c4) log(p(\u03c4)/q(\u03c4)) d\u03c4 is the Kullback-Leibler divergence which is\nconsidered a natural distance measure between probability distributions, and the constant is needed\nfor tightness of the bound. Note that p_\u03b8(\u03c4) R(\u03c4) is an improper probability distribution as pointed\nout in [17]. The policy improvement step is equivalent to maximizing the lower bound on the\nexpected return L_\u03b8(\u03b8') and we show how it relates to previous policy learning methods.\n2.2.2 Resulting Policy Updates\nIn the following part, we will discuss three different policy updates which directly result from\nSection 2.2.1. First, we show that policy gradients [7, 8, 11, 10] can be derived from the lower bound\nL_\u03b8(\u03b8') (as was to be expected from supervised learning, see [13]). Subsequently, we show that\nnatural policy gradients can be seen as an additional constraint regularizing the change in the path\ndistribution resulting from a policy update when improving the policy incrementally. Finally, we\nwill show how expectation-maximization (EM) algorithms for policy learning can be generated.\nPolicy Gradients. When differentiating the function L_\u03b8(\u03b8') that defines the lower bound on the\nexpected return, we directly obtain\n\n\u2202_{\u03b8'} L_\u03b8(\u03b8') = \u222b_T p_\u03b8(\u03c4) R(\u03c4) \u2202_{\u03b8'} log p_{\u03b8'}(\u03c4) d\u03c4,    (5)\n\nwhere T is the set of all possible paths and \u2202_{\u03b8'} log p_{\u03b8'}(\u03c4) = \u2211_{t=1}^{T} \u2202_{\u03b8'} log \u03c0(a_t|s_t, t) denotes the\nlog-derivative of the path distribution. As this log-derivative only depends on the policy, we can\nestimate a gradient from rollouts without having a model by simply replacing the expectation by\na sum; when \u03b8' is close to \u03b8, we have the policy gradient estimator which is widely known as\nEpisodic REINFORCE [7], i.e., we have lim_{\u03b8'\u2192\u03b8} \u2202_{\u03b8'} L_\u03b8(\u03b8') = \u2202_\u03b8 J(\u03b8). Obviously, a reward which\nprecedes an action in a rollout can neither be caused by the action nor cause an action in the same\nrollout. Thus, when inserting Equations (2) into Equation (5), all cross-products between r_t and\n\u2202_\u03b8 log \u03c0(a_{t+\u03b4t}|s_{t+\u03b4t}, t + \u03b4t) for \u03b4t > 0 become zero in expectation [10]. Therefore, we can omit\nthese terms and rewrite the estimator as\n\n\u2202_{\u03b8'} L_\u03b8(\u03b8') = E{ \u2211_{t=1}^{T} \u2202_{\u03b8'} log \u03c0(a_t|s_t, t) Q^\u03c0(s, a, t) },    (6)\n\nwhere Q^\u03c0(s, a, t) = E{ \u2211_{t\u0303=t}^{T} r(s_t\u0303, a_t\u0303, s_{t\u0303+1}, t\u0303) | s_t = s, a_t = a } is called the state-action value\nfunction [1]. Equation (6) is equivalent to the policy gradient theorem [8] for \u03b8' \u2192 \u03b8 in the infinite\nhorizon case where the dependence on time t can be dropped.\nThe derivation results in the Natural Actor Critic as discussed in [9, 10] when adding an additional\npunishment to prevent large steps away from the observed path distribution. This can be achieved by\nrestricting the amount of change in the path distribution and, subsequently, determining the steepest\ndescent for a fixed step away from the observed trajectories. Change in probability distributions\nis naturally measured using the Kullback-Leibler divergence; thus, we add the additional\nconstraint D(p_{\u03b8'}(\u03c4) \u2016 p_\u03b8(\u03c4)) \u2248 0.5 (\u03b8' \u2212 \u03b8)^T F(\u03b8) (\u03b8' \u2212 \u03b8) = \u03b4, using a second-order expansion as\napproximation, where F(\u03b8) denotes the Fisher information matrix [9, 10].\nPolicy Search via Expectation Maximization. One major drawback of gradient-based\napproaches is the learning rate, an open parameter which can be hard to tune in control problems\nbut is essential for good performance. Expectation-Maximization algorithms are well-known to\navoid this problem in supervised learning while even yielding faster convergence [16]. Previously,\nsimilar ideas have been explored in immediate reinforcement learning [17, 18]. In general, an\nEM-algorithm would choose the next policy parameters \u03b8_{n+1} such that \u03b8_{n+1} = argmax_{\u03b8'} L_\u03b8(\u03b8'). In\nthe case where \u03c0(a_t|s_t, t) belongs to the exponential family, the next policy can be determined\nanalytically by setting Equation (6) to zero, i.e.,\n\nE{ \u2211_{t=1}^{T} \u2202_{\u03b8'} log \u03c0(a_t|s_t, t) Q^\u03c0(s, a, t) } = 0,    (7)\n\nand solving for \u03b8'. Depending on the choice of a stochastic policy, we will obtain different solutions\nand different learning algorithms. It allows the extension of the reward-weighted regression to larger\nhorizons as well as the introduction of the Policy learning by Weighting Exploration with the Returns\n(PoWER) algorithm.\n\nAlgorithm 1 Policy learning by Weighting Exploration with the Returns for Motor Primitives\n\nInput: initial policy parameters \u03b8_0\nrepeat\n  Sample: Perform rollout(s) using a = (\u03b8 + \u03b5_t)^T \u03c6(s, t) with [\u03b5_t]_{ij} \u223c N(0, \u03c3_{ij}^2) as stochastic\n  policy and collect all (t, s_t, a_t, s_{t+1}, \u03b5_t, r_{t+1}) for t = {1, 2, . . . , T + 1}.\n  Estimate: Use unbiased estimate Q\u0302^\u03c0(s, a, t) = \u2211_{t\u0303=t}^{T} r(s_t\u0303, a_t\u0303, s_{t\u0303+1}, t\u0303).\n  Reweight: Compute importance weights and reweight rollouts, discard low-importance rollouts.\n  Update policy using \u03b8_{k+1} = \u03b8_k + \u27e8 \u2211_{t=1}^{T} \u03b5_t Q^\u03c0(s, a, t) \u27e9_{w(\u03c4)} / \u27e8 \u2211_{t=1}^{T} Q^\u03c0(s, a, t) \u27e9_{w(\u03c4)}.\nuntil Convergence \u03b8_{k+1} \u2248 \u03b8_k\n\n2.3 Policy learning by Weighting Exploration with the Returns (PoWER)\nIn most learning control problems, we attempt to have a deterministic mean policy \u0101 = \u03b8^T \u03c6(s, t)\nwith parameters \u03b8 and basis functions \u03c6. In Section 3, we will introduce the basis functions of\nthe motor primitives. When learning motor primitives, we turn this deterministic mean policy\n\u0101 = \u03b8^T \u03c6(s, t) into a stochastic policy using additive exploration \u03b5(s, t) in order to make\nmodel-free reinforcement learning possible, i.e., we always intend to have a policy \u03c0(a_t|s_t, t) which can be\nbrought into the form a = \u03b8^T \u03c6(s, t) + \u03f5(\u03c6(s, t)). Previous work in this context [7, 4, 10, 18], with\nthe notable exception of [19], has focused on state-independent, white Gaussian exploration, i.e.,\n\u03f5(\u03c6(s, t)) \u223c N(0, \u03a3). It is straightforward to obtain the Reward-Weighted Regression for episodic\nRL by solving Equation (7) for \u03b8', which naturally yields a weighted regression method with the\nstate-action values Q^\u03c0(s, a, t) as weights. 
This form of exploration has resulted in various\napplications in robotics such as T-Ball batting, Peg-In-Hole, humanoid robot locomotion, constrained\nreaching movements and operational space control, see [4, 10, 18] for both reviews and their own\napplications.\nHowever, such unstructured exploration at every step has a multitude of disadvantages: it causes a\nlarge variance which grows with the number of time-steps [19, 10], it perturbs actions too frequently,\n\u2018washing\u2019 out their effects, and it can damage the system executing the trajectory. As a result, all\nmethods relying on this state-independent exploration have proven too fragile for learning the\nBall-in-a-Cup task on a real robot system. Alternatively, as introduced by [19], one could generate a form\nof structured, state-dependent exploration \u03f5(\u03c6(s, t)) = \u03b5_t^T \u03c6(s, t) with [\u03b5_t]_{ij} \u223c N(0, \u03c3_{ij}^2), where\n\u03c3_{ij}^2 are meta-parameters of the exploration that can also be optimized. This argument results in\nthe policy a \u223c \u03c0(a_t|s_t, t) = N(a | \u03b8^T \u03c6(s, t), \u03a3\u0302(s, t)). Inserting the resulting policy into Equation\n(7), we obtain the optimality condition and can derive the update rule\n\n\u03b8' = \u03b8 + E{ \u2211_{t=1}^{T} Q^\u03c0(s, a, t) W(s, t) }^{\u22121} E{ \u2211_{t=1}^{T} Q^\u03c0(s, a, t) W(s, t) \u03b5_t },    (8)\n\nwith W(s, t) = \u03c6(s, t) \u03c6(s, t)^T / (\u03c6(s, t)^T \u03c6(s, t)). Note that for our motor primitives W reduces\nto a diagonal, constant matrix and cancels out. Hence the simplified form in Algorithm 1. In\norder to reduce the number of rollouts in this on-policy scenario, we reuse the rollouts through\nimportance sampling as described in the context of reinforcement learning in [1]. To avoid the\nfragility sometimes resulting from importance sampling in reinforcement learning, samples with\nvery small importance weights are discarded. The expectations E{\u00b7} are replaced by the importance\nsampler denoted by \u27e8\u00b7\u27e9_{w(\u03c4)}. The resulting algorithm is shown in Algorithm 1. As we will see in\nSection 3, this PoWER method outperforms all other described methods significantly.\n\n3 Application to Motor Primitive Learning for Robotics\nIn this section, we demonstrate the effectiveness of the algorithm presented in Section 2.3 in the\ncontext of motor primitive learning for robotics. For doing so, we will first give a quick overview\nof how the motor primitives work and how the algorithm can be used to adapt them. As a first evaluation,\nwe will show that the novel PoWER algorithm outperforms many previous well-known\nmethods, i.e., \u2018Vanilla\u2019 Policy Gradients, Finite Difference Gradients, the Episodic Natural Actor\nCritic and the generalized Reward-Weighted Regression on the two simulated benchmark problems\nsuggested in [10] and a simulated Underactuated Swing-Up [21]. Real robot applications are done\nwith our best benchmarked method, the PoWER method. Here, we first show that PoWER can learn\nthe Underactuated Swing-Up [21] even on a real robot. As a significantly more complex motor\nlearning task, we show how the robot can learn a high-speed Ball-in-a-Cup [24] movement with\nmotor primitives for all seven degrees of freedom of our Barrett WAM\u2122 robot arm.\n\n3.1 Using the Motor Primitives in Policy Search\nThe motor primitive framework [22, 23] can be described as two coupled differential equations, i.e.,\nwe have a canonical system \u1e8f = f(y, z) with movement phase y and possible external coupling to z\nas well as a nonlinear system \u1e8d = g(x, \u1e8b, y, \u03b8) which yields the current action for the system. Both\ndynamical systems are chosen to be stable and to have the right properties so that they are useful for\nthe desired class of motor control problems. 
In this paper, we focus on single stroke movements as\nthey frequently appear in human motor control [14, 23] and, thus, we will always choose the point\nattractor version of the motor primitives exactly as presented in [23] and not the older one in [22].\nThe biggest advantage of the motor primitive framework of [22, 23] is that the function g is linear\nin the policy parameters \u03b8 and, thus, well-suited for imitation learning as well as for our presented\nreinforcement learning algorithm. For example, if we would have to learn only a motor primitive for\na single degree of freedom qi, then we could use a motor primitive in the form \u00af\u00a8qi = g(qi, \u02d9qi, y, \u03b8) =\n\u03c6(s)T\u03b8 where s = [qi, \u02d9qi, y] is the state and where time is implicitly embedded in y. We use the\noutput of \u00af\u00a8qi = \u03c6(s)T\u03b8 = \u00afa as the policy mean. The perturbed accelerations \u00a8qi = a = \u00afa + \u03b5 is given\nto the system. The details of \u03c6 are given in [23].\nIn Sections 3.3 and 3.4, we use im-\nitation learning for the initialization.\nFor imitations, we follow [22]: \ufb01rst,\nextract the duration of the movement\nfrom initial and \ufb01nal zero velocity\nand use it to adjust the time constants.\nSecond, use locally-weighted regres-\nsion to solve for an imitation from a\nsingle example.\n\nFigure 1: This \ufb01gure shows the mean performance of all\ncompared methods in two benchmark tasks averaged over\ntwenty learning runs with the error bars indicating the stan-\ndard deviation. 
Policy learning by Weighting Exploration\nwith the Returns (PoWER) clearly outperforms Finite Dif-\nference Gradients (FDG), \u2018Vanilla\u2019 Policy Gradients (VPG),\nthe Episodic Natural Actor Critic (eNAC) and the adapted\nReward-Weighted Regression (RWR) for both tasks.\n\n3.2 Benchmark Comparison\nAs benchmark comparison, we in-\ntend to follow a previously studied\nscenario in order to evaluate which\nmethod is best-suited for our prob-\nlem class. For doing so, we perform\nour evaluations on the exact same\nbenchmark problems as [10] and use\ntwo tasks commonly studied in mo-\ntor control literature for which the analytic solutions are known, i.e., a reaching task where a goal\nhas to be reached at a certain time while the used motor commands have to be minimized and a\nreaching task of the same style with an additional via-point. In this comparison, we mainly want to\nshow the suitability of our algorithm and show that it outperforms previous methods such as Finite\nDifference Gradient (FDG) methods [10], \u2018Vanilla\u2019 Policy Gradients (VPG) with optimal baselines\n[7, 8, 11, 10], the Episodic Natural Actor Critic (eNAC) [9, 10], and the episodic version of the\nReward-Weighted Regression (RWR) algorithm [18]. For both tasks, we use the same rewards as in\n[10] but we use the newer form of the motor primitives from [23]. All open parameters were manu-\nally optimized for each algorithm in order to maximize the performance while not destabilizing the\nconvergence of the learning process.\nWhen applied in the episodic scenario, Policy learning by Weighting Exploration with the Returns\n(PoWER) clearly outperformed the Episodic Natural Actor Critic (eNAC), \u2018Vanilla\u2019 Policy Gradient\n(VPG), Finite Difference Gradient (FDG) and the adapted Reward-Weighted Regression (RWR)\nfor both tasks. 
The episodic Reward-Weighted Regression (RWR) is outperformed by all other\nalgorithms suggesting that this algorithm does not generalize well from the immediate reward case.\nWhile FDG gets stuck on a plateau, both eNAC and VPG converge to the same, good final solution.\nPoWER finds the same (or even slightly better) solution while achieving it noticeably faster. The\nresults are presented in Figure 1. Note that this plot has logarithmic scales on both axes, thus a\nunit difference corresponds to an order of magnitude. The omission of the first twenty rollouts was\nnecessary to cope with the log-log presentation.\n\n[Figure 1 panels: (a) minimum motor command, (b) passing through a point; axes: number of rollouts vs. average return; legend: FDG, VPG, eNAC, RWR, PoWER]\n\nFigure 2: This figure shows the time series of the Underactuated Swing-Up where only a single joint\nof the robot is moved with a torque limit ensured by limiting the maximal motor current of that joint.\nThe resulting motion requires the robot to (i) first move away from the target to limit the maximal\nrequired torque during the swing-up in (ii-iv) and subsequent stabilization (v). The performance of\nthe PoWER method on the real robot is shown in (vi).\n\nFigure 3: This figure shows the performance of all compared methods for the\nswing-up in simulation, given as the mean\nperformance averaged over 20 learning\nruns with the error bars indicating the standard\ndeviation. PoWER outperforms the\nother algorithms from 50 rollouts on and\nfinds a significantly better policy.\n\n3.3 Underactuated Swing-Up\nAs an additional simulated benchmark and for the real-robot\nevaluations, we employed the Underactuated\nSwing-Up [21]. Here, only a single degree of freedom\nis represented by the motor primitive as described\nin Section 3.1. 
The goal is to move a hanging heavy\npendulum to an upright position and stabilize it there\nin minimum time and with minimal motor torques.\nBy limiting the motor current for that degree of freedom,\nwe can ensure that the torque limits described in\n[21] are maintained and directly moving the joint to\nthe right position is not possible. Under these torque\nlimits, the robot needs to (i) first move away from the\ntarget to limit the maximal required torque during the\nswing-up in (ii-iv) and subsequent stabilization (v) as\nillustrated in Figure 2 (i-v). This problem is similar to\na mountain-car problem where the car would have to stop on top or experience a failure.\nThe applied torque limits were the same as in [21] and so was the reward function, except\nthat the complete return of the trajectory was transformed by an exp(\u00b7) to ensure positivity. Again all\nopen parameters were manually optimized. The motor primitive with nine shape parameters and one\ngoal parameter was initialized by imitation learning from a kinesthetic teach-in. Subsequently, we\ncompared PoWER with the other algorithms considered in Section 3.2 and could show that PoWER\nagain outperforms them. The results are given in Figure 3. As it turned out to be the best\nperforming method, we then used it successfully for learning optimal swing-ups on a real robot. See\nFigure 2 (vi) for the resulting real-robot performance.\n3.4 Ball-in-a-Cup on a Barrett WAM\u2122\nThe most challenging application in this paper is the children\u2019s game Ball-in-a-Cup [24] where a\nsmall cup is attached at the robot\u2019s end-effector and this cup has a small wooden ball hanging down\nfrom the cup on a 40 cm string. Initially, the ball is hanging down vertically. The robot needs\nto move fast in order to induce a motion at the ball through the string, swing it up and catch it\nwith the cup; a possible movement is illustrated in Figure 4 (top row). 
The state of the system is\ndescribed in joint angles and velocities of the robot and the Cartesian coordinates of the ball. The\nactions are the joint space accelerations where each of the seven joints is represented by a motor\nprimitive. All motor primitives are perturbed separately but employ the same joint final reward\ngiven by r(t_c) = exp(\u2212\u03b1(x_c \u2212 x_b)^2 \u2212 \u03b1(y_c \u2212 y_b)^2) while r(t) = 0 for all other t \u2260 t_c where t_c\nis the moment where the ball passes the rim of the cup with a downward direction, the cup position\ndenoted by [x_c, y_c, z_c] \u2208 R^3, the ball position [x_b, y_b, z_b] \u2208 R^3 and a scaling parameter \u03b1 = 100.\nThe task is quite complex as the reward is not modified solely by the movements of the cup but\nforemost by the movements of the ball and the movements of the ball are very sensitive to changes\nin the movement. A small perturbation of the initial condition or during the trajectory will drastically\nchange the movement of the ball and hence the outcome of the rollout.\n\n[Figure 3 plot: average return vs. number of rollouts; legend: RWR, PoWER, FDG, VPG, eNAC]\n\nFigure 4: This figure shows schematic drawings of the Ball-in-a-Cup motion, the final learned robot\nmotion as well as a kinesthetic teach-in. The green arrows show the directions of the current\nmovements in that frame. The human cup motion was taught to the robot by imitation learning with\n31 parameters per joint for an approximately 3 seconds long trajectory. The robot manages to\nreproduce the imitated motion quite accurately, but the ball misses the cup by several centimeters.\nAfter ca. 75 iterations of our Policy learning by Weighting Exploration with the Returns (PoWER)\nalgorithm the robot has improved its motion so that the ball goes in the cup. 
Also see Figure 5.\n\nFigure 5: This figure shows the expected\nreturn of the learned policy in the Ball-in-a-Cup\nevaluation averaged over 20 runs.\n\nDue to the complexity of the task, Ball-in-a-Cup is\na hard motor learning task even for children, who usually\nonly succeed at it after observing another person\nplaying and after a lot of improvement by trial-and-error.\nMimicking how children learn to play Ball-in-a-Cup,\nwe first initialize the motor primitives by imitation and,\nsubsequently, improve them by reinforcement learning.\nWe recorded the motions of a human player by\nkinesthetic teach-in in order to obtain an example for\nimitation as shown in Figure 4 (middle row). From the\nimitation, it can be determined by cross-validation that\n31 parameters per motor primitive are needed. As expected,\nthe robot fails to reproduce the presented\nbehavior and reinforcement learning is needed for self-improvement. Figure 5 shows the expected\nreturn over the number of rollouts where convergence to a maximum is clearly recognizable. The\nrobot regularly succeeds at bringing the ball into the cup after approximately 75 iterations.\n4 Conclusion\nIn this paper, we have presented a new perspective on policy learning methods and an application\nto a highly complex motor learning task on a real Barrett WAM\u2122 robot arm. We have generalized\nthe previous work in [17, 18] from the immediate reward case to the episodic case. In the process,\nwe could show that policy gradient methods are a special case of this more general framework.\nDuring initial experiments, we realized that the form of exploration highly influences the speed of\nthe policy learning method. 
This empirical insight resulted in a novel policy learning algorithm,\nPolicy learning by Weighting Exploration with the Returns (PoWER), an EM-inspired algorithm\nthat outperforms several other policy search methods both on standard benchmarks and on a\nsimulated Underactuated Swing-Up.\nWe successfully applied this novel PoWER algorithm in the context of learning two tasks on a\nphysical robot, i.e., the Underactuated Swing-Up and Ball-in-a-Cup. Due to the curse of dimensionality,\nwe cannot start with an arbitrary solution. Instead, we mimic the way children learn Ball-in-a-Cup\nand first present an example for imitation learning which is recorded using kinesthetic teach-in.\nSubsequently, our reinforcement learning algorithm takes over and learns how to move the ball into\nthe cup reliably. After a realistically small number of episodes, the task can be regularly fulfilled and the\nrobot shows very good average performance.\n\nReferences\n[1] R. Sutton and A. Barto. Reinforcement Learning. MIT Press, 1998.\n[2] J. Bagnell, S. Kakade, A. Ng, and J. Schneider. Policy search by dynamic programming. In Advances in Neural Information Processing Systems (NIPS), 2003.\n[3] A. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In International Conference on Uncertainty in Artificial Intelligence (UAI), 2000.\n[4] F. Guenter, M. Hersch, S. Calinon, and A. Billard. Reinforcement learning for imitating constrained reaching movements. RSJ Advanced Robotics, 21:1521\u20131544, 2007.\n[5] M. Toussaint and C. Goerick. Probabilistic inference for structured planning in robotics. In International Conference on Intelligent Robots and Systems (IROS), 2007.\n[6] M. Hoffman, A. Doucet, N. de Freitas, and A. Jasra. Bayesian policy learning with trans-dimensional MCMC. In Advances in Neural Information Processing Systems (NIPS), 2007.\n[7] R. J. Williams. 
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229\u2013256, 1992.\n[8] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), 2000.\n[9] J. Bagnell and J. Schneider. Covariant policy search. In International Joint Conference on Artificial Intelligence (IJCAI), 2003.\n[10] J. Peters and S. Schaal. Policy gradient methods for robotics. In International Conference on Intelligent Robots and Systems (IROS), 2006.\n[11] G. Lawrence, N. Cowan, and S. Russell. Efficient gradient estimation for motor control learning. In International Conference on Uncertainty in Artificial Intelligence (UAI), 2003.\n[12] H. Attias. Planning by probabilistic inference. In Ninth International Workshop on Artificial Intelligence and Statistics (AISTATS), 2003.\n[13] J. Binder, D. Koller, S. Russell, and K. Kanazawa. Adaptive probabilistic networks with hidden variables. Machine Learning, 29:213\u2013244, 1997.\n[14] G. Wulf. Attention and motor skill learning. Human Kinetics, Champaign, IL, 2007.\n[15] D. E. Kirk. Optimal control theory. Prentice-Hall, Englewood Cliffs, New Jersey, 1970.\n[16] G. J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley Series in Probability and Statistics. John Wiley & Sons, 1997.\n[17] P. Dayan and G. E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271\u2013278, 1997.\n[18] J. Peters and S. Schaal. Reinforcement learning by reward-weighted regression for operational space control. In International Conference on Machine Learning (ICML), 2007.\n[19] T. R\u00fcckstie\u00df, M. Felder, and J. Schmidhuber. State-dependent exploration for policy gradient methods. 
In European Conference on Machine Learning (ECML), 2008.\n[20] M. Kawato, F. Gandolfo, H. Gomi, and Y. Wada. Teaching by showing in kendama based on optimization principle. In International Conference on Artificial Neural Networks, 1994.\n[21] C. G. Atkeson. Using local trajectory optimizers to speed up global optimization in dynamic programming. In Advances in Neural Information Processing Systems (NIPS), 1994.\n[22] A. Ijspeert, J. Nakanishi, and S. Schaal. Learning attractor landscapes for learning motor primitives. In Advances in Neural Information Processing Systems (NIPS), 2003.\n[23] S. Schaal, P. Mohajerian, and A. Ijspeert. Dynamics systems vs. optimal control \u2014 a unifying view. Progress in Brain Research, 165(1):425\u2013445, 2007.\n[24] Wikipedia, May 31, 2008. http://en.wikipedia.org/wiki/Ball_in_a_cup\n[25] J. Kober, B. Mohler, and J. Peters. Learning perceptual coupling for motor primitives. In International Conference on Intelligent Robots and Systems (IROS), 2008.", "award": [], "sourceid": 89, "authors": [{"given_name": "Jens", "family_name": "Kober", "institution": null}, {"given_name": "Jan", "family_name": "Peters", "institution": null}]}