{"title": "Model-Based Relative Entropy Stochastic Search", "book": "Advances in Neural Information Processing Systems", "page_first": 3537, "page_last": 3545, "abstract": "Stochastic search algorithms are general black-box optimizers. Due to their ease of use and their generality, they have recently also gained a lot of attention in operations research, machine learning and policy search. Yet, these algorithms require many evaluations of the objective, scale poorly with the problem dimension, are affected by highly noisy objective functions and may converge prematurely. To alleviate these problems, we introduce a new surrogate-based stochastic search approach. We learn simple, quadratic surrogate models of the objective function. As the quality of such a quadratic approximation is limited, we do not greedily exploit the learned models: the algorithm could otherwise be misled by an inaccurate optimum introduced by the surrogate. Instead, we use information-theoretic constraints to bound the `distance' between the new and old data distributions while maximizing the objective function. Additionally, the new method is able to sustain the exploration of the search distribution to avoid premature convergence. 
We compare our method with state-of-the-art black-box optimization methods on standard uni-modal and multi-modal optimization functions, on simulated planar robot tasks and a complex robot ball throwing task. The proposed method considerably outperforms the existing approaches.", "full_text": "Model-Based Relative Entropy Stochastic Search\n\nAbbas Abdolmaleki1,2,3, Rudolf Lioutikov4, Nuno Lau1, Luis Paulo Reis2,3,\n\nJan Peters4,6, and Gerhard Neumann5\n\n1: IEETA, University of Aveiro, Aveiro, Portugal\n\n2: DSI, University of Minho, Braga, Portugal\n3: LIACC, University of Porto, Porto, Portugal\n\n4: IAS, 5: CLAS, TU Darmstadt, Darmstadt, Germany\n\n6: Max Planck Institute for Intelligent Systems, Stuttgart, Germany\n{Lioutikov,peters,neumann}@ias.tu-darmstadt.de\n{abbas.a, nunolau}@ua.pt, lpreis@dsi.uminho.pt\n\nAbstract\n\nStochastic search algorithms are general black-box optimizers. Due to their ease of use and their generality, they have recently also gained a lot of attention in operations research, machine learning and policy search. Yet, these algorithms require many evaluations of the objective, scale poorly with the problem dimension, are affected by highly noisy objective functions and may converge prematurely. To alleviate these problems, we introduce a new surrogate-based stochastic search approach. We learn simple, quadratic surrogate models of the objective function. As the quality of such a quadratic approximation is limited, we do not greedily exploit the learned models: the algorithm could otherwise be misled by an inaccurate optimum introduced by the surrogate. Instead, we use information-theoretic constraints to bound the \u2018distance\u2019 between the new and old data distributions while maximizing the objective function. Additionally, the new method is able to sustain the exploration of the search distribution to avoid premature convergence. 
We compare our method with state-of-the-art black-box optimization methods on standard uni-modal and multi-modal optimization functions, on simulated planar robot tasks and a complex robot ball throwing task. The proposed method considerably outperforms the existing approaches.\n\n1 Introduction\n\nStochastic search algorithms [1, 2, 3, 4] are black-box optimizers of an objective function that is either unknown or too complex to be modeled explicitly. These algorithms only make weak assumptions about the structure of the underlying objective function. They only use the objective values and don\u2019t require gradients or higher derivatives of the objective function. Therefore, they are well suited for black-box optimization problems. Stochastic search algorithms typically maintain a stochastic search distribution over the parameters of the objective function, which is typically a multivariate Gaussian distribution [1, 2, 3]. This search distribution is used to create samples, which are then evaluated on the objective function. Subsequently, a new stochastic search distribution is computed by either computing gradient-based updates [2, 4, 5], evolutionary strategies [1], the cross-entropy method [7], path integrals [3, 8], or information-theoretic policy updates [9]. Information-theoretic policy updates [10, 9, 2] bound the relative entropy (also called Kullback-Leibler or KL divergence) between two subsequent policies. Using a KL-bound for the update of the search distribution is a common approach in stochastic search. However, such information-theoretic bounds could so far only be applied approximately, either by using Taylor expansions of the KL-divergence, resulting in natural evolution strategies (NES) [2, 11], or by sample-based approximations, resulting in the relative entropy policy search (REPS) [9] algorithm.\n\n1\n\n\fIn this paper, we present a novel stochastic search algorithm called MOdel-based Relative-Entropy stochastic search (MORE). 
For the first time, our algorithm bounds the KL divergence of the new and old search distribution in closed form without approximations. We show that this exact bound performs considerably better than approximated KL bounds.\nIn order to do so, we locally learn a simple, quadratic surrogate of the objective function. The quadratic surrogate allows us to compute the new search distribution analytically, where the KL divergence of the new and old distribution is bounded. Therefore, we only exploit the surrogate model locally, which prevents the algorithm from being misled by inaccurate optima introduced by an inaccurate surrogate model.\nHowever, learning quadratic reward models directly in parameter space comes with the burden of quadratically many parameters that need to be estimated. We therefore investigate new methods that rely on dimensionality reduction for learning such surrogate models. In order to avoid over-fitting, we use a supervised Bayesian dimensionality reduction approach, which makes the algorithm applicable also to high-dimensional problems. In addition to solving the search distribution update in closed form, we also lower-bound the entropy of the new search distribution to ensure that exploration is sustained in the search distribution throughout the learning process, and, hence, premature convergence is avoided. We will show that this method is more effective than commonly used heuristics that also enforce exploration, for example, adding a small diagonal matrix to the estimated covariance matrix [3].\nWe provide a comparison of stochastic search algorithms on standard objective functions used for benchmarking and in simulated robotics tasks. The results show that MORE considerably outperforms state-of-the-art methods.\n\n1.1 Problem Statement\nWe want to maximize an objective function R(\u03b8) : Rn \u2192 R. 
The goal is to find one or more parameter vectors \u03b8 \u2208 Rn which have the highest possible objective value. We maintain a search distribution \u03c0(\u03b8) over the parameter space \u03b8 of the objective function R(\u03b8). The search distribution \u03c0(\u03b8) is implemented as a multivariate Gaussian distribution, i.e., \u03c0(\u03b8) = N (\u03b8|\u00b5, \u03a3). In each iteration, the search distribution \u03c0(\u03b8) is used to create samples \u03b8[k] of the parameter vector \u03b8. Subsequently, the (possibly noisy) evaluation R[k] of \u03b8[k] is obtained by querying the objective function. The samples {\u03b8[k], R[k]}k=1...N are subsequently used to compute a new search distribution. This process runs iteratively until the algorithm converges to a solution.\n\n1.2 Related Work\n\nRecent information-theoretic (IT) policy search algorithms [9] are based on the relative entropy policy search (REPS) algorithm, which was proposed in [10] as a step-based policy search algorithm. However, in [9] an episode-based version of REPS that is equivalent to stochastic search was presented. The key idea behind episode-based REPS is to control the exploration-exploitation trade-off by bounding the relative entropy between the old \u2018data\u2019 distribution q(\u03b8) and the newly estimated search distribution \u03c0(\u03b8) by a bound \u03b5. Due to the relative entropy bound, the algorithm achieves a smooth and stable learning process. However, the episodic REPS algorithm uses a sample-based approximation of the KL-bound, which requires many samples in order to be accurate. Moreover, a typical problem of REPS is that the entropy of the search distribution decreases too quickly, resulting in premature convergence.\nTaylor approximations of the KL-divergence have also been used very successfully in the area of stochastic search, resulting in natural evolution strategies (NES). 
NES uses the natural gradient to optimize the objective [2]. The natural gradient has been shown to outperform the standard gradient in many applications in machine learning [12]. The intuition behind the natural gradient is that we want to obtain an update direction of the parameters of the search distribution that is most similar to the standard gradient while the KL-divergence between new and old search distributions is bounded. To obtain this update direction, a second-order approximation of the KL, which is equivalent to the Fisher information matrix, is used.\n\n2\n\n\fSurrogate-based stochastic search algorithms [6, 13] have been shown to be more sample efficient than direct stochastic search methods and can also smooth out the noise of the objective function. For example, in [6] an individual optimization method is run on the surrogate and stopped whenever the KL-divergence between the new and the old distribution exceeds a certain bound. For the first time, our algorithm uses the surrogate model to compute the new search distribution analytically, such that the KL divergence of the new and old search distribution is bounded in closed form.\nQuadratic models have been used successfully in trust region methods for local surrogate approximation [14, 15]. These methods do not maintain a stochastic search distribution but a point estimate and a trust region around this point. They update the point estimate by optimizing the surrogate while staying in the trust region. Subsequently, heuristics are used to increase or decrease the trust region. In the MORE algorithm, the trust region is defined implicitly by the KL-bound.\nThe Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is considered the state of the art in stochastic search optimization. 
CMA-ES also maintains a Gaussian distribution over the problem parameter vector and uses well-defined heuristics to update the search distribution.\n\n2 Model-Based Relative Entropy Stochastic Search\n\nSimilar to information-theoretic policy search algorithms [9], we want to control the exploration-exploitation trade-off by bounding the relative entropy of two subsequent search distributions. However, by only bounding the KL, the algorithm can freely trade off adapting the mean and the variance of the search distribution. In order to maximize the objective for the immediate iteration, the shrinkage of the variance typically dominates the contribution to the KL-divergence, which often leads to premature convergence of these algorithms. Hence, in addition to controlling the KL-divergence of the update, we also need to control the shrinkage of the covariance matrix. Such a control mechanism can be implemented by lower-bounding the entropy of the new distribution. In this paper, we always set this bound to a certain percentage of the entropy of the old search distribution such that MORE converges asymptotically to a point estimate.\n\n2.1 The MORE framework\n\nSimilarly to [9], we can formulate an optimization problem to obtain a new search distribution that maximizes the expected objective value while upper-bounding the KL-divergence and lower-bounding the entropy of the distribution\n\nmax_\u03c0 \u222b \u03c0(\u03b8) R\u03b8 d\u03b8, s.t. KL(\u03c0(\u03b8)||q(\u03b8)) \u2264 \u03b5, H(\u03c0) \u2265 \u03b2, 1 = \u222b \u03c0(\u03b8) d\u03b8, (1)\n\nwhere R\u03b8 denotes the expected objective1 when evaluating parameter vector \u03b8. The term H(\u03c0) = \u2212\u222b \u03c0(\u03b8) log \u03c0(\u03b8) d\u03b8 denotes the entropy of the distribution \u03c0 and q(\u03b8) is the old distribution. The parameters \u03b5 and \u03b2 are user-specified parameters to control the exploration-exploitation trade-off of the algorithm.\nWe can obtain a closed-form solution for \u03c0(\u03b8) by optimizing the Lagrangian of the optimization problem given above. This solution is given as\n\n\u03c0(\u03b8) \u221d q(\u03b8)^{\u03b7/(\u03b7+\u03c9)} exp( R\u03b8 / (\u03b7 + \u03c9) ), (2)\n\nwhere \u03b7 and \u03c9 are the Lagrangian multipliers. As we can see, the new distribution is a geometric average between the old sampling distribution q(\u03b8) and the exponential transformation of the objective function. Note that, by setting \u03c9 = 0, we obtain the standard episodic REPS formulation [9]. The optimal values for \u03b7 and \u03c9 can be obtained by minimizing the dual function g(\u03b7, \u03c9) such that \u03b7 > 0 and \u03c9 > 0, see [16]. The dual function g(\u03b7, \u03c9) is given by\n\ng(\u03b7, \u03c9) = \u03b7\u03b5 \u2212 \u03c9\u03b2 + (\u03b7 + \u03c9) log( \u222b q(\u03b8)^{\u03b7/(\u03b7+\u03c9)} exp( R\u03b8 / (\u03b7 + \u03c9) ) d\u03b8 ). (3)\n\n1Note that we are typically not able to obtain the expected reward but only a noisy estimate of the underlying reward distribution.\n\n3\n\n\fAs we are dealing with continuous distributions, the entropy can also be negative. We specify \u03b2 such that the relative difference of H(\u03c0) to a minimum exploration policy H(\u03c00) is decreased by a certain percentage, i.e., we change the entropy constraint to\n\nH(\u03c0) \u2212 H(\u03c00) \u2265 \u03b3(H(q) \u2212 H(\u03c00)) \u21d2 \u03b2 = \u03b3(H(q) \u2212 H(\u03c00)) + H(\u03c00).\n\nThroughout all our experiments, we use the same \u03b3 value of 0.99 and we set the minimum entropy H(\u03c00) of the search distribution to a small enough value such as \u221275. 
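For a Gaussian search distribution, the entropy and the resulting bound β can be computed directly. A minimal sketch (the function names are ours; γ = 0.99 and H(π0) = −75 are the values used in the paper):

```python
# Sketch of the entropy lower bound beta = gamma*(H(q) - H(pi0)) + H(pi0)
# for a Gaussian old distribution q = N(b, Q).
import numpy as np

def gaussian_entropy(Q):
    """Differential entropy of N(mu, Q): 0.5 * log|2*pi*e*Q|."""
    d = Q.shape[0]
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + np.linalg.slogdet(Q)[1])

def entropy_bound(Q_old, gamma=0.99, H0=-75.0):
    """beta such that the entropy gap to H0 shrinks by the factor gamma."""
    Hq = gaussian_entropy(Q_old)
    return gamma * (Hq - H0) + H0
```

Because γ < 1, the bound always lies strictly between the minimum entropy H(π0) and the current entropy H(q), so the entropy can only decay gradually toward H(π0).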
We will show that using the additional entropy bound considerably alleviates the premature convergence problem.\n\n2.2 Analytic Solution of the Dual-Function and the Policy\n\nUsing a quadratic surrogate model of the objective function, we can compute the integrals in the dual function analytically, and, hence, we can satisfy the introduced bounds exactly in the MORE framework. At the same time, we take advantage of surrogate models, such as a smoothed estimate in the case of noisy objective functions and a decrease in the sample complexity2.\nWe will for now assume that we are given a quadratic surrogate model\n\nR\u03b8 \u2248 \u03b8^T R \u03b8 + \u03b8^T r + r0\n\nof the objective function R\u03b8, which we will learn from data in Section 3. Moreover, the search distribution is Gaussian, i.e., q(\u03b8) = N (\u03b8|b, Q). In this case, the integrals in the dual function given in Equation (3) can be solved in closed form. The integral inside the log-term in Equation (3) now represents an integral over an un-normalized Gaussian distribution. Hence, the integral evaluates to the inverse of the normalization factor of the corresponding Gaussian. After rearranging terms, the dual can be written as\n\ng(\u03b7, \u03c9) = \u03b7\u03b5 \u2212 \u03b2\u03c9 + (1/2)( f^T F f \u2212 \u03b7 b^T Q\u22121 b \u2212 \u03b7 log |2\u03c0Q| + (\u03b7 + \u03c9) log |2\u03c0(\u03b7 + \u03c9)F| ), (4)\n\nwith F = (\u03b7Q\u22121 \u2212 2R)\u22121 and f = \u03b7Q\u22121b + r. Hence, the dual function g(\u03b7, \u03c9) can be efficiently evaluated by matrix inversions and matrix products. Note that, for a large enough value of \u03b7, the matrix F will be positive definite and hence invertible even if R is not. 
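The closed-form quantities of Equations (4) and (5) can be sketched numerically. This is a minimal illustration, not the authors' implementation: it assumes a given quadratic model (R, r; the constant r0 drops out of the update) and fixed multipliers η, ω, which MORE would obtain by minimizing the dual under η > 0, ω > 0.

```python
# Sketch of Eq. (4) (dual) and Eq. (5) (policy update) for a Gaussian
# q = N(b, Q) and a quadratic surrogate R_theta ~ theta^T R theta + theta^T r.
import numpy as np

def dual(eta, omega, b, Q, R, r, eps, beta):
    """Evaluate g(eta, omega) from Eq. (4) for given multipliers."""
    Qinv = np.linalg.inv(Q)
    F = np.linalg.inv(eta * Qinv - 2.0 * R)   # F = (eta Q^-1 - 2R)^-1
    f = eta * Qinv @ b + r                    # f = eta Q^-1 b + r
    _, logdet_Q = np.linalg.slogdet(2.0 * np.pi * Q)
    _, logdet_F = np.linalg.slogdet(2.0 * np.pi * (eta + omega) * F)
    g = eta * eps - omega * beta + 0.5 * (
        f @ F @ f - eta * b @ Qinv @ b - eta * logdet_Q
        + (eta + omega) * logdet_F)
    return g

def new_policy(eta, omega, b, Q, R, r):
    """pi(theta) = N(F f, F (eta + omega)), Eq. (5)."""
    Qinv = np.linalg.inv(Q)
    F = np.linalg.inv(eta * Qinv - 2.0 * R)
    f = eta * Qinv @ b + r
    return F @ f, F * (eta + omega)
```

Note how the new mean F f blends the old mean (through η Q⁻¹ b) with the surrogate's linear term r, and how larger η keeps the update closer to q.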
In our optimization, we always restrict the \u03b7 values such that F stays positive definite3. In practice, we could always find an \u03b7 value that yields the desired KL-divergence. In contrast to MORE, episodic REPS relies on a sample-based approximation of the integrals in the dual function in Equation (3). It uses the sampled rewards R\u03b8 of the parameters \u03b8 to approximate this integral.\nWe can also obtain the update rule for the new policy \u03c0(\u03b8). From Equation (2), we know that the new policy is the geometric average of the Gaussian sampling distribution q(\u03b8) and a squared exponential given by the exponentially transformed surrogate. After re-arranging terms and completing the square, the new policy can be written as\n\n\u03c0(\u03b8) = N (\u03b8|F f , F (\u03b7 + \u03c9)), (5)\n\nwhere F and f are given above.\n\n3 Learning Approximate Quadratic Models\n\nIn this section, we show how to learn the quadratic surrogate. Note that we use the quadratic surrogate in each iteration to approximate the objective function locally, not globally. As the search distribution will shrink in each iteration, the model error will also vanish asymptotically. A quadratic surrogate is also a natural choice if a Gaussian distribution is used, because the exponent of the Gaussian is also quadratic in the parameters. Hence, even a more complex surrogate could not be exploited by a Gaussian distribution. A local quadratic surrogate model provides second-order information similar to the Hessian in standard gradient updates. 
However, a quadratic surrogate model also has quadratically many parameters which we have to estimate from an (ideally) very small data set.\n\n2The regression performed for learning the quadratic surrogate model estimates the expectation of the objective function from the observed samples.\n\n3To optimize g, any constrained nonlinear optimization method can be used [13].\n\n4\n\n\f(a) Rosenbrock\n\n(b) Rastrigin\n\n(c) Noisy Function\n\nFigure 1: Comparison of stochastic search methods for optimizing the uni-modal Rosenbrock (a) and the multi-modal Rastrigin (b) function. (c) Comparison for a noisy objective function. All results show that MORE clearly outperforms other methods.\n\nTherefore, already learning a simple local quadratic surrogate is a challenging task. In order to learn the local quadratic surrogate, we can use linear regression to fit a function of the form f (\u03b8) = \u03c6(\u03b8)\u03b2, where \u03c6(\u03b8) is a feature function that returns a bias term, all linear and all quadratic terms of \u03b8. Hence, the dimensionality of \u03c6(\u03b8) is D = 1 + d + d(d + 1)/2, where d is the dimensionality of the parameter space. To reduce the dimensionality of the regression problem, we project \u03b8 into a lower-dimensional space l(p\u00d71) = W \u03b8 and solve the linear regression problem in this reduced space4. The quadratic form of the objective function can then be computed from \u03b2 and W . Still, the question remains how to choose the projection matrix W . We did not achieve good performance with standard PCA [17] as PCA is unsupervised. Yet, the W matrix is typically quite high-dimensional such that it is hard to obtain the matrix by supervised learning and simultaneously avoid over-fitting. 
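The feature map and the reduced-space regression can be sketched as follows. The function names and the small ridge term are our assumptions; the paper uses the Bayesian treatment of Section 3.1 rather than this plain point estimate:

```python
# Sketch of the quadratic feature map phi(theta) (bias, linear and quadratic
# terms, D = 1 + d + d(d+1)/2 features) and a least-squares fit of
# r ~ phi(W theta)^T beta in the p-dimensional reduced space.
import numpy as np

def quad_features(theta):
    d = len(theta)
    iu = np.triu_indices(d)  # upper triangle incl. diagonal: d(d+1)/2 terms
    return np.concatenate(([1.0], theta, np.outer(theta, theta)[iu]))

def fit_reduced_quadratic(thetas, rewards, W, ridge=1e-6):
    """Ridge-regularized least squares for beta in the reduced space."""
    Phi = np.array([quad_features(W @ th) for th in thetas])
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ rewards)
```

With p much smaller than d, only 1 + p + p(p + 1)/2 coefficients have to be estimated instead of the full D, which is what makes the reduction attractive.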
Inspired by [18], where supervised Bayesian dimensionality reduction is used for classification, we also use a supervised Bayesian approach where we integrate out the projection matrix W .\n\n3.1 Bayesian Dimensionality Reduction for Quadratic Functions\n\nIn order to integrate out the parameters W , we use the following probabilistic dimensionality reduction model\n\np(r\u2217|\u03b8\u2217, D) = \u222b p(r\u2217|\u03b8\u2217, W ) p(W |D) dW , (6)\n\nwhere r\u2217 is the prediction of the objective at query point \u03b8\u2217 and D is the training data set consisting of parameters \u03b8[k] and their objective evaluations R[k]. The posterior for W is given by Bayes rule, i.e., p(W |D) = p(D|W )p(W )/p(D). The likelihood function p(D|W ) is given by\n\np(D|W ) = \u222b p(D|W , \u03b2) p(\u03b2) d\u03b2, (7)\n\nwhere p(D|W , \u03b2) is the likelihood of the linear model \u03b2 and p(\u03b2) its prior. For the likelihood of the linear model we use a multiplicative noise model, i.e., the higher the absolute value of the objective, the higher the variance. The intuition behind this choice is that we are mainly interested in minimizing the relative error instead of the absolute error5. Our likelihood and prior are therefore given by\n\np(D|W , \u03b2) = \u220f_{k=1...N} N (R[k]|\u03c6(W \u03b8[k])\u03b2, \u03c32|R[k]|), p(\u03b2) = N (\u03b2|0, \u03c4 2I), (8)\n\n4W (p\u00d7d) is a projection matrix that projects a vector from the d-dimensional parameter space to a p-dimensional space.\n\n5We observed empirically that such a relative error performs better if we have non-smooth objective functions with a large difference in the objective values. 
For example, an error of 10 has a huge influence for an objective value of \u22121, while for a value of \u221210000, such an error is negligible.\n\n5\n\n\fEquation (7) is a weighted Bayesian linear regression model in \u03b2 where the weight of each sample is given by |R[k]|\u22121. Therefore, p(D|W ) can be obtained efficiently in closed form. However, due to the feature transformation, the output R[k] depends non-linearly on the projection W . Therefore, the posterior p(W |D) cannot be obtained in closed form any more. We use a simple sample-based approach in order to approximate the posterior p(W |D). We use K samples from the prior p(W ) to approximate the integrals in Equation (6) and in p(D). In this case, the predictive model is given by\n\np(r\u2217|\u03b8\u2217, D) \u2248 (1/K) \u2211_i p(r\u2217|\u03b8\u2217, W i) p(D|W i)/p(D), (9)\n\nwhere p(D) \u2248 (1/K) \u2211_i p(D|W i). The prediction for a single W i can again be obtained by a standard Bayesian linear regression. Our algorithm is only interested in the expectation R\u03b8 = E[r|\u03b8] in the form of a quadratic model. Given a certain W i, we can obtain a single quadratic model from \u03c6(W i\u03b8)\u00b5\u03b2, where \u00b5\u03b2 is the mean of the posterior distribution p(\u03b2|W , D) obtained by Bayesian linear regression. The expected quadratic model is then obtained by a weighted average over all K quadratic models with weights p(D|W i)/p(D). Note that the posterior can be approximated better with a higher number K of projection matrix samples. 
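The sample-based approximation of Equation (9) can be sketched as below. This is a simplified illustration under assumptions of ours: it uses a plain homoscedastic Bayesian linear regression (fixed σ², τ²) instead of the multiplicative-noise model of Equation (8), and all function names are hypothetical.

```python
# Sketch of Eq. (9): sample K projections W_i from the prior, score each by
# its marginal likelihood p(D|W_i), and average predictions with weights
# p(D|W_i)/p(D).  Simplified to homoscedastic noise for brevity.
import numpy as np

def quad_features(x):
    iu = np.triu_indices(len(x))
    return np.concatenate(([1.0], x, np.outer(x, x)[iu]))

def log_evidence(Phi, y, sigma2=0.1, tau2=1.0):
    """log p(D|W): marginal of y = Phi beta + noise, beta ~ N(0, tau2 I)."""
    S = tau2 * Phi @ Phi.T + sigma2 * np.eye(len(y))
    _, logdet = np.linalg.slogdet(2.0 * np.pi * S)
    return -0.5 * (logdet + y @ np.linalg.solve(S, y))

def predict(thetas, y, theta_star, d, p, K=50, seed=0, sigma2=0.1):
    rng = np.random.default_rng(seed)
    log_ev, preds = [], []
    for _ in range(K):
        W = rng.normal(size=(p, d))  # sample from the prior p(W)
        Phi = np.array([quad_features(W @ th) for th in thetas])
        log_ev.append(log_evidence(Phi, y, sigma2=sigma2))
        # posterior mean of beta, then its prediction at theta_star
        A = Phi.T @ Phi / sigma2 + np.eye(Phi.shape[1])
        mu_beta = np.linalg.solve(A, Phi.T @ y / sigma2)
        preds.append(quad_features(W @ theta_star) @ mu_beta)
    w = np.exp(np.array(log_ev) - np.max(log_ev))
    w /= w.sum()  # normalized weights p(D|W_i)/p(D)
    return float(np.dot(w, preds))
```

Projections that explain the data well dominate the weighted average, which is how the supervised signal selects among the randomly drawn W_i.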
Generating these samples is typically inexpensive as it only requires computation time but no evaluations of the objective function. We also investigated using more sophisticated sampling techniques such as elliptical slice sampling [19], which achieved a similar performance but considerably increased computation time. Further optimization of the sampling technique is part of future work.\n\n4 Experiments\n\nWe compare MORE with state-of-the-art methods in stochastic search and policy search such as CMA-ES [1], NES [2], PoWER [20] and episodic REPS [9]. In our first experiments, we use standard optimization test functions [21], such as the Rosenbrock (uni-modal) and the Rastrigin (multi-modal) functions. We use a 15-dimensional version of these functions.\nFurthermore, we use a 5-link planar robot that has to reach a given point in task space as a toy task for the comparisons. The resulting policy has 25 parameters, but we also test the algorithms in high-dimensional parameter spaces by scaling the robot up to 30 links (150 parameters). We subsequently made the task more difficult by introducing hard obstacles, which results in a discontinuous objective function. We denote this task the hole-reaching task. Finally, we evaluate our algorithm on a physical simulation of a robot playing beer pong. The used parameters of the algorithms and a detailed evaluation of the parameters of MORE can be found in the supplement.\n\n4.1 Standard Optimization Test Functions\n\nWe chose one uni-modal function, the Rosenbrock function f (x) = \u2211_{i=1...n\u22121} [100(x_{i+1} \u2212 x_i^2)^2 + (1 \u2212 x_i)^2], and one multi-modal function, the Rastrigin function f (x) = 10n + \u2211_{i=1...n} [x_i^2 \u2212 10 cos(2\u03c0x_i)]. Both functions have a global minimum of f (x) = 0. In our experiments, the mean of the initial distributions has been chosen randomly.\nAlgorithmic Comparison. 
We compared our algorithm against CMA-ES, NES, PoWER and REPS. In each iteration, we generated 15 new samples6. For MORE, REPS and PoWER, we always keep the last L = 150 samples, while for NES and CMA-ES only the 15 current samples are kept7. As we can see in Figure 1, MORE outperforms all the other methods in terms of learning speed and final performance on all test functions. However, in terms of computation time, MORE was 5 times slower than the other algorithms. Yet, MORE was sufficiently fast as one policy update took less than 1s.\nPerformance on a Noisy Function. We also conducted an experiment on optimizing the Sphere function where we add multiplicative noise to the reward samples, i.e., y = f (x) + \u03b5|f (x)|, where \u03b5 \u223c N (0, 1.0) and f (x) = x^T M x with a randomly chosen matrix M .\n\n6We use the heuristics introduced in [1, 2] for CMA-ES and NES.\n7NES and CMA-ES typically only use the new samples and discard the old samples. We also tried keeping old samples or generating more new samples, which decreased the performance considerably.\n\n6\n\n\f(a) Reaching Task\n\n(b) High-D Reaching Task\n\n(c) Evaluation of \u03b3\n\nFigure 2: (a) Algorithmic comparison for a planar task (5 joints, 25 parameters). MORE outperforms all the other methods considerably. (b) Algorithmic comparison for a high-dimensional task (30 joints, 150 parameters). The performance of NES degraded while MORE could still outperform CMA-ES. (c) Evaluation of the entropy bound \u03b3. For a low \u03b3, the entropy bound is not active and the algorithm converges prematurely. If \u03b3 is close to one, the entropy is reduced too slowly and convergence takes long.\n\nFigure 1(c) shows that MORE successfully smooths out the noise and converges, while other methods diverge. 
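The benchmark objectives above can be written compactly; a sketch in minimization form, with the noisy sphere of Figure 1(c) included:

```python
# The test functions of Section 4.1 and the multiplicatively noisy sphere.
import numpy as np

def rosenbrock(x):
    """Uni-modal; global minimum f(x) = 0 at x = (1, ..., 1)."""
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

def rastrigin(x):
    """Multi-modal; global minimum f(x) = 0 at x = 0."""
    return 10.0 * len(x) + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x))

def noisy_sphere(x, M, rng):
    """y = f(x) + eps * |f(x)| with eps ~ N(0, 1), f(x) = x^T M x."""
    f = x @ M @ x
    return f + rng.normal() * abs(f)
```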
The results show that MORE can cope with highly noisy reward functions.\n\n4.2 Planar Reaching and Hole Reaching\n\nWe used a 5-link planar robot with DMPs [22] as the underlying control policy. Each link had a length of 1m. The robot is modeled as a decoupled linear dynamical system. The end-effector of the robot has to reach the via-point v50 = [1, 1] at time step 50 and the point v100 = [5, 0] at the final time step T = 100. The reward was given by a quadratic cost term for the two via-points as well as quadratic costs for high accelerations. Note that this objective function is highly non-quadratic in the parameters as the via-points are defined in end-effector space. We used 5 basis functions per degree of freedom for the DMPs while the goal attractor for reaching the final state was assumed to be known. Hence, our parameter vector had 25 dimensions. The setup, including the learned policy, is shown in the supplement.\nAlgorithmic Comparison. We generated 40 new samples per iteration. For MORE and REPS, we always keep the last L = 200 samples, while for NES and CMA-ES only the 40 current samples are kept. We empirically optimized the open parameters of the algorithms by manually testing 50 parameter sets for each algorithm. The results shown in Figure 2(a) clearly show that MORE outperforms all other methods in terms of speed and final performance.\nEntropy Bound. We also evaluated the entropy bound in Figure 2(c). We can see that the entropy constraint is a crucial component of the algorithm to avoid premature convergence.\nHigh-Dimensional Parameter Spaces. We also evaluated the same task with a 30-link planar robot, resulting in a 150-dimensional parameter space. We compared MORE, CMA-ES, REPS and NES. 
While NES considerably degraded in performance, CMA-ES and MORE performed well, with MORE finding considerably better policies (average reward of -6571 versus -15460 for CMA-ES), see Figure 2(b). The setup with the learned policy from MORE is depicted in the supplement.\nWe use the same robot setup as in the planar reaching task for the hole-reaching task. For completing the hole-reaching task, the robot\u2019s end-effector has to reach the bottom of a hole (35cm wide and 1m deep) centered at [2, 0] without any collision with the ground or the walls, see Figure 3(c). The reward was given by a quadratic cost term for the desired final point, quadratic costs for high accelerations and an additional punishment for collisions with the walls. Note that this objective function is discontinuous due to the costs for collisions. The goal attractor of the DMP for reaching the final state in this task is unknown and is also learned. Hence, our parameter vector had 30 dimensions.\nAlgorithmic Comparison. We used the same learning parameters as for the planar reaching task. The results shown in Figure 3(a) show that MORE clearly outperforms all other methods. In this task, NES could not find any reasonable solution while PoWER, REPS and CMA-ES could only learn sub-optimal solutions. MORE could also be tuned to achieve the same learning speed as REPS and CMA-ES, but it would then also converge to a sub-optimal solution.\n\n7\n\n\f(a) Hole Reaching Task\n\n(b) Beer Pong Task\n\n(c) Hole Reaching Task Posture\n\nFigure 3: (a) Algorithmic comparison for the hole-reaching task. MORE could find policies of much higher quality. 
(b) Algorithmic comparison for the beer pong task. Only MORE could reliably learn high-quality policies; for the other methods, even if some trials found good solutions, other trials got stuck prematurely.\n\n4.3 Beer Pong\n\nIn this task, a seven-DoF simulated Barrett WAM robot arm had to play beer pong, i.e., it had to throw a ball such that it bounces once on the table and falls into a cup. The ball was placed in a container mounted on the end-effector. The ball could leave the container by a strong deceleration of the robot\u2019s end-effector. We again used a DMP as the underlying policy representation, where we used the shape parameters (five per DoF) and the goal attractor (one per DoF) as parameters. The mean of our search distribution was initialized with imitation learning. The cup was placed at a distance of 2.2m from the robot and had a height of 7cm. As reward function, we computed the point of the ball trajectory after the bounce on the table where the ball passes the plane of the cup\u2019s opening. The reward was set to be 20 times the negative squared distance of that point to the center of the cup while punishing the acceleration of the joints. We evaluated MORE, CMA-ES, PoWER and REPS on this task. The setup is shown in Figure 4 and the learning curve is shown in Figure 3(b). MORE was able to accurately hit the ball into the cup while the other algorithms could not find a robust policy.\n\nFigure 4: The Beer Pong Task. The robot has to throw a ball such that it bounces off the table and ends up in the cup.\n\n5 Conclusion\n\nUsing KL-bounds to limit the update of the search distribution is a widespread idea in the stochastic search community but typically requires approximations. In this paper, we presented a new model-based stochastic search algorithm that enforces the KL-bound analytically. 
By relying on a Gaussian search distribution and on locally learned quadratic models of the objective function, we can obtain a closed form of the information-theoretic policy update. We also introduced an additional entropy term in the formulation, which is needed to avoid premature shrinkage of the variance of the search distribution. Our algorithm considerably outperforms competing methods in all the considered scenarios. The main disadvantage of MORE is its number of parameters. However, based on our experiments, these parameters are not problem-specific.

Acknowledgment

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No #645582 (RoMaNS), and the first author is supported by FCT under grant SFRH/BD/81155/2011.
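To make the idea of locally learned quadratic surrogates more concrete, the following minimal sketch fits such a model to sampled objective values by regularized least squares. The function name, the feature construction, and the ridge regularizer are our own illustrative assumptions; this is a sketch of the general technique, not the authors' implementation.

```python
import numpy as np

def fit_quadratic_surrogate(thetas, rewards, ridge=1e-6):
    """Fit R(theta) ~= theta^T A theta + b^T theta + c to samples.

    thetas: (n, d) array of sampled parameter vectors.
    rewards: (n,) array of objective values.
    Returns (A, b, c) with A symmetric. Illustrative sketch only.
    """
    n, d = thetas.shape
    iu = np.triu_indices(d)  # indices of the upper triangle, incl. diagonal
    # Quadratic features x_i * x_j (i <= j), linear features, and a bias.
    feats = np.column_stack([
        np.array([np.outer(t, t)[iu] for t in thetas]),
        thetas,
        np.ones(n),
    ])
    # Ridge-regularized normal equations.
    w = np.linalg.solve(feats.T @ feats + ridge * np.eye(feats.shape[1]),
                        feats.T @ rewards)
    n_quad = len(iu[0])
    A = np.zeros((d, d))
    A[iu] = w[:n_quad]
    A = 0.5 * (A + A.T)  # symmetrize; reproduces the fitted quadratic form
    b = w[n_quad:n_quad + d]
    c = w[-1]
    return A, b, c
```

In a MORE-style update, the fitted curvature A and linear term b would then enter the closed-form computation of the new Gaussian search distribution; only the surrogate fit itself is shown here.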