{"title": "CMA-ES with Optimal Covariance Update and Storage Complexity", "book": "Advances in Neural Information Processing Systems", "page_first": 370, "page_last": 378, "abstract": "The covariance matrix adaptation evolution strategy (CMA-ES) is arguably one of the most powerful real-valued derivative-free optimization algorithms, finding many applications in machine learning. The CMA-ES is a Monte Carlo method, sampling from a sequence of multi-variate Gaussian distributions. Given the function values at the sampled points, updating and storing the covariance matrix dominates the time and space complexity in each iteration of the algorithm. We propose a numerically stable quadratic-time covariance matrix update scheme with minimal memory requirements based on maintaining triangular Cholesky factors. This requires a modification of the cumulative step-size adaption (CSA) mechanism in the CMA-ES, in which we replace the inverse of the square root of the covariance matrix by the inverse of the triangular Cholesky factor. Because the triangular Cholesky factor changes smoothly with the matrix square root, this modification does not change the behavior of the CMA-ES in terms of required objective function evaluations as verified empirically. Thus, the described algorithm can and should replace the standard CMA-ES if updating and storing the covariance matrix matters.", "full_text": "CMA-ES with Optimal Covariance Update and\n\nStorage Complexity\n\nOswin Krause\n\nDept. of Computer Science\nUniversity of Copenhagen\n\nCopenhagen, Denmark\n\noswin.krause@di.ku.dk\n\nD\u00eddac R. Arbon\u00e8s\n\nDept. of Computer Science\nUniversity of Copenhagen\n\nCopenhagen, Denmark\n\ndidac@di.ku.dk\n\nChristian Igel\n\nDept. 
of Computer Science\nUniversity of Copenhagen\n\nCopenhagen, Denmark\n\nigel@di.ku.dk\n\nAbstract\n\nThe covariance matrix adaptation evolution strategy (CMA-ES) is arguably one of the most powerful real-valued derivative-free optimization algorithms, finding many applications in machine learning. The CMA-ES is a Monte Carlo method, sampling from a sequence of multi-variate Gaussian distributions. Given the function values at the sampled points, updating and storing the covariance matrix dominates the time and space complexity in each iteration of the algorithm. We propose a numerically stable quadratic-time covariance matrix update scheme with minimal memory requirements based on maintaining triangular Cholesky factors. This requires a modification of the cumulative step-size adaptation (CSA) mechanism in the CMA-ES, in which we replace the inverse of the square root of the covariance matrix by the inverse of the triangular Cholesky factor. Because the triangular Cholesky factor changes smoothly with the matrix square root, this modification does not change the behavior of the CMA-ES in terms of required objective function evaluations, as verified empirically. Thus, the described algorithm can and should replace the standard CMA-ES if updating and storing the covariance matrix matters.\n\n1 Introduction\n\nThe covariance matrix adaptation evolution strategy, CMA-ES [Hansen and Ostermeier, 2001], is recognized as one of the most competitive derivative-free algorithms for real-valued optimization [Beyer, 2007; Eiben and Smith, 2015]. The algorithm has been successfully applied in many unbiased performance comparisons and numerous real-world applications. In machine learning, it is mainly used for direct policy search in reinforcement learning and hyperparameter tuning in supervised learning (e.g., see Gomez et al. 
[2008]; Heidrich-Meisner and Igel [2009a,b]; Igel [2010], and references therein).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThe CMA-ES is a Monte Carlo method for optimizing functions f : R^d → R. The objective function f does not need to be continuous and can be multi-modal, constrained, and disturbed by noise. In each iteration, the CMA-ES samples from a d-dimensional multivariate normal distribution, the search distribution, and ranks the sampled points according to their objective function values. The mean and the covariance matrix of the search distribution are then adapted based on the ranked points. Given the ranking of the sampled points, the runtime of one CMA-ES iteration is ω(d^2), because the square root of the covariance matrix is required, which is typically computed by an eigenvalue decomposition. If the objective function can be evaluated efficiently and/or d is large, the computation of the matrix square root can easily dominate the runtime of the optimization process.\n\nVarious strategies have been proposed to address this problem. The basic approach for reducing the runtime is to perform an update of the matrix only every τ ∈ Ω(d) steps [Hansen and Ostermeier, 1996, 2001], effectively reducing the time complexity to O(d^2). However, this forces the algorithm to use outdated matrices during most iterations and can increase the number of function evaluations. Furthermore, it leads to an uneven distribution of computation time over the iterations. Another approach is to restrict the model complexity of the search distribution [Poland and Zell, 2001; Ros and Hansen, 2008; Sun et al., 2013; Akimoto et al., 2014; Loshchilov, 2014, 2015], for example, to consider only diagonal matrices [Ros and Hansen, 2008]. 
However, this can lead to a drastic increase in the number of function evaluations needed to approximate the optimum if the objective function is not compatible with the restriction, for example, when optimizing highly non-separable problems while only adapting the diagonal of the covariance matrix [Omidvar and Li, 2011]. More recently, methods were proposed that update the Cholesky factor of the covariance matrix instead of the covariance matrix itself [Suttorp et al., 2009; Krause and Igel, 2015]. This works well for some CMA-ES variations (e.g., the (1+1)-CMA-ES and the multi-objective MO-CMA-ES [Suttorp et al., 2009; Krause and Igel, 2015; Bringmann et al., 2013]); however, the original CMA-ES relies on the matrix square root, which cannot be replaced one-to-one by a Cholesky factor.\n\nIn the following, we explore the use of the triangular Cholesky factorization instead of the square root in the standard CMA-ES. In contrast to previous attempts in this direction, we present an approach that comes with a theoretical justification for why it does not deteriorate the algorithm's performance. This approach leads to the optimal asymptotic storage and runtime complexity when adaptation of the full covariance matrix is required, as is the case for non-separable ill-conditioned problems. Our CMA-ES variant, referred to as Cholesky-CMA-ES, reduces the runtime complexity of the algorithm with no significant change in the number of objective function evaluations. It also reduces the memory footprint of the algorithm.\n\nSection 2 briefly describes the original CMA-ES algorithm (for details we refer to Hansen [2015]). In section 3 we propose our new method for approximating the step-size adaptation. We give a theoretical justification for the convergence of the new algorithm. We provide empirical performance results comparing the original CMA-ES with the new Cholesky-CMA-ES using various benchmark functions in section 4. 
Finally, we discuss our results and draw our conclusions.\n\n2 Background\n\nBefore we briefly describe the CMA-ES to fix our notation, we discuss some basic properties of using a Cholesky decomposition to sample from a multi-variate Gaussian distribution. Sampling from a d-dimensional multi-variate normal distribution N(m, Σ), m ∈ R^d, Σ ∈ R^{d×d}, is usually done using a decomposition of the covariance matrix Σ. This could be the square root of the matrix, Σ = HH with H ∈ R^{d×d}, or a lower triangular Cholesky factorization Σ = AA^T, which is related to the square root by the QR-decomposition H = AE, where E is an orthogonal matrix. We can sample a point x from N(m, Σ) using a sample z ∼ N(0, I) by x = Hz + m = AEz + m = Ay + m, where we set y = Ez. We have y ∼ N(0, I) since E is orthogonal. Thus, as long as we are only interested in the value of x and do not need y, we can sample using the Cholesky factor instead of the matrix square root.\n\n2.1 CMA-ES\n\nThe CMA-ES has been proposed by Hansen and Ostermeier [1996, 2001] and its most recent version is described by Hansen [2015]. In the t-th iteration of the algorithm, the CMA-ES samples λ points from a multivariate normal distribution N(m_t, σ_t^2 · C_t), evaluates the objective function f at these points, and adapts the parameters C_t ∈ R^{d×d}, m_t ∈ R^d, and σ_t ∈ R_+. In the following, we present the update procedure in a slightly simplified form (for didactic reasons; we refer to Hansen [2015] for the details). All parameters (μ, λ, ω, c_σ, d_σ, c_c, c_1, c_μ) are set to their default values [Hansen, 2015, Table 1].\n\nFor a minimization task, the λ points are ranked by function value such that f(x_{1,t}) ≤ f(x_{2,t}) ≤ ··· ≤ f(x_{λ,t}). The distribution mean is set to the weighted average m_{t+1} = Σ_{i=1}^μ ω_i x_{i,t}. The weights depend only on the ranking, not on the function values directly. This renders the algorithm invariant under order-preserving transformation of the objective function. Points with smaller ranks (i.e., better objective function values) are given a larger weight ω_i, with Σ_{i=1}^λ ω_i = 1. The weights are zero for ranks larger than μ < λ, which is typically μ = λ/2. Thus, points with function values worse than the median do not enter the adaptation process of the parameters. The covariance matrix is updated using two terms, a rank-1 and a rank-μ update. For the rank-1 update, a long term average of the changes of m_t is maintained:\n\np_{c,t+1} = (1 − c_c) p_{c,t} + √(c_c (2 − c_c) μ_eff) · (m_{t+1} − m_t) / σ_t ,   (1)\n\nwhere μ_eff = 1 / Σ_{i=1}^μ ω_i^2 is the effective sample size given the weights. Note that p_{c,t} is large when the algorithm performs steps in the same direction, while it becomes small when the algorithm performs steps in alternating directions.^1 The rank-μ update estimates the covariance of the weighted steps x_{i,t} − m_t, 1 ≤ i ≤ μ. Combining rank-1 and rank-μ update gives the final update rule for C_t, which can be motivated by principles from information geometry [Akimoto et al., 2012]:\n\nC_{t+1} = (1 − c_1 − c_μ) C_t + c_1 p_{c,t+1} p_{c,t+1}^T + (c_μ / σ_t^2) Σ_{i=1}^μ ω_i (x_{i,t} − m_t)(x_{i,t} − m_t)^T .   (2)\n\nSo far, the update is (apart from initialization) invariant under affine linear transformations (i.e., x ↦ Bx + b, B ∈ GL(d, R)).\n\nThe update of the global step-size parameter σ_t is based on the cumulative step-size adaptation algorithm (CSA). 
It measures the correlation of successive steps in a normalized coordinate system. The goal is to adapt σ_t such that the steps of the algorithm become uncorrelated. Under the assumption that uncorrelated steps are standard normally distributed, a carefully designed long term average over the steps should have the same expected length as a χ-distributed random variable, denoted by E{χ}. The long term average has the form\n\np_{σ,t+1} = (1 − c_σ) p_{σ,t} + √(c_σ (2 − c_σ) μ_eff) · C_t^{−1/2} (m_{t+1} − m_t) / σ_t   (3)\n\nwith p_{σ,1} = 0. The normalization by the factor C_t^{−1/2} is the main difference between equations (1) and (3). It is important because it corrects for a change of C_t between iterations. Without this correction, it is difficult to measure correlations accurately in the un-normalized coordinate system. For the update, the length of p_{σ,t+1} is compared to the expected length E{χ}, and σ_t is changed depending on whether the average step taken is longer or shorter than expected:\n\nσ_{t+1} = σ_t exp( (c_σ / d_σ) ( ‖p_{σ,t+1}‖ / E{χ} − 1 ) )   (4)\n\nThis update is not proven to preserve invariance under affine linear transformations [Auger, 2015], and it is conjectured that it does not.\n\n3 Cholesky-CMA-ES\n\nIn general, computing the matrix square root or the Cholesky factor of a d × d matrix has time complexity ω(d^2) (i.e., it scales worse than quadratically). To reduce this complexity, Suttorp et al. [2009] have suggested replacing the process of updating the covariance matrix and decomposing it afterwards by updates operating directly on the decomposition (i.e., the covariance matrix is never computed and stored explicitly; only its factorization is maintained). 
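To make the sampling equivalence from section 2 concrete, here is a minimal numpy sketch (our own illustration, not code from the paper or its Shark implementation): the symmetric square root and the triangular Cholesky factor reproduce the same covariance and hence yield the same sampling distribution.

```python
import numpy as np

# Illustration (ours): sampling from N(m, C) via the triangular Cholesky
# factor A (C = A A^T) is distributionally equivalent to sampling via the
# symmetric square root H (C = H H), since H = A E with E orthogonal.
rng = np.random.default_rng(0)
d = 4
M = rng.standard_normal((d, d))
C = M @ M.T + d * np.eye(d)        # a symmetric positive definite covariance
m = rng.standard_normal(d)

A = np.linalg.cholesky(C)          # lower triangular Cholesky factor

w, V = np.linalg.eigh(C)           # square root via eigendecomposition,
H = V @ np.diag(np.sqrt(w)) @ V.T  # as used by the reference CMA-ES

x_chol = A @ rng.standard_normal(d) + m  # sample using A
x_sqrt = H @ rng.standard_normal(d) + m  # sample using H
```

Both factorizations reconstruct C exactly (A A^T = H H = C), so the two sampling routes differ only by an orthogonal transformation of the standard normal input.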
Krause and Igel [2015] have shown that the update of C_t in equation (2) can be rewritten as a quadratic-time update of its triangular Cholesky factor A_t with C_t = A_t A_t^T. They consider the special case μ = λ = 1. We propose to extend this update to the standard CMA-ES, which leads to a runtime of O(μ d^2). As typically μ = O(log(d)), this gives a large speed-up compared to the explicit recomputation of the Cholesky factor or the inverse of the covariance matrix.\n\nUnfortunately, the fast Cholesky update cannot be applied directly to the original CMA-ES. To see this, consider the term s_t = C_t^{−1/2} (m_{t+1} − m_t) in equation (3). Rewriting p_{σ,t+1} in terms of s_t in a non-recursive fashion, we obtain\n\np_{σ,t+1} = √(c_σ (2 − c_σ) μ_eff) · Σ_{k=1}^t (1 − c_σ)^{t−k} s_k / σ_k .\n\n^1 Given c_c, the factors in (1) are chosen to compensate for the change in variance when adding distributions. If the ranking of the points were purely random, then √μ_eff · (m_{t+1} − m_t)/σ_t ∼ N(0, C_t), and if C_t = I and p_{c,t} ∼ N(0, I), then p_{c,t+1} ∼ N(0, I).\n\nAlgorithm 1: The Cholesky-CMA-ES.\ninput: λ, μ, m_1, ω_{i=1...μ}, c_σ, d_σ, c_c, c_1, and c_μ\nA_1 = I, p_{c,1} = 0, p_{σ,1} = 0\nfor t = 1, 2, ... do\n    for i = 1, ..., λ do\n        x_{i,t} = σ_t A_t y_{i,t} + m_t,  y_{i,t} ∼ N(0, I)\n    Sort x_{i,t}, i = 1, ..., λ, increasingly by f(x_{i,t})\n    m_{t+1} = Σ_{i=1}^μ ω_i x_{i,t}\n    p_{c,t+1} = (1 − c_c) p_{c,t} + √(c_c (2 − c_c) μ_eff) (m_{t+1} − m_t)/σ_t\n    // Apply formula (2) to A_t\n    A_{t+1} ← √(1 − c_1 − c_μ) A_t\n    A_{t+1} ← rankOneUpdate(A_{t+1}, c_1, p_{c,t+1})\n    for i = 1, ..., μ do\n        A_{t+1} ← rankOneUpdate(A_{t+1}, c_μ ω_i, (x_{i,t} − m_t)/σ_t)\n    // Update σ using ŝ_k as in (5)\n    p_{σ,t+1} = (1 − c_σ) p_{σ,t} + √(c_σ (2 − c_σ) μ_eff) A_t^{−1} (m_{t+1} − m_t)/σ_t\n    σ_{t+1} = σ_t exp( (c_σ/d_σ) ( ‖p_{σ,t+1}‖/E{χ} − 1 ) )\n\nAlgorithm 2: rankOneUpdate(A, β, v)\ninput: Cholesky factor A ∈ R^{d×d} of C, β ∈ R, v ∈ R^d\noutput: Cholesky factor A′ of C + β v v^T\nα ← v\nb ← 1\nfor j = 1, ..., d do\n    A′_jj ← √( A_jj^2 + (β/b) α_j^2 )\n    γ ← A_jj^2 b + β α_j^2\n    for k = j + 1, ..., d do\n        α_k ← α_k − (α_j / A_jj) A_kj\n        A′_kj ← (A′_jj / A_jj) A_kj + (A′_jj β α_j / γ) α_k\n    b ← b + β α_j^2 / A_jj^2\n\nBy the RQ-decomposition, we can find C_t^{1/2} = A_t E_t with E_t being an orthogonal matrix and A_t lower triangular. When replacing s_t by ŝ_t = A_t^{−1} (m_{t+1} − m_t), we obtain\n\np_{σ,t+1} = √(c_σ (2 − c_σ) μ_eff) · Σ_{k=1}^t (1 − c_σ)^{t−k} E_k^T ŝ_k / σ_k .\n\nThus, replacing C_t^{−1/2} by A_t^{−1} introduces a new random rotation matrix E_t^T, which changes in every iteration. Obtaining E_t from A_t can be achieved by the polar decomposition, which is a cubic-time operation: currently there are no algorithms known that can update an existing polar decomposition from an updated Cholesky factor in less than cubic time. 
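For reference, the rank-one update of a triangular Cholesky factor (Algorithm 2) can be transcribed into Python as follows. This is our own sketch for illustration (the paper's implementation lives in Shark with LAPACK); it assumes β is not so negative that positive definiteness is lost.

```python
import numpy as np

def rank_one_update(A, beta, v):
    """Return the lower triangular A' with A' A'^T = A A^T + beta * v v^T.

    O(d^2) transcription of Algorithm 2 (our sketch); assumes the updated
    matrix remains symmetric positive definite.
    """
    A = A.astype(float).copy()
    alpha = np.asarray(v, dtype=float).copy()
    d = A.shape[0]
    b = 1.0
    for j in range(d):
        ajj = A[j, j]
        A[j, j] = np.sqrt(ajj**2 + (beta / b) * alpha[j]**2)
        gamma = ajj**2 * b + beta * alpha[j]**2
        for k in range(j + 1, d):
            alpha[k] -= (alpha[j] / ajj) * A[k, j]
            A[k, j] = (A[j, j] / ajj) * A[k, j] \
                      + (A[j, j] * beta * alpha[j] / gamma) * alpha[k]
        b += beta * alpha[j]**2 / ajj**2
    return A

# demo: update a random SPD matrix and compare with direct refactorization
rng = np.random.default_rng(0)
d = 6
M = rng.standard_normal((d, d))
C = M @ M.T + d * np.eye(d)
v = rng.standard_normal(d)
A = np.linalg.cholesky(C)
A_new = rank_one_update(A, 0.25, v)
```

Applying this routine μ + 1 times per iteration, as in Algorithm 1, gives the overall O(μ d^2) covariance update.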
Thus, if our goal is to apply the fast Cholesky update, we have to perform the update without this correction factor:\n\np_{σ,t+1} ≈ √(c_σ (2 − c_σ) μ_eff) · Σ_{k=1}^t (1 − c_σ)^{t−k} ŝ_k / σ_k .   (5)\n\nThis introduces some error, but we will show in the following that we can expect this error to be small and to decrease over time as the algorithm converges to the optimum. For this, we need the following result:\n\nLemma 1. Consider the sequence of symmetric positive definite matrices (C̄_t)_{t=0}^∞ with C̄_t = C_t (det C_t)^{−1/d}. Assume that C̄_t → C̄ as t → ∞ and that C̄ is symmetric positive definite with det C̄ = 1. Let C̄_t^{1/2} = Ā_t E_t denote the RQ-decomposition of C̄_t^{1/2}, where E_t is orthogonal and Ā_t lower triangular. Then it holds that E_{t−1}^T E_t → I as t → ∞.\n\nProof. Let C̄^{1/2} = Ā E be the RQ-decomposition of C̄^{1/2}. As det C̄ ≠ 0, this decomposition is unique. Because the RQ-decomposition is continuous, it maps convergent sequences to convergent sequences. Therefore E_t → E as t → ∞ and thus E_{t−1}^T E_t → E^T E = I.\n\nThis result establishes that, when C_t converges to a certain shape (but not necessarily to a certain scaling), A_t and thus E_t will also converge (up to scaling). 
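The factorization C^{1/2} = AE with orthogonal E that underlies this argument can be checked numerically; the following is our own sanity-check sketch, not part of the paper's implementation.

```python
import numpy as np

# Illustration (ours): for the Cholesky factor A of C and the symmetric
# square root H = C^{1/2}, the matrix E = A^{-1} H is orthogonal, i.e.
# A and C^{1/2} differ only by a rotation of the coordinate system.
rng = np.random.default_rng(1)
d = 5
M = rng.standard_normal((d, d))
C = M @ M.T + d * np.eye(d)        # symmetric positive definite

A = np.linalg.cholesky(C)
w, V = np.linalg.eigh(C)
H = V @ np.diag(np.sqrt(w)) @ V.T  # symmetric square root C^{1/2}
E = np.linalg.solve(A, H)          # E = A^{-1} C^{1/2}
```

Orthogonality follows directly from E E^T = A^{-1} H H^T A^{-T} = A^{-1} C A^{-T} = I.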
Thus, as we only need the norm of p_{σ,t+1}, we can rotate the coordinate system: multiplying by E_t, we obtain\n\n‖p_{σ,t+1}‖ = ‖E_t p_{σ,t+1}‖ = √(c_σ (2 − c_σ) μ_eff) · ‖ Σ_{k=1}^t (1 − c_σ)^{t−k} E_t E_k^T ŝ_k / σ_k ‖ .   (6)\n\nTherefore, if E_t E_{t−1}^T → I, the error in the norm will also vanish due to the exponential weighting in the summation. Note that this does not hold for an arbitrary decomposition C_t = B_t B_t^T. If we do not constrain B_t to be triangular and allow any matrix, we do not have a bijective mapping between C_t and B_t anymore, and the introduction of d(d−1)/2 additional degrees of freedom (as, e.g., in the update proposed by Suttorp et al. [2009]) allows the creation of non-converging sequences of E_t even for C_t = const. As the CMA-ES is a randomized algorithm, we cannot assume convergence of C_t. However, in simplified algorithms the expectation of C_t converges [Beyer, 2014]. Still, the reasoning behind Lemma 1 establishes that the error caused by replacing s_t by ŝ_t is small if C_t changes slowly. Equation (6) establishes that the error depends only on the rotation of coordinate systems. As the mapping from C_t to the triangular factor A_t is one-to-one and smooth, the coordinate system changes in every step will be small; and because of the exponentially decaying weighting, only the last few coordinate systems matter at a particular time step t.\n\nThe Cholesky-CMA-ES algorithm is given in Algorithm 1. One can derive the algorithm from the standard CMA-ES by decomposing (2) into a number of rank-1 updates, C_{t+1} = (((α C_t + β_1 v_1 v_1^T) + β_2 v_2 v_2^T) ...), and applying them to the Cholesky factor using Algorithm 2.\n\nProperties of the update rule. 
The O(μ d^2) complexity of the update in the Cholesky-CMA-ES is asymptotically optimal.^2 Apart from the theoretical guarantees, there are several additional advantages compared to approaches using a non-triangular Cholesky factorization (e.g., Suttorp et al. [2009]). First, as only triangular matrices have to be stored, the storage complexity is optimal. Second, the diagonal elements of a triangular Cholesky factor are the square roots of the eigenvalues of the factorized matrix, that is, we get the eigenvalues of the covariance matrix for free. These are important, for example, for monitoring the conditioning of the optimization problem and, in particular, to enforce lower bounds on the variances of σ_t C_t projected on its principal components. Third, a triangular matrix can be inverted in quadratic time. Thus, we can efficiently compute A_t^{−1} from A_t when needed, instead of having two separate quadratic-time updates for A_t^{−1} and A_t, which requires more memory and is prone to numerical instabilities.\n\n4 Experiments and Results\n\nExperiments. We compared the Cholesky-CMA-ES with other CMA-ES variants.^3 The reference CMA-ES implementation uses a delay strategy in which the matrix square root is computed every max{1, 1/(10d(c_1 + c_μ))} iterations [Hansen, 2015], which equals one for the dimensions considered in our experiments. We call this variant CMA-ES-Ref. As an alternative, we experimented with delaying the update for d steps. We refer to this variant as CMA-ES/d. We also adapted the non-triangular Cholesky factor approach by Suttorp et al. [2009] to the state-of-the-art implementation of the CMA-ES. We refer to the resulting algorithm as Suttorp-CMA-ES.\n\n^2 Actually, the complexity is related to the complexity of multiplying two μ × d matrices. We assume a naïve implementation of matrix multiplication. With a faster multiplication algorithm, the complexity can be reduced accordingly.\n^3 We added our algorithm to the open-source machine learning library Shark [Igel et al., 2008] and used LAPACK for high efficiency.\n\n[Figure 1: Function evaluations required to reach f(x) < 10^{−14} over problem dimensionality d (medians of 100 trials); panels: (a) Sphere, (b) Cigar, (c) Discus, (d) Ellipsoid, (e) Rosenbrock, (f) DiffPowers. The graphs for CMA-ES-Ref and Cholesky-CMA-ES overlap.]\n\n[Figure 2: Runtime in seconds over problem dimensionality d (medians of 100 trials); same panels as Figure 1. Note the logarithmic scaling on both axes.]\n\nTable 1: Benchmark functions used in the experiments (additionally, a rotation matrix B transforms the variables, x ↦ Bx)\nSphere: f(x) = ‖x‖^2\nRosenbrock: f(x) = Σ_{i=0}^{d−2} ( 100(x_{i+1} − x_i^2)^2 + (1 − x_i)^2 )\nDiscus: f(x) = x_0^2 + Σ_{i=1}^d 10^{−6} x_i^2\nCigar: f(x) = 10^{−6} x_0^2 + Σ_{i=1}^d x_i^2\nEllipsoid: f(x) = Σ_{i=0}^{d−1} 10^{−6i/(d−1)} x_i^2\nDifferent Powers: f(x) = Σ_{i=0}^{d−1} |x_i|^{2 + 10i/(d−1)}\n\n[Figure 3: Function value evolution over time on the benchmark functions with d = 128; panels: (a) Sphere, (b) Cigar, (c) Discus, (d) DiffPowers, (e) Ellipsoid, (f) Rosenbrock. Shown are single runs, namely those with runtimes closest to the corresponding median runtimes.]\n\nWe considered standard benchmark functions for derivative-free optimization given in Table 1. 
Sphere is considered to show that on a spherical function the step-size adaptation does not behave differently; Cigar/Discus/Ellipsoid model functions with different convex shapes near the optimum; Rosenbrock tests learning a function with d − 1 bends, which lead to slowly converging covariance matrices in the optimization process; DiffPowers is an example of a function with arbitrarily bad conditioning.\n\nTo test rotation invariance, we applied a rotation matrix to the variables, x ↦ Bx, B ∈ SO(d, R). This is done for every benchmark function, and a rotation matrix was chosen randomly at the beginning of each trial. All starting points were drawn uniformly from [0, 1], except for Sphere, where we sampled from N(0, I). For each function, we varied d ∈ {4, 8, 16, ..., 256}. Due to the long running times, we ran CMA-ES-Ref only up to d = 128. For every choice of d, we ran 100 trials from different initial points and monitored the number of iterations and the wall-clock time needed to sample a point with a function value below 10^{−14}. For Rosenbrock we excluded the trials in which the algorithm did not converge to the global optimum. We further evaluated the algorithms on additional benchmark functions inspired by Stich and Müller [2012] and measured the change of rotation introduced by the Cholesky-CMA-ES at each iteration (E_t); see the supplementary material.\n\nResults. Figure 1 shows that CMA-ES-Ref and Cholesky-CMA-ES required the same number of function evaluations to reach a given objective value. The CMA-ES/d required slightly more evaluations, depending on the benchmark function. When considering the wall-clock runtime, the Cholesky-CMA-ES was significantly faster than the other algorithms. As expected from the theoretical analysis, the higher the dimensionality, the more pronounced the differences; see Figure 2 (note the logarithmic scales). 
For d = 64 the Cholesky-CMA-ES was already 20 times faster than the CMA-ES-Ref. The drastic differences in runtime become apparent when inspecting single trials. Note that for d = 256 the matrix size exceeded the L2 cache, which affected the performance of the Cholesky-CMA-ES and Suttorp-CMA-ES. Figure 3 plots the trials with runtimes closest to the corresponding median runtimes for d = 128.\n\n5 Conclusion\n\nCMA-ES is a ubiquitous algorithm for derivative-free optimization. The CMA-ES has proven to be a highly efficient direct policy search algorithm and to be a useful tool for model selection in supervised learning. We propose the Cholesky-CMA-ES, which can be regarded as an approximation of the original CMA-ES. We gave theoretical arguments for why our approximation, which only affects the global step-size adaptation, does not impair performance. The Cholesky-CMA-ES achieves a better, asymptotically optimal time complexity of O(μ d^2) for the covariance update and optimal memory complexity. It allows for numerically stable computation of the inverse of the Cholesky factor in quadratic time and provides the eigenvalues of the covariance matrix without additional costs. We empirically compared the Cholesky-CMA-ES to the state-of-the-art CMA-ES with delayed covariance matrix decomposition. Our experiments demonstrated a significant increase in optimization speed. As expected, the Cholesky-CMA-ES needed the same number of objective function evaluations as the standard CMA-ES, but required much less wall-clock time, and this speed-up increases with the search space dimensionality. Still, our algorithm scales quadratically with the problem dimensionality. If the dimensionality gets so large that maintaining a full covariance matrix becomes computationally infeasible, one has to resort to low-dimensional approximations [e.g., Loshchilov, 2015], which, however, bear the risk of a significant drop in optimization performance. 
Thus, we advocate our new Cholesky-CMA-ES for scaling up CMA-ES to large optimization problems for which updating and storing the covariance matrix is still possible, for example, for training neural networks in direct policy search.\n\nAcknowledgement. We acknowledge support from the Innovation Fund Denmark through the projects “Personalized breast cancer screening” (OK, CI) and “Cyber Fraud Detection Using Advanced Machine Learning Techniques” (DRA, CI).\n\nReferences\n\nY. Akimoto, Y. Nagata, I. Ono, and S. Kobayashi. Theoretical foundation for CMA-ES from information geometry perspective. Algorithmica, 64(4):698–716, 2012.\n\nY. Akimoto, A. Auger, and N. Hansen. Comparison-based natural gradient optimization in high dimension. In Proceedings of the 16th Annual Genetic and Evolutionary Computation Conference (GECCO), pages 373–380. ACM, 2014.\n\nA. Auger. Analysis of Comparison-based Stochastic Continuous Black-Box Optimization Algorithms. Habilitation thesis, Faculté des Sciences d'Orsay, Université Paris-Sud, 2015.\n\nH.-G. Beyer. Evolution strategies. Scholarpedia, 2(8):1965, 2007.\n\nH.-G. Beyer. Convergence analysis of evolutionary algorithms that are based on the paradigm of information geometry. Evolutionary Computation, 22(4):679–709, 2014.\n\nK. Bringmann, T. Friedrich, C. Igel, and T. Voß. Speeding up many-objective optimization by Monte Carlo approximations. Artificial Intelligence, 204:22–29, 2013.\n\nA. E. Eiben and J. Smith. From evolutionary computation to the evolution of things. Nature, 521:476–482, 2015.\n\nF. Gomez, J. Schmidhuber, and R. Miikkulainen. Accelerated neural evolution through cooperatively coevolved synapses. Journal of Machine Learning Research, 9:937–965, 2008.\n\nN. Hansen and A. Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Proceedings of IEEE International Conference on Evolutionary Computation (CEC 1996), pages 312–317. IEEE, 1996.\n\nN. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.\n\nN. Hansen. The CMA evolution strategy: A tutorial. Technical report, Inria Saclay – Île-de-France, Université Paris-Sud, LRI, 2015.\n\nV. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), pages 401–408, 2009a.\n\nV. Heidrich-Meisner and C. Igel. Neuroevolution strategies for episodic reinforcement learning. Journal of Algorithms, 64(4):152–168, 2009b.\n\nC. Igel, T. Glasmachers, and V. Heidrich-Meisner. Shark. Journal of Machine Learning Research, 9:993–996, 2008.\n\nC. Igel. Evolutionary kernel learning. In Encyclopedia of Machine Learning. Springer-Verlag, 2010.\n\nO. Krause and C. Igel. A more efficient rank-one covariance matrix update for evolution strategies. In Proceedings of the 2015 ACM Conference on Foundations of Genetic Algorithms (FOGA XIII), pages 129–136. ACM, 2015.\n\nI. Loshchilov. A computationally efficient limited memory CMA-ES for large scale optimization. In Proceedings of the 16th Annual Genetic and Evolutionary Computation Conference (GECCO), pages 397–404. ACM, 2014.\n\nI. Loshchilov. LM-CMA: An alternative to L-BFGS for large scale black-box optimization. Evolutionary Computation, 2015.\n\nM. N. Omidvar and X. Li. A comparative study of CMA-ES on large scale global optimisation. In AI 2010: Advances in Artificial Intelligence, volume 6464 of LNAI, pages 303–312. Springer, 2011.\n\nJ. Poland and A. Zell. 
Main vector adaptation: A CMA variant with linear time and space complexity.\nIn Proceedings of the 10th Annual Genetic and Evolutionary Computation Conference (GECCO),\npages 1050\u20131055. Morgan Kaufmann Publishers, 2001.\n\nR. Ros and N. Hansen. A simple modi\ufb01cation in CMA-ES achieving linear time and space complexity.\n\nIn Parallel Problem Solving from Nature (PPSN X), pages 296\u2013305. Springer, 2008.\n\nS. U. Stich and C. L. M\u00fcller. On spectral invariance of randomized Hessian and covariance matrix\nadaptation schemes. In Parallel Problem Solving from Nature (PPSN XII), pages 448\u2013457. Springer,\n2012.\n\nY. Sun, T. Schaul, F. Gomez, and J. Schmidhuber. A linear time natural evolution strategy for\nnon-separable functions. In 15th Annual Conference on Genetic and Evolutionary Computation\nConference Companion, pages 61\u201362. ACM, 2013.\n\nT. Suttorp, N. Hansen, and C. Igel. Ef\ufb01cient covariance matrix update for variable metric evolution\n\nstrategies. Machine Learning, 75(2):167\u2013197, 2009.\n\n9\n\n\f", "award": [], "sourceid": 234, "authors": [{"given_name": "Oswin", "family_name": "Krause", "institution": "University of Copenhagen"}, {"given_name": "D\u00eddac Rodr\u00edguez", "family_name": "Arbon\u00e8s", "institution": "University of Copenhagen"}, {"given_name": "Christian", "family_name": "Igel", "institution": "University of Copenhagen"}]}