{"title": "Learning to Optimize in Swarms", "book": "Advances in Neural Information Processing Systems", "page_first": 15044, "page_last": 15054, "abstract": "Learning to optimize has emerged as a powerful framework for various optimization and machine learning tasks. Current such \"meta-optimizers\" often learn in the space of continuous optimization algorithms that are point-based and uncertainty-unaware. To overcome these limitations, we propose a meta-optimizer that learns in the algorithmic space of both point-based and population-based optimization algorithms. The meta-optimizer targets a meta-loss function consisting of both cumulative regret and entropy. Specifically, we learn and interpret the update formula through a population of LSTMs embedded with sample- and feature-level attentions. Meanwhile, we estimate the posterior directly over the global optimum and use an uncertainty measure to help guide the learning process. Empirical results over non-convex test functions and the protein-docking application demonstrate that this new meta-optimizer outperforms existing competitors. The code is publicly available at: https://github.com/Shen-Lab/LOIS", "full_text": "Learning to Optimize in Swarms

Yue Cao, Tianlong Chen, Zhangyang Wang, Yang Shen

Departments of Electrical and Computer Engineering & Computer Science and Engineering

Texas A&M University, College Station, TX 77840

{cyppsp,wiwjp619,atlaswang,yshen}@tamu.edu

Abstract

Learning to optimize has emerged as a powerful framework for various optimization and machine learning tasks. Current such "meta-optimizers" often learn in the space of continuous optimization algorithms that are point-based and uncertainty-unaware. To overcome these limitations, we propose a meta-optimizer that learns in the algorithmic space of both point-based and population-based optimization algorithms. 
The meta-optimizer targets a meta-loss function consisting of both cumulative regret and entropy. Specifically, we learn and interpret the update formula through a population of LSTMs embedded with sample- and feature-level attentions. Meanwhile, we estimate the posterior directly over the global optimum and use an uncertainty measure to help guide the learning process. Empirical results over non-convex test functions and the protein-docking application demonstrate that this new meta-optimizer outperforms existing competitors. The code is publicly available at: https://github.com/Shen-Lab/LOIS.

1 Introduction

Optimization provides a mathematical foundation for solving quantitative problems in many fields, along with numerical challenges. The no-free-lunch theorem indicates the non-existence of a universally best optimization algorithm for all objectives. To manually design an effective optimization algorithm for a given problem, much effort has been spent on tuning and validating pipelines, architectures, and hyperparameters. For instance, in deep learning, there is a gallery of gradient-based algorithms specific to high-dimensional, non-convex objective functions, such as Stochastic Gradient Descent [1], RMSProp [2], and Adam [3]. Another example is ab initio protein docking, whose objective energy functions have extremely rugged landscapes and are expensive to evaluate. Gradient-free algorithms are thus popular there, including Markov chain Monte Carlo (MCMC) [4] and Particle Swarm Optimization (PSO) [5].
To overcome the laborious manual design, the emerging approach of meta-learning (learning to learn) takes advantage of knowledge learned from related tasks. In meta-learning, the goal is to learn a meta-learner that can solve a set of problems, where each sample in the training or test set is a particular problem. 
As in classical machine learning, the fundamental assumption of meta-learning is its generalizability from solving the training problems to solving the test ones. For optimization problems, a key to meta-learning is how to efficiently utilize the information in the objective function and explore the space of optimization algorithms.
In this study, we introduce a novel meta-learning framework, where we train a meta-optimizer that learns in the space of both point-based and population-based optimization algorithms for continuous optimization. To that end, we design a novel architecture where a population of RNNs (specifically, LSTMs) jointly learn an iterative update formula for a population of samples (or a swarm of particles). To balance exploration and exploitation in the search, we directly estimate the posterior over the optimum and include the differential entropy of the posterior in the meta-loss function. Furthermore, we embed feature- and sample-level attentions in our meta-optimizer to interpret the learned optimization strategies. Our numerical experiments, including global optimization of non-convex test functions and an application to protein docking, endorse the superiority of the proposed meta-optimizer.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2 Related work

Meta-learning originated from the field of psychology [6, 7]. [8, 9, 10] optimized a learning rule in a parameterized learning-rule space. [11] used an RNN to automatically design a neural network architecture. More recently, learning to learn has also been applied to sparse coding [12, 13, 14, 15], plug-and-play optimization [16], and so on.
In the field of learning to optimize, [17] proposed the first framework, where gradients and function values were used as features for an RNN. 
A coordinate-wise RNN structure relieved the burden of an enormous number of parameters, so that the same update formula was used independently for each coordinate. [18] used the history of gradients and objective values as states and step vectors as actions in reinforcement learning. [19] also used an RNN to train a meta-learner to optimize black-box functions, including Gaussian process bandits, simple control objectives, and hyper-parameter tuning tasks. Lately, [20] introduced a hierarchical RNN architecture augmented with additional architectural features that mirror the known structure of optimization tasks.
The target applications of previous methods are mainly focused on training deep neural networks, except for [19], which focuses on optimizing black-box functions. These methods have three limitations. First, they learn in a limited algorithmic space, namely point-based optimization algorithms that may or may not use gradients (including SGD and Adam). So far, no method in learning to learn reflects population-based algorithms (such as evolutionary and swarm algorithms), which have proven powerful in many optimization tasks. Second, their learning is guided by a limited meta-loss, often the cumulative regret over the sampling history, which primarily drives exploitation. One exception is the expected improvement (EI) used by [19] under Gaussian processes. 
Last but not least, these methods do not interpret the process of learning the update formula, despite the previous usage of attention mechanisms in [20].
To overcome the aforementioned limitations of current learning-to-optimize methods, we present a new meta-optimizer with the following contributions:

• (Where to learn): We learn in an extended space of both point-based and population-based optimization algorithms;
• (How to learn): We incorporate the posterior into the meta-loss to guide the search in the algorithmic space and balance the exploitation-exploration trade-off;
• (What more to learn): We design a novel architecture where a population of LSTMs jointly learn iterative update formulae for a population of samples, and we embed sample- and feature-level attentions to explain the formulae.

3 Method

3.1 Notations and background

We use the following convention for notations throughout the paper. Scalars, vectors (column vectors unless stated otherwise), and matrices are denoted in lowercase, bold lowercase, and bold uppercase, respectively. Superscript ' indicates vector transpose.
Our goal is to solve the following optimization problem:

x* = arg min_{x ∈ R^n} f(x).

Iterative optimization algorithms, either point-based or population-based, share the same generic update formula:

x^{t+1} = x^t + δx^t,

where x^t and δx^t are the sample (a single sample is called a "particle" in swarm algorithms) and the update (a.k.a. step vector) at iteration t, respectively. The update is often a function g(·) of historic sample values, objective values, and gradients. For instance, in point-based gradient descent,

δx^t = g({x^τ, f(x^τ), ∇f(x^τ)}_{τ=1}^t) = −α ∇f(x^t),

where α is the learning rate. In particle swarm optimization (PSO), assuming that there are k samples (particles), the update for particle j is determined by the entire population:

δx_j^t = g({{x_j^τ, f(x_j^τ), ∇f(x_j^τ)}_{j=1}^k}_{τ=1}^t) = w δx_j^{t−1} + r_1 (x_j^{t*} − x_j^t) + r_2 (x^{t*} − x_j^t),

where x_j^{t*} and x^{t*} are the best positions (with the smallest objective values) of particle j and among all particles, respectively, during the first t iterations; and w, r_1, r_2 are hyper-parameters, often randomly sampled from a fixed distribution (e.g., the standard Gaussian distribution) at each iteration.
In most modern optimization algorithms, the update formula g(·) is analytically determined and fixed during the whole process. Unfortunately, similar to what the No Free Lunch Theorem suggests in machine learning, there is no single best algorithm for all kinds of optimization tasks. Every state-of-the-art algorithm has its own best-performing problem set or domain. Therefore, it makes sense to learn the optimal update formula g(·) from data in the specific problem domain, which is called "learning to optimize". For instance, in [17], the function g(·) is parameterized by a recurrent neural network (RNN) with inputs ∇f(x^t) and the hidden state from the last iteration: g(·) = RNN(∇f(x^t), h^{t−1}). In [19], the inputs of the RNN are x^t, f(x^t), and the hidden state from the last iteration: g(·) = RNN(x^t, f(x^t), h^{t−1}).

3.2 Population-based learning to optimize with posterior estimation

We describe the details of our population-based meta-optimizer in this section. Compared to previous meta-optimizers, we employ k samples whose update formulae are learned from the population history and are individually customized, using attention mechanisms. 
Specifically, our update rule for particle i can be written as:

g_i(·) = RNN_i(α_i^{inter}({α_j^{intra}({S_j^τ}_{τ=1}^t)}_{j=1}^k), h_i^{t−1}),

where S_j^t = (s_{j1}^t, s_{j2}^t, s_{j3}^t, s_{j4}^t) is an n × 4 feature matrix for particle j at iteration t, α_j^{intra}(·) is the intra-particle attention function for particle j, α_i^{inter}(·) is the i-th output of the inter-particle attention function, and h_i^{t−1} is the hidden state of the i-th LSTM at iteration t − 1.
For typical population-based algorithms, the same update formula is adopted by all particles. We follow this convention and set g_1(·) = g_2(·) = ... = g_k(·), which implies RNN_i = RNN and α_i^{intra}(·) = α^{intra}(·).
We will first introduce the feature matrix S_j^τ and then describe the intra- and inter-particle attention modules.

3.2.1 Features from different types of algorithms

Considering the expressiveness and the searchability of the algorithmic space, we consider the update formulae of both point- and population-based algorithms and choose the following four features for particle i at iteration t:

• gradient: ∇f(x_i^t);
• momentum: m_i^t = Σ_{τ=1}^t (1 − β) β^{t−τ} ∇f(x_i^τ);
• velocity: v_i^t = x_i^{t*} − x_i^t;
• attraction: Σ_j exp(−α d_{ij}^2)(x_j^t − x_i^t) / Σ_j exp(−α d_{ij}^2), over all j such that f(x_j^t) < f(x_i^t), where α is a hyperparameter and d_{ij} = ||x_i^t − x_j^t||_2.

These four features include two from point-based algorithms using gradients and two from population-based algorithms. Specifically, the first two are used in gradient descent and Adam. The third feature, velocity, comes from PSO, where x_i^{t*} is the best position (with the lowest objective value) of particle i in the first t iterations. 
The last feature, attraction, is from the Firefly algorithm [21]. The attraction toward particle i is the weighted average of x_j^t − x_i^t over all j such that f(x_j^t) < f(x_i^t), and the weight of particle j is the Gaussian similarity between particles i and j. For the particle with the smallest f(x_i^t), we simply set this feature vector to zero. In this paper, we use β = 0.9 and α = 1.
It is noteworthy that each feature vector is of dimension n × 1, where n is the dimension of the search space. Besides, the update formula in each base algorithm is linear w.r.t. its corresponding feature. To learn a better update formula, we incorporate these features into our model of deep neural networks, which is described next.

Figure 1: (a) The architecture of our meta-optimizer for one step. We have k particles here. For each particle, we use the gradient, momentum, velocity, and attraction as features. The features of each particle are sent into an intra-particle (feature-level) attention module, together with the hidden state of the previous LSTM. The outputs of the k intra-particle attention modules, together with a kernelized pairwise similarity matrix Q^t (yellow box in the figure), form the input of an inter-particle (sample-level) attention module. The role of the inter-particle attention module is to capture the cooperativeness of all particles in order to reweight features and send them into k LSTMs. The LSTMs' outputs, δx, are used for generating new samples. (b) The architectures of the intra- and inter-particle attention modules.

3.2.2 Overall model architecture

Fig. 1a depicts the overall architecture of our proposed model. We use a population of LSTMs and design two attention modules: feature-level ("intra-particle") and sample-level ("inter-particle") attentions. 
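The four features of Section 3.2.1 can be assembled into a per-particle n × 4 matrix as follows (an illustrative NumPy sketch; the function name, array layout, and the sign convention of the velocity feature are our assumptions, not taken from the released code):

```python
import numpy as np

def particle_features(x, grads_hist, fvals, pbest, alpha=1.0, beta=0.9):
    """Build the n x 4 feature matrix for each of k particles.
    x: (k, n) positions; grads_hist: list of (k, n) gradients over iterations;
    fvals: (k,) current objective values; pbest: (k, n) per-particle best positions."""
    k, n = x.shape
    grad = grads_hist[-1]
    # momentum: exponential moving average of past gradients
    mom = np.zeros((k, n))
    for g in grads_hist:
        mom = beta * mom + (1.0 - beta) * g
    vel = pbest - x  # velocity: pull toward each particle's own best position
    attr = np.zeros((k, n))  # attraction: Firefly-style pull toward better particles
    for i in range(k):
        better = fvals < fvals[i]
        if better.any():
            d2 = np.sum((x[better] - x[i]) ** 2, axis=1)
            w = np.exp(-alpha * d2)  # Gaussian similarity weights
            attr[i] = (w[:, None] * (x[better] - x[i])).sum(axis=0) / w.sum()
        # the best particle keeps a zero attraction vector, as in the paper
    return np.stack([grad, mom, vel, attr], axis=-1)  # shape (k, n, 4)
```

Each of the four columns is the quantity that enters linearly in its base algorithm's update, so a learned linear reweighting of these columns already spans gradient descent, momentum methods, PSO, and Firefly-style moves.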
For particle i at iteration t, the intra-particle attention module reweights each feature based on the context vector h_i^{t−1}, which is the hidden state from the i-th LSTM in the last iteration. The reweighted features of all particles are fed into an inter-particle attention module, together with a k × k distance similarity matrix. The inter-particle attention module learns information from the other k − 1 particles to affect the update of particle i. The outputs of the inter-particle attention module are then sent into k identical LSTMs for individual updates.

3.2.3 Attention mechanisms

For the intra-particle attention module, we use the idea from [22, 23, 24]. As shown in Fig. 1b, given that the j-th input feature of the i-th particle at iteration t is s_{ij}^t, we have:

b_{ij}^t = v_a' tanh(W_a s_{ij}^t + U_a h_i^{t−1}),

p_{ij}^t = exp(b_{ij}^t) / Σ_{r=1}^4 exp(b_{ir}^t),

where v_a ∈ R^n, W_a ∈ R^{n×n}, and U_a ∈ R^{n×n} are weight matrices, h_i^{t−1} ∈ R^n is the hidden state from the i-th LSTM at iteration t − 1, b_{ij}^t is the output of the fully-connected (FC) layer, and p_{ij}^t is the output after the softmax layer. We then use p_{ij}^t to reweight our input features:

c_i^t = Σ_{r=1}^4 p_{ir}^t s_{ir}^t,

where c_i^t ∈ R^n is the output of the intra-particle attention module for the i-th particle at iteration t.
For inter-particle attention, we model δx_i^t for each particle i under the impacts of the other k − 1 particles. Specific considerations are as follows.

• The closer two particles are, the more they impact each other's update. 
Therefore, we construct a kernelized pairwise similarity matrix Q^t ∈ R^{k×k} (column-normalized) as a weight matrix, whose elements are

q_{ij}^t = exp(−||x_i^t − x_j^t||^2 / (2l)) / Σ_{r=1}^k exp(−||x_r^t − x_j^t||^2 / (2l)).

• The more similar two particles are in their intra-particle attention outputs (c_i^t, local suggestions for updates), the more they impact each other's update. Therefore, we introduce another weight matrix M^t ∈ R^{k×k} whose elements are m_{ij}^t = exp((c_i^t)' c_j^t) / Σ_{r=1}^k exp((c_r^t)' c_j^t) (normalized after column-wise softmax).

As shown in Fig. 1b, the output of the inter-particle module for the j-th particle is:

e_j^t = γ Σ_{r=1}^k m_{rj}^t q_{rj}^t c_r^t + c_j^t,

where γ is a hyperparameter controlling the contribution of the other k − 1 particles to the j-th particle. In this paper, γ is set to 1 without further optimization.

3.2.4 Loss function, posterior estimation, and model training

Cumulative regret is a common meta-loss function: L(φ) = Σ_{t=1}^T Σ_{j=1}^k f(x_j^t). However, this loss function has two main drawbacks. First, it does not reflect any exploration: if the search algorithm used for training the optimizer does not employ exploration, it can easily be trapped in the vicinity of a local minimum. Second, for population-based methods, this loss function tends to drag all the particles to converge quickly to the same point.
To balance the exploration-exploitation tradeoff, we adopt the approach of [25], which built a Bayesian posterior distribution over the global optimum x* as p(x* | ∪_{t=1}^T D^t), where D^t denotes the samples collected at iteration t. 
We claim that, in order to reduce the uncertainty about the whereabouts of the global minimum, the best next sample can be chosen to minimize the entropy of the posterior, h(p(x* | ∪_{t=1}^T D^t)), where D^t = {(x_j^t, f(x_j^t))}_{j=1}^k. Therefore, we propose the following loss function for a function f(·):

ℓ_f(φ) = Σ_{t=1}^T Σ_{j=1}^k f(x_j^t) + λ h(p(x* | ∪_{t=1}^T D^t)),

where λ controls the balance between exploration and exploitation and φ is the vector of model parameters.
Following [25], the posterior over the global optimum is modeled as a Boltzmann distribution:

p(x* | ∪_{t=1}^T D^t) ∝ exp(−ρ f̂(x)),

where f̂(x) is a function estimator and ρ is the annealing constant. In the original work of [25], both f̂(x) and ρ are updated over iterations t for active sampling. In our work, they are fixed since the complete training sample paths are available at once.
Specifically, for a function estimator based on the samples in ∪_{t=1}^T D^t, we use a Kriging regressor [26], which is known to be the best linear unbiased estimator (BLUE):

f̂(x) = f_0(x) + (κ(x))' (K + ε^2 I)^{−1} (y − f_0),

where f_0(x) is the prior for E[f(x)] (we use f_0(x) = 0 in this study); κ(x) is the kernel vector whose i-th element is the kernel, a measure of similarity, between x and x_i; K is the kernel matrix whose (i, j)-th element is the kernel between x_i and x_j; y and f_0 are the vectors consisting of y_1, ..., y_{n_t} and f_0(x_1), ...
, f_0(x_{n_t}), respectively; and ε reflects the noise in the observations and is often estimated as the average training error (set at 2.1 in this study).
For ρ, we follow the annealing schedule in [25] with a one-step update:

ρ = ρ_0 · exp(((h_0)^{−1} h(p(x* | ∪_{t=1}^T D^t)))^{1/n}),

where ρ_0 is the initial value of ρ (ρ_0 = 1 without further optimization here); h_0 is the initial entropy of the posterior with ρ = ρ_0; and n is the dimensionality of the search space.
In total, our meta-loss for m functions f_q(·) (q = 1, ..., m) (analogous to m training examples) with L2 regularization is

L(φ) = (1/m) Σ_{q=1}^m ℓ_{f_q}(φ) + C ||φ||_2^2.

To train our model we use the optimizer Adam, which requires gradients. The first-order gradients are calculated numerically through TensorFlow, following [17]. We use a coordinate-wise LSTM to reduce the number of parameters. In our implementation, the length of the LSTM is set to 20. For all experiments, the optimizer is trained for 10,000 epochs with 100 iterations in each epoch.

4 Experiments

We test our meta-optimizer on convex quadratic functions, non-convex test functions, and an optimization-based application with extremely noisy and rugged landscapes: protein docking.

4.1 Learn to optimize convex quadratic functions

In this case, we minimize a convex quadratic function:

f(x) = ||A_q x − b_q||_2^2,

where A_q ∈ R^{n×n} and b_q ∈ R^{n×1} are parameters whose elements are sampled from i.i.d. normal distributions for the training set. We compare our algorithm with SGD, Adam, PSO, and DeepMind's LSTM (DM_LSTM) [17]. 
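The distribution of training problems described above can be sketched as follows (an illustrative NumPy sketch; `sample_quadratic` is our name, and the analytic gradient is included only because the meta-optimizer's features consume gradients):

```python
import numpy as np

def sample_quadratic(n, rng):
    """Draw one training problem f(x) = ||A x - b||^2 with i.i.d. normal entries."""
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    f = lambda x: float(np.sum((A @ x - b) ** 2))
    grad = lambda x: 2.0 * A.T @ (A @ x - b)  # analytic gradient of the quadratic
    return f, grad

rng = np.random.default_rng(0)
f, grad = sample_quadratic(10, rng)  # one of the i.i.d. training problems
```

A fresh (A_q, b_q) pair is drawn per training example, and, as in the paper's test protocol, held-out pairs are drawn the same way for evaluation.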
Since different algorithms have different population sizes, for a fair comparison we fix the total number of objective function evaluations (sample updates) at 1,000 for all methods. The population size k of our meta-optimizer and PSO is set to 4, 10, and 10 in the 2D, 10D, and 20D cases, respectively. During the testing stage, we sample another 128 pairs of A_q and b_q and evaluate the current best function value at each step, averaged over the 128 functions. We repeat the procedure 100 times in order to obtain statistically significant results.
As seen in Fig. 2, our meta-optimizer performs better than DM_LSTM in the 2D, 10D, and 20D cases. Both meta-optimizers perform significantly better than the three baseline algorithms (except that PSO shows similar convergence in 2D).

Figure 2: The performance of different algorithms for quadratic functions in (a) 2D, (b) 10D, and (c) 20D. The mean and the standard deviation over 100 runs are evaluated every 50 function evaluations.

We also compare our meta-optimizer's performance with and without the guiding posterior in the meta-loss. As shown in the supplemental Fig. S1, including the posterior improves optimization performance, especially in higher dimensions. Meanwhile, posterior estimation in higher dimensions presents more challenges. The impact of the posterior is further assessed in the ablation study.

4.2 Learn to optimize non-convex Rastrigin functions

We then test the performance on a non-convex test function, the Rastrigin function:

f(x) = Σ_{i=1}^n x_i^2 − Σ_{i=1}^n α cos(2πx_i) + αn,

where α = 10. 
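As a sanity check, the Rastrigin function above can be evaluated as follows (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def rastrigin(x, alpha=10.0):
    """Rastrigin test function: sum(x_i^2 - alpha*cos(2*pi*x_i)) + alpha*n.
    Global minimum f(0) = 0, with a highly multimodal landscape around it."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2 - alpha * np.cos(2.0 * np.pi * x)) + alpha * x.size)
```

The cosine term carves a local minimum near every integer lattice point, which is what makes this family a demanding benchmark for exploration.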
We consider a broad family of similar functions f_q(x) as the training set:

f_q(x) = ||A_q x − b_q||_2^2 − α c_q' cos(2πx) + αn,   (1)

where A_q ∈ R^{n×n}, b_q ∈ R^{n×1}, and c_q ∈ R^{n×1} are parameters whose elements are sampled from i.i.d. normal distributions, and cos(·) is applied element-wise. The Rastrigin function is a special case in this family with A = I, b = (0, 0, ..., 0)', c = (1, 1, ..., 1)'.
During the testing stage, 100 i.i.d. trajectories are generated in order to reach statistically significant conclusions. The population size k of our meta-optimizer and PSO is set to 4, 10, and 10 for 2D, 10D, and 20D, respectively. The results are shown in Fig. 3. In the 2D case, our meta-optimizer and PSO perform about the same, while DM_LSTM performs much worse. In the 10D and 20D cases, our meta-optimizer outperforms all other algorithms. Interestingly, PSO is the second best among all algorithms, which indicates that population-based algorithms have unique advantages here.

Figure 3: The performance of different algorithms for a Rastrigin function in (a) 2D, (b) 10D, and (c) 20D. The mean and the standard deviation over 100 runs are evaluated every 50 function evaluations.

4.3 Transferability: learning to optimize non-convex functions from convex optimization

We also examine the transferability from convex to non-convex optimization. The hyperparameter α in the Rastrigin family controls the level of ruggedness of the training functions: α = 0 corresponds to a convex quadratic function and α = 10 to the rugged Rastrigin function. Therefore, we choose three different values of α (0, 5, and 10) to build training sets and test the three resulting trained models on the 10D Rastrigin function. From the results in the supplemental Fig. S2, our meta-optimizer's performance improves when it is trained with increasing α. 
The meta-optimizer trained with α = 0 makes limited progress over iterations, which indicates the difficulty of learning from convex functions to optimize non-convex rugged functions. The one trained with α = 5 shows significant improvement.

4.4 Interpretation of the learned update formula

In an effort to rationalize the learned update formula, we choose the 2D Rastrigin test function for the interpretation analysis. We plot sample paths of our algorithm, PSO, and Gradient Descent (GD) in Fig. 4a. Our algorithm finally reaches the funnel (or valley) containing the global optimum (x = 0), while PSO finally reaches a suboptimal funnel. At the beginning, samples of our meta-optimizer are more diverse due to the entropy control in the loss function. In contrast, GD is stuck in a local minimum close to its starting point after 80 samples.
To further show which factor contributes the most to each update, we plot the feature weight distribution over the first 20 iterations. Since for particle i at iteration t the output of its intra-attention module is a weighted sum of its 4 features, c_i^t = Σ_{r=1}^4 p_{ir}^t s_{ir}^t, we sum p_{ir}^t for the r-th feature over all particles i. The resulting (normalized) weight distribution over the 4 features, reflecting the contribution of each feature at iteration t, is shown in Fig. 4b. In the first 6 iterations,

Figure 4: (a) Paths of the first 80 samples of our meta-optimizer, PSO, and GD for 2D Rastrigin functions. Darker shades indicate newer samples. (b) The feature attention distribution over the first 20 iterations for our meta-optimizer. 
(c) The percentage of the trace of γQ^t ⊙ M^t + I (reflecting the self-impact on updates) over iterations t.

the population-based features contribute to the update the most; point-based features start to play an important role later.
Finally, we examine, through the inter-particle attention module, the extent to which particles work collaboratively or independently. To show this, we plot the percentage of the diagonal part of γQ^t ⊙ M^t + I, i.e., tr(γQ^t ⊙ M^t + I) / Σ_{i,j} (γQ^t ⊙ M^t + I)_{ij} (⊙ denotes the element-wise product), as shown in Fig. 4c. At the beginning, particles work more collaboratively; with more iterations, they become more independent. However, we note that the trace (reflecting self-impacts) contributes 67%–69% over iterations while the off-diagonals (impacts from other particles) contribute over 30%, which demonstrates the importance of collaboration, a unique advantage of population-based algorithms.

4.5 Ablation study

How and why our algorithm outperforms DM_LSTM is both interesting and important for unveiling the underlying mechanism of the algorithm. To understand the contribution of each part of our algorithm in depth, we performed an ablation study to progressively show each part's contribution. 
Starting from the DM_LSTM baseline (B0), we incrementally crafted four algorithms: running DM_LSTM k times under different initializations and choosing the best solution (B1); using k independent particles, each with the two point-based features, the intra-particle attention module, and the hidden state (B2); adding the two population-based features and the inter-particle attention module to B2 so as to convert the k independent particles into a swarm (B3); and, eventually, adding the entropy term to the meta-loss of B3, resulting in our Proposed model.
We tested the five algorithms (B0–B3 and the Proposed) on 10D and 20D Rastrigin functions with the same settings as in Section 4.2. We compare the function minimum values returned by these algorithms in the table below (reported are means ± standard deviations over 100 runs, each using 1,000 function evaluations).

Dimension | B0         | B1         | B2         | B3        | Proposed
10        | 55.4±13.5  | 48.4±10.5  | 40.1±9.4   | 20.4±6.6  | 12.3±5.4
20        | 140.4±10.2 | 137.4±12.7 | 108.4±13.4 | 48.5±7.1  | 43.0±9.2

Our key observations are as follows. i) B1 vs. B0: their performance gap is marginal, which shows that our performance gain is not simply due to having k independent runs. ii) B2 vs. B1 and B3 vs. B2: whereas including intra-particle attention in B2 already notably improves the performance over B1, including the population-based features and inter-particle attention in B3 yields the largest performance boost. This confirms that our algorithm benefits chiefly from the attention mechanisms. iii) Proposed vs. B3: adding the entropy of the posterior yields a further gain, thanks to its balancing of exploration and exploitation during the search.

4.6 Application to protein docking

We bring our meta-optimizer into a challenging real-world application. 
In computational biology, structural knowledge about how proteins interact with each other is critical but remains relatively scarce [27]. Protein docking helps close this gap by computationally predicting the 3D structures of protein-protein complexes from the individual proteins' 3D structures or 1D sequences [28]. Ab initio protein docking represents a major challenge of optimizing a noisy and costly function in a high-dimensional conformational space [25].
Mathematically, ab initio protein docking can be formulated as optimizing an extremely rugged energy function f(x) = ΔG(x), the Gibbs binding free energy for conformation x. We calculate the energy function in a CHARMM 19 force field as in [5] and shift it so that f(x) = 0 at the origin of the search space. We parameterize the search space as R^12 as in [25]. The resulting f(x) is fully differentiable in the search space. For computational efficiency and batch training, we only consider 100 interface atoms. We choose a training set of 25 protein-protein complexes from the protein docking benchmark set 4.0 [29] (see Supp. Table S1 for the list), each of which has 5 starting points (the top-5 models from ZDOCK [30]). In total, our training set includes 125 instances. During testing, we choose 3 complexes (with 1 starting model each) at different levels of docking difficulty. For comparison, we also use the training set from Eq. 1 (n = 12). All methods, including PSO and both versions of our meta-optimizer, use k = 10 particles and 40 iterations in the testing stage.
As seen in Fig. 5, both meta-optimizers achieve lower-energy predictions than PSO, and the performance gains increase with the docking difficulty level. 
The meta-optimizer trained on other protein-docking cases performs similarly to the one trained on the Rastrigin family in the easy case and outperforms the latter in the difficult case.

(a) 1AY7_2

(b) 2HRK_1

(c) 2C0L_1

Figure 5: The performance of PSO, our meta-optimizer trained on the Rastrigin function family, and that trained on real energy functions for three levels of docking difficulty: (a) rigid (easy), (b) medium, and (c) flexible (difficult).

5 Conclusion

Designing a well-behaved optimization algorithm for a specific problem is a laborious task. In this paper, we extend point-based meta-optimizers into a population-based meta-optimizer, where update formulae for a sample population are jointly learned in the space of both point- and population-based algorithms. To balance exploitation and exploration, we introduce the entropy of the posterior over the global optimum into the meta loss, together with the cumulative regret, to guide the search of the meta-optimizer. We further embed intra- and inter-particle attention modules to interpret each update. We apply our meta-optimizer to quadratic functions, Rastrigin functions, and a real-world challenge: protein docking. The empirical results demonstrate that our meta-optimizer outperforms competing algorithms. An ablation study shows that the performance improvement is directly attributable to our algorithmic innovations, namely population-based features, intra- and inter-particle attentions, and the posterior-guided meta loss.

Acknowledgments

This work is in part supported by the National Institutes of Health (R35GM124952 to YS).
Part of the computing time is provided by Texas A&M High Performance Research Computing.

References

[1] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[2] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

[3] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[4] Jeffrey J. Gray, Stewart Moughon, Chu Wang, Ora Schueler-Furman, Brian Kuhlman, Carol A. Rohl, and David Baker. Protein–protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. Journal of Molecular Biology, 2003.

[5] Iain H. Moal and Paul A. Bates. SwarmDock and the use of normal modes in protein-protein docking. International Journal of Molecular Sciences, 11(10):3623–3648, September 2010.

[6] Lewis B Ward. Reminiscence and rote learning. Psychological Monographs, 49(4):i, 1937.

[7] Harry F Harlow. The formation of learning sets. Psychological Review, 56(1):51, 1949.

[8] Samy Bengio, Yoshua Bengio, and Jocelyn Cloutier. On the search for new learning rules for ANNs. Neural Processing Letters, 2(4):26–30, 1995.

[9] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule.
In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, page 969, July 1991.

[10] Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Preprints Conf. Optimality in Artificial and Biological Neural Networks, pages 6–8. Univ. of Texas, 1992.

[11] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

[12] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning, pages 399–406. Omnipress, 2010.

[13] Zhangyang Wang, Qing Ling, and Thomas S Huang. Learning deep ℓ0 encoders. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[14] Xiaohan Chen, Jialin Liu, Zhangyang Wang, and Wotao Yin. Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds. In Advances in Neural Information Processing Systems, pages 9061–9071, 2018.

[15] Jialin Liu, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. ALISTA: Analytic weights are as good as learned weights in LISTA. ICLR, 2019.

[16] Ernest K Ryu, Jialin Liu, Sicheng Wang, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. Plug-and-play methods provably converge with properly trained denoisers. ICML, 2019.

[17] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.

[18] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.

[19] Yutian Chen, Matthew W Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P Lillicrap, Matt Botvinick, and Nando de Freitas. Learning to learn without gradient descent by gradient descent.
In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 748–756. JMLR.org, 2017.

[20] Olga Wichrowska, Niru Maheswaranathan, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3751–3760. JMLR.org, 2017.

[21] Xin-She Yang. Firefly algorithms for multimodal optimization. In Proceedings of the 5th International Conference on Stochastic Algorithms: Foundations and Applications, SAGA'09, pages 169–178, Berlin, Heidelberg, 2009. Springer-Verlag.

[22] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[23] Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from orthogonality regularizations in training deep networks? In Advances in Neural Information Processing Systems, pages 4261–4271, 2018.

[24] Tianlong Chen, Shaojin Ding, Jingyi Xie, Ye Yuan, Wuyang Chen, Yang Yang, Zhou Ren, and Zhangyang Wang. ABD-Net: Attentive but diverse person re-identification. ICCV, 2019.

[25] Yue Cao and Yang Shen. Bayesian active learning for optimization and uncertainty quantification in protein docking. arXiv preprint arXiv:1902.00067, 2019.

[26] Jean-Paul Chilès and Pierre Delfiner. Geostatistics: Modeling Spatial Uncertainty, 2nd Edition. 2012.

[27] Roberto Mosca, Arnaud Céol, and Patrick Aloy. Interactome3D: adding structural details to protein networks. Nature Methods, 10(1):47, 2013.

[28] Graham R Smith and Michael JE Sternberg. Prediction of protein–protein interactions by docking methods. Current Opinion in Structural Biology, 12(1):28–35, 2002.

[29] Howook Hwang, Thom Vreven, Joël Janin, and Zhiping Weng.
Protein-Protein Docking Benchmark Version 4.0. Proteins, 78(15):3111–3114, November 2010.

[30] Brian G. Pierce, Kevin Wiehe, Howook Hwang, Bong-Hyun Kim, Thom Vreven, and Zhiping Weng. ZDOCK server: interactive docking prediction of protein–protein complexes and symmetric multimers. Bioinformatics, 30(12):1771–1773, 2014.