{"title": "Scaling of Probability-Based Optimization Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 399, "page_last": 406, "abstract": null, "full_text": "Scaling of Probability-Based Optimization Algorithms \n\nJ. L. Shapiro \n\nDepartment of Computer Science, University of Manchester \n\nManchester, M13 9PL, U.K. \n\njls@cs.man.ac.uk \n\nAbstract \n\nPopulation-based Incremental Learning is shown to require very sensitive scaling of its learning rate. The learning rate must scale with the system size in a problem-dependent way. This is shown in two problems: the needle-in-a-haystack, in which the learning rate must vanish exponentially in the system size, and a smooth function, in which the learning rate must vanish like the inverse square root of the system size. Two methods are proposed for removing this sensitivity. A learning dynamics which obeys detailed balance is shown to give consistent performance over the entire range of learning rates. An analog of mutation is shown to require a learning rate which scales as the inverse system size, but is problem-independent. \n\n1 Introduction \n\nThere has been much recent work using probability models to search in optimization problems. The probability model generates candidate solutions to the optimization problem. It is updated so that the solutions generated should improve over time. Usually, the probability model is a parameterized graphical model, and updating the model involves changing the parameters and possibly the structure of the model. The general scheme works as follows: \n\n\u2022 Initialize the model to some prior (e.g. 
a uniform distribution); \n\u2022 Repeat \n\n- Sampling step: generate a data set by sampling from the probability model; \n\n- Testing step: test the data as solutions to the problem; \n\n- Selection step: create an improved data set by selecting the better solutions and removing the worse ones; \n\n- Learning step: create a new probability model from the old model and the improved data set (e.g. as a mixture of the old model and the most likely model given the improved data set); \n\n\u2022 until (stopping criterion met) \n\nDifferent algorithms are largely distinguished by the class of probability models used. For reviews of the approach, including the different graphical models which have been used, see [3, 6]. These algorithms have been called Estimation of Distribution Algorithms (EDAs); I will use that term here. \n\nEDAs are related to genetic algorithms; instead of evolving a population, a generative model which produces the population at each generation is evolved. A motivation for using EDAs instead of GAs is that in EDAs the structure of the graphical model corresponds to the form of the crossover operator in GAs (in the sense that a given graph will produce data whose probability will not change much under a particular crossover operator). If the EDA can learn the structure of the graph, it removes the need to set the crossover operator by hand (but see [2] for evidence against this). \n\nIn this paper, a very simple EDA is considered on very simple problems. It is shown that the algorithm is extremely sensitive to the value of the learning rate. The learning rate must vanish with the system size in a problem-dependent way, and for some problems it has to vanish exponentially fast. Two corrective measures are considered: a new learning rule which obeys detailed balance in the space of parameters, and an operator analogous to mutation which has been proposed previously. 
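The general scheme above can be made concrete with a short sketch (an illustration only, assuming an independent-bit probability model, truncation selection, and a mixture update; the fitness function, population size, and mixing weight are hypothetical choices, not taken from any particular algorithm):

```python
import random

def eda(fitness, L, pop_size=20, keep=5, mix=0.5, steps=100):
    # Minimal EDA sketch: the model is a vector p of independent
    # bit probabilities, initialized to the uniform prior.
    p = [0.5] * L
    for _ in range(steps):
        # Sampling step: draw a population from the model
        pop = [[int(random.random() < p[i]) for i in range(L)]
               for _ in range(pop_size)]
        # Testing + selection steps: keep the best `keep` solutions
        pop.sort(key=fitness, reverse=True)
        best = pop[:keep]
        # Learning step: mix the old model with the maximum-likelihood
        # model of the selected set (the per-site 1-bit frequencies)
        for i in range(L):
            freq = sum(x[i] for x in best) / keep
            p[i] = (1 - mix) * p[i] + mix * freq
    return p

# Example: with sum as the fitness (counting 1-bits), the model
# probabilities should converge towards 1 at every site.
random.seed(0)
model = eda(sum, L=10)
```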
\n\n2 The Standard PBIL Algorithm \n\nThe simplest example of an EDA is Population-Based Incremental Learning (PBIL), which was introduced by Baluja [1]. PBIL uses a probability model which is a product of independent probabilities for each component of the binary search space. Let x_i denote the ith component of x, an L-component binary vector which is a state of the search space. The probability model is defined by the L-component vector of parameters γ(t), where γ_i(t) denotes the probability that x_i = 1 at time t. \n\nThe algorithm works as follows: \n\n\u2022 Initialize γ_i(0) = 1/2 for all i; \n\u2022 Repeat \n\n- Generate a population of N strings by sampling from the binomial distribution defined by γ(t). \n\n- Find the best string in the population, x*. \n\n- Update the parameters γ_i(t + 1) = γ_i(t) + α[x*_i - γ_i(t)] for all i. \n\n\u2022 until (stopping criterion met) \n\nThe algorithm has only two parameters: the size of the population N and the learning parameter α. \n\n3 The sensitivity of PBIL to the learning rate \n\n3.1 PBIL on a flat landscape \n\nThe source of sensitivity of PBIL to the learning rate lies in its behavior on a flat landscape. In this case all vectors are equally fit, so the \"best\" vector x* is a random vector and its expected value is \n\n⟨x*_i⟩ = γ_i(t) \n\n(1) \n\n(where ⟨·⟩ denotes the expectation operator). Thus, the parameters remain unchanged on average. In any individual run, however, the parameters converge rapidly to one of the corners of the hypercube. As the parameters deviate from 1/2, they will move towards a corner of the hypercube. Then the population generated will be biased towards that corner, which will move the parameters closer yet to that corner, and so on. All of the corners of the hypercube are attractors which, although never reached, are increasingly attractive with increasing proximity. Let us call this phenomenon drift. 
(In population genetics, the term drift refers to the loss of genetic diversity due to finite population sampling. It is in analogy to this that the term is used here.) \n\nConsider the average squared distance between the parameters and 1/2, \n\nD(t) ≡ (1/L) Σ_i (1/2 - γ_i(t))². \n\n(2) \n\nSolving this reveals that on average it converges to 1/4 with a characteristic time \n\nτ = -1/log(1 - α²) ≈ 1/α² for α → 0. \n\n(3) \n\nThe rate of search on any other search space will have to compete with drift. \n\n3.2 PBIL and the needle-in-a-haystack problem \n\nAs a simple example of the interplay between drift and directed search, consider the so-called needle-in-a-haystack problem. Here the fitness of all strings is 0 except for one special string (the \"needle\"), which has a fitness of 1. Assume it is the string of all 1's. It is shown here that PBIL will only find the needle if α is exponentially small, and is inefficient at finding the needle when compared to random search. \n\nConsider the probability of finding the needle at time t, denoted O(t) = ∏_{i=1}^{L} γ_i(t). Consider times shorter than T, where T is long enough that the needle may be found multiple times, but α²T → 0 as L → ∞. It will be shown for small α that when the needle is not found (during drift), O decreases by an amount α²LO/2, whereas when the needle is found, O increases by the amount αLO. Since initially the former happens at a rate 2^L times greater than the latter, α must be less than 2^{-(L-1)} for the system to move towards the hypercube corner near the optimum, rather than towards a random corner. \n\nWhen the needle is not found, the mean of O(t) is invariant, ⟨O(t + 1)⟩ = O(t). However, this is misleading, because O is not a self-averaging quantity; its mean is affected by exponentially unlikely events which have an exponentially big effect. A more robust measure of the size of O(t) is the exponentiated mean of the log of O(t). 
This will be denoted by [O] ≡ exp(⟨log O⟩). This is the appropriate measure of the central tendency of a distribution which is approximately log-normal [4], as is expected of O(t) early in the dynamics, since the log of O is the sum of approximately independent quantities. \n\nThe recursion for O expanded to second order in α obeys \n\n[O(t + 1)] = [O(t)][1 - (1/2)α²L] when the needle is not found; [O(t + 1)] = [O(t)][1 + αL + (1/2)α²L(L - 1)] when the needle is found. \n\n(4) \n\nIn these equations, γ_i(t) has also been expanded around 1/2. Since the needle will be found with probability O(t) and not found with probability 1 - O(t), the recursion averages to \n\n[O(t + 1)] = [O(t)](1 - (1/2)α²L) + [O(t)]² [αL - (1/2)α²L(L + 1)]. \n\n(5) \n\nThe second term actually averages to [O(t)]⟨O(t)⟩, but the difference between ⟨O⟩ and [O] is of order α, and can be ignored. \n\nEquation (5) has a stable fixed point at 0 and an unstable fixed point at α/2 + O(α²L). If the initial value O(0) is less than the unstable fixed point, O will decay to zero. If O(0) is greater than the unstable fixed point, O will grow. The initial value is O(0) = 2^{-L}, so the condition for the likelihood of finding the needle to increase rather than decrease is α < 2^{-(L-1)}. \n\nFigure 1: Simulations of PBIL on the needle-in-a-haystack problem for L = 8, 10, 11, 12 (respectively o, +, *, Δ). The algorithm is run until no parameters are between 0.05 and 0.95, and averaged over 1000 runs. Left: Fitness of the best population member at convergence versus α. The non-robustness of the algorithm is clear; as L increases, α must be very finely set to a very small value to find the optimum. Right: As previous, but with α scaled by 2^L. The data approximately collapse, which shows that as L increases, α must decrease like 2^{-L} to get the same performance. 
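The drift phenomenon analyzed above is easy to reproduce. The sketch below (a minimal illustration, not the paper's original code; the population size and learning rate are arbitrary choices) runs standard PBIL on the needle-in-a-haystack with α = 0.02, well above the 2^{-(L-1)} ≈ 0.008 threshold for L = 8, so the parameters almost always converge to a random corner rather than to the needle:

```python
import random

def pbil(fitness, L, N=20, alpha=0.02, max_steps=100000):
    # Standard PBIL: sample N strings, move the model towards the best one.
    gamma = [0.5] * L
    for _ in range(max_steps):
        pop = [[int(random.random() < gamma[i]) for i in range(L)]
               for _ in range(N)]
        best = max(pop, key=fitness)
        for i in range(L):
            gamma[i] += alpha * (best[i] - gamma[i])
        # convergence criterion used in the text: every parameter
        # is within 0.05 of a hypercube corner
        if all(g < 0.05 or g > 0.95 for g in gamma):
            break
    return gamma

def needle(x):
    # Needle-in-a-haystack: fitness 1 only for the all-ones string
    return 1 if all(x) else 0

random.seed(1)
g = pbil(needle, L=8)
```

On a flat landscape the best string is effectively a random sample from the model, which is exactly what makes every corner an attractor.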
\n\nFigure 1 shows simulations of PBIL on the needle-in-a-haystack problem. These confirm the predictions made above: the optimum is found only if α is smaller than a constant times 2^{-L}. The algorithm is inefficient because it requires such a small α; convergence to the optimum scales like 4^L. This is because the rate of convergence to the optimum goes like Oα, and both O and α are of order 2^{-L}. \n\n3.3 PBIL and functions of unitation \n\nOne might think that the needle-in-the-haystack problem is hard in a special way, and that results on this problem are not relevant to other problems. This is not true, because even smooth functions have flat subspaces in high dimensions. To see this, consider any continuous, monotonic function of unitation u, where u = (1/L) Σ_i x_i is the fraction of 1's in the vector. Assume the optimum occurs when all components are 1. \n\nThe parameters γ can be decomposed into components parallel and perpendicular to the optimum. Movement along the perpendicular direction is neutral; only movement towards or away from the optimum changes the fitness. The random strings generated at the start of the algorithm are almost entirely perpendicular to the global optimum, projecting only an amount of order 1/√L towards the optimum. Thus, the situation is like that of the needle-in-a-haystack problem. The perpendicular direction is flat, so there is convergence towards an arbitrary hypercube corner on a drift time scale \n\nτ_⊥ ~ 1/α², \n\n(6) \n\nfrom equation (3). Movement towards the global optimum occurs on a time scale \n\nτ_∥ ~ √L/α. \n\n(7) \n\nThus, α must be small compared to 1/√L for movement towards the global optimum to win (τ_∥ ≪ τ_⊥). \n\nA rough argument can be used to show how the fitness in the final population depends on α. 
Making use of the fact that when N random variables are drawn from a Gaussian distribution with mean m and variance σ², the expected largest value drawn is m + √(2σ² log(N)) for large N (see, for example, [7]), the Gaussian approximation to the binomial distribution, and approximating the expectation of the square root as the square root of the expectation yields \n\n⟨u(t + 1)⟩ = ⟨u(t)⟩ + α √(2⟨v(t)⟩ log(N)), \n\n(8) \n\nwhere v(t) is the variance in the probability distribution, v(t) = (1/L) Σ_i γ_i(t)[1 - γ_i(t)]. Assuming that the convergence of the variance is primarily due to the convergence on the flat subspace, this can be solved as \n\n⟨u(∞)⟩ ≈ 1/2 + √(log(N)) / (α√(2L)). \n\n(9) \n\nThe equation must break down when the fitness approaches one, which is where the Gaussian approximation to the binomial breaks down. \n\nFigure 2: Simulations of PBIL on the unitation function for L = 16, 32, 64, 128, 256 (respectively □, o, +, *, Δ). The algorithm is run until all parameters are closer to 1 or 0 than 0.05, and averaged over 100 runs. Left: Fitness of the best population member at convergence versus α. The fitness is scaled so that the global optimum has fitness 1 and the expected fitness of a random string is 0. As L increases, α must be set to a decreasing value to find the optimum. Right: As previous, but with α scaled by √L. The data approximately collapse, which shows that as L increases, α must decrease like 1/√L to get the same performance. The smooth curve shows equation (9). \n\nSimulations of PBIL on the unitation function confirm these predictions. PBIL fails to converge to the global optimum unless α is small compared to 1/√L. 
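The extreme-value fact used in this derivation, that the expected largest of N Gaussian draws is approximately m + √(2σ² log(N)), can be checked by simulation (a rough Monte Carlo sketch with arbitrary sample sizes; the formula is only the leading-order term, so the agreement is approximate):

```python
import math
import random

def expected_max_estimate(N, trials=2000):
    # Monte Carlo estimate of E[max of N iid standard Gaussians]
    total = 0.0
    for _ in range(trials):
        total += max(random.gauss(0.0, 1.0) for _ in range(N))
    return total / trials

N = 1000
approx = math.sqrt(2 * math.log(N))   # leading-order formula, m=0, sigma=1
random.seed(0)
mc = expected_max_estimate(N)
```

For N = 1000 the leading-order formula gives about 3.72, somewhat above the Monte Carlo value, as expected for a first-order asymptotic.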
Figure 2 shows the scaling of fitness at convergence with α√L, and compares simulations with equation (9). \n\n4 Corrective 1 - Detailed Balance PBIL \n\nOne view of the problem is that it is due to the fact that the learning dynamics does not obey detailed balance. Even on a flat space, the rate of movement of the parameters γ_i away from 1/2 is greater than the movement back. It is well-known that a Markov process on variables x will converge to a desired equilibrium distribution π(x) if the transition probabilities obey the detailed balance conditions \n\nw(x'|x)π(x) = w(x|x')π(x'), \n\n(10) \n\nwhere w(x'|x) is the probability of generating x' from x. Thus, any search algorithm searching on a flat space should have dynamics which obeys \n\nw(x'|x) = w(x|x'), \n\n(11) \n\nand PBIL does not obey this. Perhaps the sensitive dependence on α would be removed if it did. \n\nThere is a difficulty in modifying the dynamics of PBIL to satisfy detailed balance, however. PBIL visits a set of points which varies from run to run, and (almost) never revisits points. This can be fixed by constraining the parameters to lie on a lattice. Then the dynamics can be altered to enforce detailed balance. \n\nDefine the allowed parameters in terms of a set of integers n_i. The relationship between them is \n\nγ_i = 1 - (1/2)(1 - α)^{n_i} for n_i > 0; γ_i = (1/2)(1 - α)^{|n_i|} for n_i < 0; γ_i = 1/2 for n_i = 0. \n\n(12) \n\nLearning dynamics now consists of incrementing and decrementing the n_i's by 1; when x*_i = 1 (0), n_i is incremented (decremented). Transforming variables via equation (12), the uniform distribution in γ becomes, in n, \n\nP(n) = (α/(2 - α)) (1 - α)^{|n|}. \n\n(13) \n\n4.0.1 Detailed balance by rejection sampling \n\nOne of the easiest methods for sampling from a distribution is to use the rejection method. In this, one has g(x'|x) as a proposal distribution; it is the probability of proposing the value x' from x. 
Then, A(x'|x) is the probability of accepting this change. The detailed balance condition becomes \n\ng(x'|x)A(x'|x)π(x) = g(x|x')A(x|x')π(x'). \n\n(14) \n\nFor example, the well-known Metropolis-Hastings algorithm has \n\nA(x'|x) = min(1, g(x|x')π(x') / (g(x'|x)π(x))). \n\n(15) \n\nThe analogous equations for PBIL on the lattice are \n\nA(n + 1|n) = min[ ((1 - γ(n + 1))/γ(n)) (1 - α), 1 ] \n\n(16) \n\nA(n - 1|n) = min[ (γ(n - 1)/(1 - γ(n))) (1 - α), 1 ]. \n\n(17) \n\nIn applying the acceptance formula, each component is treated independently. Thus, moves can be accepted on some components and not on others. \n\n4.0.2 Results \n\nDetailed Balance PBIL requires no special tuning of parameters, at least when applied to the two problems of the opening sections. For the needle-in-a-haystack, simulations were performed for 100 values of α between 0 and 0.4, equally spaced, for L = 8, 9, 10, 11, 12; 1000 trials of each, population size 20, with the same convergence criterion as before: the simulation halts when all γ_i's are less than 0.05 or greater than 0.95. On none of those simulations did the algorithm fail to contain the global optimum in the final population. \n\nFor the function of unitation, Detailed Balance PBIL appears to always find the optimum if run long enough. Stopping it when all parameters fell outside the range (0.05, 0.95), the algorithm did not always find the global optimum. It produced an average fitness within 1% of the optimum for α between 0.1 and 0.4 and L = 32, 64, 128, 256 over 100 trials, but for learning rates below 0.1 and L = 256 the average fitness fell as low as 4% below optimum. However, this is much improved over standard PBIL (see figure 2), where the average fitness fell to 60% below the optimum in that range. \n\n5 Corrective 2 - Probabilistic mutation \n\nAnother approach to control drift is to add an operator analogous to mutation in GAs. Mutation has the property that when repeatedly applied, it converges to a random data set. 
Mühlenbein [5] has proposed that the analogous operator for EDAs estimates frequencies biased towards a random guess. Suppose f_i is the fraction of 1's at site i. Then, the appropriate estimate of the probability of a 1 at site i is \n\nγ_i = (f_i + m)/(1 + 2m), \n\n(18) \n\nwhere m is a mutation-like parameter. This will be recognized as the maximum posterior estimate of the binomial distribution using as the prior a β-distribution with both parameters equal to mN + 1; the prior biases the estimate towards 1/2. This can be applied to PBIL by using the following learning rule, \n\nγ_i(t + 1) = (γ_i(t) + α[x*_i - γ_i(t)] + m)/(1 + 2m). \n\n(19) \n\nWith m = 0 it gives the usual PBIL rule; when repeatedly applied on a flat space it converges to 1/2. \n\nUnlike Detailed Balance PBIL, this approach does require special scaling of the learning rate, but the scaling is more benign than in standard PBIL and is problem-independent. It is determined from three considerations. First, mutation must be large enough to counteract the effects of drift towards random corners of the hypercube. Thus, the fixed point of the average distance to 1/2, ⟨D(t + 1)⟩ defined in equation (2), must be sufficiently close to zero. Second, mutation must be small enough that it does not interfere with movement towards the parameters near the optimum when the optimum is found. Thus, the fixed point of equation (19) must be sufficiently close to 0 or 1. Finally, a sample of size N sampled from the fixed point distribution near the hypercube corner containing the optimum should contain the optimum with a reasonable probability (say greater than 1 - e^{-1}). Putting these considerations together yields \n\nlog(N)/L ≫ m/α ≫ α/4. \n\n(20) \n\n5.1 Results \n\nTo satisfy the conditions in equation (20), the mutation rate was set to m ∝ α², and α was constrained to be smaller than log(N)/L. 
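The learning rule of equation (19) is straightforward to implement. The sketch below (illustrative parameter values only) applies it on a flat landscape, where the best string is just a random sample from the model, and measures the distance D of equation (2); with m ∝ α² the parameters hover near 1/2 instead of drifting to a corner:

```python
import random

def pbil_mutation_step(gamma, best, alpha, m):
    # One update of the mutation-corrected PBIL rule, equation (19):
    #   gamma_i <- (gamma_i + alpha*(x*_i - gamma_i) + m) / (1 + 2m)
    return [(g + alpha * (b - g) + m) / (1 + 2 * m)
            for g, b in zip(gamma, best)]

random.seed(0)
L, alpha = 20, 0.1
m = alpha ** 2          # m proportional to alpha squared, as in the text
gamma = [0.5] * L
for _ in range(5000):
    # flat landscape: the best string is simply a sample from the model
    best = [int(random.random() < g) for g in gamma]
    gamma = pbil_mutation_step(gamma, best, alpha, m)

# D of equation (2): mean squared distance of the parameters from 1/2
D = sum((0.5 - g) ** 2 for g in gamma) / L
```

Without the m term the same loop drives D towards its maximum of 1/4; with it, the parameters fluctuate around 1/2 and D stays small.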
For the needle-in-a-haystack, the algorithm behaved like Detailed Balance PBIL. It never failed to find the optimum for the needle-in-a-haystack problems for the sizes given previously. For the functions of unitation, no improvement over standard PBIL is expected, since the scaling using mutation is worse, requiring α < 1/L rather than α < 1/√L. However, with tuning of the mutation rate, the range of α's with which the optimum was always found could be increased over standard PBIL. \n\n6 Conclusions \n\nThe learning rate of PBIL has to be very small for the algorithm to work, and unpredictably so, as it depends upon the problem size in a problem-dependent way. This was shown in two very simple examples. Detailed balance fixed the problem dramatically in the two cases studied. Using detailed balance, the algorithm consistently finds the optimum over the entire range of learning rates. Mutation also fixed the problem when the parameters were chosen to satisfy a problem-independent set of inequalities. \n\nThe phenomenon studied here could hold in any EDA, because for any type of model, the probability is high of generating a population which reinforces the move just made. On the other hand, more complex models have many more parameters, and also have more sources of variability, so the issue may be less important. It would be interesting to learn how important this sensitivity is in EDAs using complex graphical models. \n\nOf the proposed correctives, detailed balance will be more difficult to generalize to models in which the structure is learned. It requires an understanding of the algorithm's dynamics on a flat space, which may be very difficult to find in those cases. The mutation-type operator will be easier to generalize, because it only requires a bias towards a random distribution. However, the appropriate setting of the parameters may be difficult to ascertain. \n\nReferences \n\n[1] S. 
Baluja. Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Technical Report CMU-CS-94-163, Computer Science Department, Carnegie Mellon University, 1994. \n\n[2] A. Johnson and J. L. Shapiro. The importance of selection mechanisms in distribution estimation algorithms. In Proceedings of the 5th International Conference on Artificial Evolution AE01, 2001. \n\n[3] P. Larrañaga and J. A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, 2001. \n\n[4] Eckhard Limpert, Werner A. Stahel, and Markus Abbt. Log-normal distributions across the sciences: Keys and clues. BioScience, 51(5):341-352, 2001. \n\n[5] H. Mühlenbein. The equation for response to selection and its use for prediction. Evolutionary Computation, 5(3):303-346, 1997. \n\n[6] M. Pelikan, D. E. Goldberg, and F. Lobo. A survey of optimization by building and using probabilistic models. Technical report, University of Illinois at Urbana-Champaign, Illinois Genetic Algorithms Laboratory, 1999. \n\n[7] Jonathan L. Shapiro and Adam Prügel-Bennett. Maximum entropy analysis of genetic algorithm operators. Lecture Notes in Computer Science, 993:14-24, 1995. \n", "award": [], "sourceid": 2138, "authors": [{"given_name": "J.", "family_name": "Shapiro", "institution": null}]}