{"title": "Scaling of Probability-Based Optimization Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 399, "page_last": 406, "abstract": null, "full_text": "Scaling of Probability-Based  Optimization \n\nAlgorithms \n\nDepartment of Computer Science  University of Manchester \n\nJ.  L.  Shapiro \n\nManchester,  M13  9PL U.K.  jls@cs.man.ac.uk \n\nAbstract \n\nPopulation-based Incremental Learning is  shown require very sen(cid:173)\nsitive scaling of its learning rate.  The learning rate must scale with \nthe system size in a  problem-dependent way.  This is  shown in two \nproblems:  the needle-in-a haystack, in which the learning rate must \nvanish exponentially in the  system size,  and in  a  smooth function \nin  which  the  learning rate must  vanish  like  the  square root of the \nsystem size.  Two methods are proposed for  removing this sensitiv(cid:173)\nity.  A learning dynamics which obeys detailed balance is shown to \ngive consistent performance over the entire range of learning rates. \nAn  analog  of mutation  is  shown  to  require  a  learning  rate  which \nscales as  the inverse system size, but is  problem independent. \n\n1 \n\nIntroduction \n\nThere has been much recent work using probability models to search in optimization \nproblems.  The probability model generates candidate solutions to the optimization \nproblem.  It is  updated  so  that the  solutions  generated should  improve  over time. \nUsually,  the  probability  model  is  a  parameterized  graphical  model,  and  updating \nthe model involves changing the parameters and possibly the structure of the model. \nThe general scheme works as  follows, \n\n\u2022  Initialize the model to some prior  (e.g.  
a  uniform distribution); \n\u2022  Repeat \n\n- Sampling step:  generate a  data set by sampling from the probability \n\nmodel; \n\n- Testing step:  test the data as  solutions to the problem; \n- Selection  step:  create  a  improved  data  set  by  selecting  the  better \n\nsolutions and removing the worse ones; \n\n- Learning  step:  create  a  new  probability  model  from  the  old  model \nand the improved data set  (e.g.  as a  mixture of the old model and the \nmost likely model given the improved data set); \n\n\u2022  until  (stopping criterion met) \n\nDifferent  algorithms  are  largely  distinguished  by  the  class  of  probability  models \nused.  For  reviews  of the  approach including  the  different  graphical  models  which \n\n\fhave been used,  see  [3, 6].  These algorithms have been called  Estimation of Distri(cid:173)\nbution Algorithms  (EDA);  I  will  use  that term here. \n\nEDAs  are  related  to  genetic  algorithms;  instead  of evolving  a  population,  a  gen(cid:173)\nerative  model  which  produces  the  population  at  each  generation  is  evolved.  A \nmotivation for  using EDAs instead of GAs  is  that is  that in EDAs the structure of \nthe  graphical  model  corresponds  to the form  of the  crossover operator  in  GAs  (in \nthe  sense  that  a  given  graph will  produce  data whose  probability  will  not  change \nmuch under a particular crossover operator).  If the EDA can learn the structure of \nthe  graph,  it  removes  the  need  to  set  the  crossover  operator by  hand  (but  see  [2] \nfor  evidence  against this). \n\nIn this paper, a very simple EDA is considered on very simple problems.  It is shown \nthat the algorithm is  extremely sensitive to the value of learning rate.  The learning \nrate must  vanish  with  the system  size  in  a  problem dependent  way,  and  for  some \nproblems it has  to vanish exponentially fast.  
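The general scheme above can be sketched in code. This is a minimal sketch, not the paper's algorithm: the probability model is the simplest possible choice (independent per-bit probabilities), and the function name, population size, truncation size, and mixing rate are all illustrative assumptions.

```python
import random

def eda_search(fitness, length, pop_size=40, select=8, mix=0.2, steps=150):
    """Minimal sketch of the generic EDA loop: sample, test, select, learn.

    The model here is a vector of independent Bernoulli parameters; richer
    graphical models follow the same loop with a different learning step.
    """
    # Initialize the model to a uniform prior.
    p = [0.5] * length
    best = None
    for _ in range(steps):
        # Sampling step: generate a data set from the model.
        pop = [[1 if random.random() < pi else 0 for pi in p]
               for _ in range(pop_size)]
        # Testing + selection steps: keep the better solutions.
        pop.sort(key=fitness, reverse=True)
        selected = pop[:select]
        if best is None or fitness(selected[0]) > fitness(best):
            best = selected[0]
        # Learning step: mix the old model with the most likely model
        # given the selected data (here, the per-bit frequencies).
        freqs = [sum(x[i] for x in selected) / select for i in range(length)]
        p = [(1 - mix) * pi + mix * fi for pi, fi in zip(p, freqs)]
    return best

# Usage: maximize the number of 1's (the unitation function of section 3.3).
# best = eda_search(sum, 20)
```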
Two corrective measures are considered: a new learning rule which obeys detailed balance in the space of parameters, and an operator analogous to mutation which has been proposed previously.\n\n2 The Standard PBIL Algorithm\n\nThe simplest example of an EDA is Population-based Incremental Learning (PBIL), which was introduced by Baluja [1]. PBIL uses a probability model which is a product of independent probabilities for each component of the binary search space. Let x_i denote the ith component of x, an L-component binary vector which is a state of the search space. The probability model is defined by the L-component vector of parameters \u03b3(t), where \u03b3_i(t) denotes the probability that x_i = 1 at time t.\n\nThe algorithm works as follows:\n\n\u2022 Initialize \u03b3_i(0) = 1/2 for all i;\n\u2022 Repeat\n  - Generate a population of N strings by sampling from the binomial distribution defined by \u03b3(t).\n  - Find the best string in the population, x*.\n  - Update the parameters \u03b3_i(t+1) = \u03b3_i(t) + \u03b1[x*_i \u2212 \u03b3_i(t)] for all i.\n\u2022 until (stopping criterion met)\n\nThe algorithm has only two parameters, the size of the population N and the learning parameter \u03b1.\n\n3 The sensitivity of PBIL to the learning rate\n\n3.1 PBIL on a flat landscape\n\nThe source of sensitivity of PBIL to the learning rate lies in its behavior on a flat landscape. In this case all vectors are equally fit, so the \"best\" vector x* is a random vector and its expected value is\n\n\u27e8x*_i\u27e9 = \u03b3_i(t)   (1)\n\n(where \u27e8\u00b7\u27e9 denotes the expectation operator). Thus, the parameters remain unchanged on average. In any individual run, however, the parameters converge rapidly to one of the corners of the hypercube. As the parameters deviate from 1/2, they will move towards a corner of the hypercube. Then the population generated will be biased towards that corner, which will move the parameters closer yet to that corner, etc. All of the corners of the hypercube are attractors which, although never reached, are increasingly attractive with increasing proximity. Let us call this phenomenon drift. (In population genetics, the term drift refers to the loss of genetic diversity due to finite population sampling. It is in analogy to this that the term is used here.)\n\nConsider the average distance between the parameters and 1/2,\n\nD(t) \u2261 (1/L) \u03a3_i (1/2 \u2212 \u03b3_i(t))\u00b2.   (2)\n\nSolving this reveals that on average D converges to 1/4 with a characteristic time\n\n\u03c4 = \u22121/log(1 \u2212 \u03b1\u00b2) \u2248 1/\u03b1\u00b2 for \u03b1 \u2192 0.   (3)\n\nThe rate of search on any other search space will have to compete with drift.\n\n3.2 PBIL and the needle-in-a-haystack problem\n\nAs a simple example of the interplay between drift and directed search, consider the so-called needle-in-a-haystack problem. Here the fitness of all strings is 0 except for one special string (the \"needle\") which has a fitness of 1. Assume it is the string of all 1's. It is shown here that PBIL will only find the needle if \u03b1 is exponentially small, and is inefficient at finding the needle when compared to random search.\n\nConsider the probability of finding the needle at time t, denoted \u03a9(t) = \u03a0_{i=1}^{L} \u03b3_i(t). Consider times shorter than T, where T is long enough that the needle may be found multiple times, but \u03b1\u00b2T \u2192 0 as L \u2192 \u221e. It will be shown for small \u03b1 that when the needle is not found (during drift), \u03a9 decreases by an amount \u03b1\u00b2L\u03a9/2, whereas when the needle is found, \u03a9 increases by the amount \u03b1L\u03a9. Since initially the former happens at a rate 2^L times greater than the latter, \u03b1 must be less than 2^{\u2212(L\u22121)} for the system to move towards the hypercube corner near the optimum, rather than towards a random corner.\n\nWhen the needle is not found, the mean of \u03a9(t) is invariant, \u27e8\u03a9(t+1)\u27e9 = \u03a9(t). However, this is misleading, because \u03a9 is not a self-averaging quantity; its mean is affected by exponentially unlikely events which have an exponentially big effect. A more robust measure of the size of \u03a9(t) is the exponentiated mean of the log of \u03a9(t). This will be denoted by [\u03a9] \u2261 exp\u27e8log \u03a9\u27e9. This is the appropriate measure of the central tendency of a distribution which is approximately log-normal [4], as is expected of \u03a9(t) early in the dynamics, since the log of \u03a9 is the sum of approximately independent quantities.\n\nThe recursion for \u03a9 expanded to second order in \u03b1 obeys\n\n[\u03a9(t+1)] = [\u03a9(t)][1 \u2212 (1/2)\u03b1\u00b2L],  needle not found;\n[\u03a9(t+1)] = [\u03a9(t)][1 + \u03b1L + (1/2)\u03b1\u00b2L(L\u22121)],  needle found.   (4)\n\nIn these equations, \u03b3_i(t) has also been expanded around 1/2. Since the needle will be found with probability \u03a9(t) and not found with probability 1 \u2212 \u03a9(t), the recursion averages to\n\n[\u03a9(t+1)] = [\u03a9(t)](1 \u2212 (1/2)\u03b1\u00b2L) + [\u03a9(t)]\u00b2[\u03b1L \u2212 (1/2)\u03b1\u00b2L(L+1)].   (5)\n\nThe second term actually averages to [\u03a9(t)]\u27e8\u03a9(t)\u27e9, but the difference between \u27e8\u03a9\u27e9 and [\u03a9] is of order \u03b1, and can be ignored.\n\nEquation (5) has a stable fixed point at 0 and an unstable fixed point at \u03b1/2 + O(\u03b1\u00b2L). If the initial value of \u03a9(0) is less than the unstable fixed point, \u03a9 will decay to zero. If \u03a9(0) is greater than the unstable fixed point, \u03a9 will grow. The initial value is \u03a9(0) = 2^{\u2212L}, so the condition for the likelihood of finding the needle to increase rather than decrease is \u03b1 < 2^{\u2212(L\u22121)}.
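The drift that competes with finding the needle can be seen directly in simulation. A minimal sketch (the function name and parameter values are illustrative, not from the paper): PBIL is run on a flat landscape, where the selected string is just a random sample from the model, and D(t) of equation (2) is returned.

```python
import random

def pbil_flat_drift(L=32, alpha=0.05, steps=4000):
    """Sketch of drift (equations 2-3): PBIL on a flat landscape.

    All strings are equally fit, so the 'best' string is a random sample
    from the model, yet the parameters still converge towards a corner of
    the hypercube.  Returns D, the mean squared distance of the parameters
    from 1/2, which approaches 1/4 on the timescale 1/alpha^2.
    """
    g = [0.5] * L
    for _ in range(steps):
        # On a flat landscape the selected string is a random sample.
        x = [1 if random.random() < gi else 0 for gi in g]
        g = [gi + alpha * (xi - gi) for gi, xi in zip(g, x)]
    return sum((0.5 - gi) ** 2 for gi in g) / L
```

With alpha = 0.05 the characteristic time 1/alpha^2 is 400 steps, so after 4000 steps D should be close to its asymptote of 1/4 even though no selection pressure was ever applied.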
\n\n1.1 ,-----~-~--~-~--,_________, \n\na \n\n120 \n\nFigure 1:  Simulations on PBIL on needle-in-a-haystack problem for  L  =  8,10,11,12 \n(respectively 0, +, *, 6). The algorithm is  run until no parameters are between 0.05 \nand 0.95, and averaged over 1000 runs.  Left:  Fitness of best population member at \nconvergence versus 0:.  The non-robustness of the algorithm is  clear;  as L  increases, \n0:  must  be  very  finely  set  to  a  very  small  value  to find  the  optimum.  Right:  As \nprevious, but with 0:  scaled by 2\u00a3.  The data approximately collapses, which shows \nthat as L  increases,  0:  must decrease like  2-\u00a3 to get the same performance. \n\nFigure  1  shows  simulations  of PBIL  on  the  needle-in-a-haystack  problem.  These \nconfirm the predictions made above, the optimum is  found only if 0:  is  smaller than \na  constant  times  2\u00a3.  The algorithm is  inefficient  because it  requires  such  small  0:; \nconvergence to the optimum scales like 4\u00a3.  This is  because the rate of convergence \nto the optimum goes  like  Do:,  both of which are 0(2-\u00a3). \n\n3.3  PBIL  and functions  of unitation \n\nOne might think that the needle-in-the-haystack problem is  hard in a  special  way, \nand results on this problem are not relevant to other problems.  This is  not be true, \nbecause even smooth functions  have fiat subspaces in high dimensions.  To see this, \nconsider any continuous, monotonic function of unit at ion u, where u  =  t L~ Xi , the \nnumber of 1 's in the vector.  Assume the the optimum occurs when all  components \nare  l. \nThe parameters 1 can be decomposed  into  components  parallel and perpendicular \nto  the  optimum.  Movement  along  the  perpendicular  direction  is  neutral,  Only \nmovement  towards  or  away  from  the  optimum  changes  the  fitness.  
The  random \nstrings generated at the start of the algorithm are almost entirely perpendicular to \nthe global optimum, projecting only an amount of order 1/..JL towards the optimum. \nThus, the situation is like that of the needle-in-a-haystack problem.  The perpendic(cid:173)\nular direction is fiat,  so there is  convergence towards an arbitrary hypercube corner \n\n\fwith a  drift rate, \n\nTJ..  '\" a? \n\nfrom  equation  (3).  Movement towards the global optimum occurs at a  rate, \n\na \n\nTil  '\" VL\u00b7 \n\n(6) \n\n(7) \n\nThus, a  must be small compared to l/VL for movement towards the global optimum \nto win. \n\nA  rough  argument  can  be  used  to  show  how  the  fitness  in  the  final  population \ndepends  on  a.  Making  use  of the  fact  that  when  N  random  variables  are  drawn \nfrom  a  Gaussian  distribution  with  mean  m  and  variance  u 2 ,  the  expected  largest \nvalue  drawn is  m + J2u 2 10g(N)  for  large  N  (see,  for  example,  [7]) , the  Gaussian \napproximation to the binomial distribution,  and approximating the expectation of \nthe square root as the square root of the expectation yields, \n(u(t + 1)) =  (u(t)) + aJ2 (v(t)) 10g(N), \n\n(8) \nwhere v(t)  is the variance in probability distribution,  v(t)  =  -b L i Ii (t)[l - li(t)]. \nAssuming that the convergence of the variance is  primarily due to the convergence \non the flat  subspace, this can be solved as, \n1 \n\nJlog(N) \n(u(oo))  ~  \"2  +  aV'iL  . \n\n(9) \n\nThe equation must break down when the fitness approaches one, which is  where the \nGaussian approximation to the binomial breaks down. \n\n0.9 \n\n0.8 \n\n0.7 \n\n0.6 \n\n~ \n\n~ 0.5 \n'\" \n\n0.4 \n\n0.3 \n\n0.2 \n\n0.1 \n\n0 \n0 \n\n0.9 \n\n0.8 \n\n0.7 \n\n~ 0 . 6 \nI \nu.. \nNO .5 \n\n0.4 \n\n0.3 \n\n0.2 \n\n0.2 \n\n0.4 \n\na \n\n0.6 \n\n0.8 \n\n20 \n\nFigure 2:  Simulations on PBIL on the unitation function for  L  =  16,32,64,128,256 \n(respectively  D , 0, +, *, 6) .  
The algorithm is  run until all  parameters are  closer to \n1  or  0  than  0.05,  and  averaged  over  100  runs.  Left:  Fitness  of best  population \nmember at convergence versus a.  The fitness  is scaled so  that the global optimum \nhas  fitness  1  and  the  expected  fitness  of a  random  string  is  O.  As  L  increases,  a \nmust  be  set  to  a  decreasing  value  to find  the  optimum.  Right:  As  previous,  but \nwith  a  scaled  by  VL.  The  data approximately  collapses,  which  shows  that  as  L \nincreases, a  must decrease like VL to get the same performance.  The smooth curve \nshows  equation  (9). \n\nSimulations of PBIL on the unitation function confirm these predictions.  PBIL fails \nto  converge to the global optimum unless  a  is  small  compared to l/VL.  Figure  2 \nshows  the  scaling  of fitness  at  convergence  with  aVL,  and  compares  simulations \nwith equation  (9). \n\n\f4  Corrective 1  - Detailed Balance PBIL \n\nOne  view  of the  problem  is  that  it  is  due  to  the  fact  that  the  learning  dynamics \ndoes  not  obey  detailed  balance.  Even  on  a  flat  space,  the  rate  of  movement  of \nthe  parameters  \"Yi  away  from  1/2 is  greater  than  the  movement  back.  It  is  well(cid:173)\nknown that a  Markov process on variables  x  will  converge to a  desired equilibrium \ndistribution 7r(x)  if the transition probabilities obey the detailed balance conditions, \n(10) \nwhere w(x'lx) is the probability of generating x' from x.  Thus, any search algorithm \nsearching on a  flat  space should have dynamics  which  obeys, \n\nw(x'lx)7r(x)  =  w(xlx')7r(x'), \n\n(11) \nand  PEIL  does  not  obey  this.  Perhaps  the  sensitive  dependence  on  a  would  be \nremoved if it did. \n\nw(x'lx) = w(xlx'), \n\nThere is  a  difficulty in modifying the dynamics of PBIL to satisfy detailed balance, \nhowever.  
PEIL  visits  a  set  of points  which  varies  from  run  to  run,  and  (almost) \nnever  revisits  points.  This can be fixed  by  constraining the parameters to lie  on a \nlattice.  Then the dynamics  can be altered to enforce detailed balance. \n\nDefine  the  allowed  parameters  in  terms  of a  set  of integers  ni.  The  relationship \nbetween them is. \n\nI  - ~(1 - a)ni, \n!(1- a) lni l, \n\n\"Yi  = \n\n{\n\n2 ' \n\nni > 0; \nni < 0; \nni =  O. \n\n(12) \n\n(13) \n\nLearning  dynamics  now  consists  of incrementing  and  decrementing  the  n/s by  1; \nwhen xi  =  1(0)  ni  is  incremented  (decremented). \nTransforming variables via equation  (12),  the uniform distribution in \"Y  becomes in \nn, \n\nP (n)  = _a_(I_ a) lnl. \n\n2-a \n\n4.0.1  Detailed balance by  rejection sampling \n\nOne of the easiest methods for  sampling from  a  distribution is  to use  the rejection \nmethod.  In this, one has g(x'lx)  as  a  proposal distribution;  it is  the probability of \nproposing the  value  x'  from  x.  Then,  A(x'lx)  is  the  probability  of accepting  this \nchange.  Detailed balance condition becomes \n\ng(x'lx)A(x'lx)7r(x)  =  g(xlx')A(xlx')7r(x') . \nFor example,  the well-known  Metropolis-Hasting algorithm has \n\nA(x'lx) =  min (1, :~~}:(~}I~})' \n\nThe analogous equations for  PEIL on the lattice are, \n\nmm \n\n\"Y(n) \n\n.  [1- \"Y(n+l) \n\n] \nA(n + lin) \n(1  - a), 1 \nA(n-lln)  =  min[{~;(~~(1-a),I]. \n\n(14) \n\n(15) \n\n(16) \n\n(17) \n\nIn applying the acceptance formula, each component is treated independently.  Thus, \nmoves  can be accepted on some components and not on others. \n\n\f4.0.2  Results \n\nDetailed  Balance  PBIL  requires  no  special  tuning  of  parameters,  at  least  when \napplied to the two  problems of the opening sections.  
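One step of the lattice dynamics can be sketched as follows. This is a minimal sketch under stated assumptions: the function and variable names are illustrative, and on a non-flat landscape x_best would be the selected string rather than a random sample.

```python
import random

def db_pbil_step(n, x_best, alpha):
    """Sketch of one Detailed Balance PBIL update (equations 12, 16, 17).

    Each parameter gamma_i lives on a lattice indexed by an integer n_i.
    A proposed increment (when x*_i = 1) or decrement (when x*_i = 0) is
    accepted with the Metropolis-style probabilities of equations (16-17),
    so that on a flat landscape the chain leaves P(n) of equation (13)
    invariant instead of drifting to a hypercube corner.
    """
    def gamma(k):
        # Equation (12): the lattice of allowed parameter values.
        if k > 0:
            return 1.0 - 0.5 * (1.0 - alpha) ** k
        if k < 0:
            return 0.5 * (1.0 - alpha) ** (-k)
        return 0.5

    out = list(n)
    for i, xi in enumerate(x_best):
        if xi == 1:
            # Propose n_i -> n_i + 1, accepted per equation (16).
            a = min((1.0 - gamma(n[i] + 1)) / gamma(n[i]) * (1.0 - alpha), 1.0)
            if random.random() < a:
                out[i] += 1
        else:
            # Propose n_i -> n_i - 1, accepted per equation (17).
            a = min(gamma(n[i] - 1) / (1.0 - gamma(n[i])) * (1.0 - alpha), 1.0)
            if random.random() < a:
                out[i] -= 1
    return out
```

Each component is treated independently, so a single call may accept the move on some components and reject it on others, as the text describes.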
For the needle-in-a-haystack, simulations were performed for 100 values of \u03b1 between 0 and 0.4, equally spaced, for L = 8, 9, 10, 11, 12; 1000 trials of each, population size 20, with the same convergence criterion as before: the simulation halts when all \u03b3_i's are less than 0.05 or greater than 0.95. On none of those simulations did the algorithm fail to contain the global optimum in the final population.\n\nFor the function of unitation, Detailed Balance PBIL appears to always find the optimum if run long enough. Stopping it when all parameters fell outside the range (0.05, 0.95), the algorithm did not always find the global optimum. It produced an average fitness within 1% of the optimum for \u03b1 between 0.1 and 0.4 and L = 32, 64, 128, 256 over 100 trials, but for learning rates below 0.1 and L = 256 the average fitness fell as low as 4% below optimum. However, this is much improved over standard PBIL (see figure 2), where the average fitness fell to 60% below the optimum in that range.\n\n5 Corrective 2 - Probabilistic mutation\n\nAnother approach to control drift is to add an operator analogous to mutation in GAs. Mutation has the property that when repeatedly applied, it converges to a random data set. M\u00fchlenbein [5] has proposed that the analogous operator for EDAs estimates frequencies biased towards a random guess. Suppose f_i is the fraction of 1's at site i. Then, the appropriate estimate of the probability of a 1 at site i is\n\n\u03b3_i = (f_i + m)/(1 + 2m),   (18)\n\nwhere m is a mutation-like parameter. This will be recognized as the maximum posterior estimate of the binomial distribution using as the prior a \u03b2-distribution with both parameters equal to mN + 1; the prior biases the estimate towards 1/2.
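The biased frequency estimate of equation (18) can be sketched directly (a minimal sketch; the function name is illustrative):

```python
def biased_estimate(ones, total, m):
    """Sketch of equation (18): estimate of the probability of a 1 at a
    site, biased towards the random guess 1/2 by a mutation-like rate m.
    Equivalent to the MAP estimate of a binomial parameter under a
    symmetric Beta prior with both parameters equal to m*total + 1."""
    f = ones / total          # fraction of 1's observed at the site
    return (f + m) / (1 + 2 * m)
```

With m = 0 this returns the raw frequency; as m grows, the estimate is pulled towards 1/2. For example, a site that is all 1's in the sample (f = 1) is estimated at (1 + m)/(1 + 2m) < 1, so repeated application on a flat landscape cannot lock the parameter onto a hypercube corner.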
\nThis can be applied to PBIL  by using the following  learning rule, \n\n( \n\n1) \n\n\"Ii  t +  = \n\n\"Ii(t)  + ():  [x; - \"Ii (t)]  + m \n\n1 + 2m \n\n. \n\n(19) \n\nWith m  =  0  it gives  the usual PBIL  rule;  when  repeatedly applied on a  flat  space \nit converges to 1/2. \n\nUnlike  Detailed  Balance  PBIL,  this  approach  does  required  special  scaling  of the \nlearning rate, but the scaling is more benign than in standard PBIL and is problem \nindependent.  It is  determined  from  three  considerations.  First,  mutation  must \nbe  large  enough  to  counteract  the  effects  of drift  towards  random  corners  of the \nhypercube.  Thus, the fixed  point of the average distance to 1/2, (D(t + 1))  defined \nin equation  (2) , must be sufficiently close to zero.  Second,  mutation must be small \nenough  that it does  not  interfere  with  movement  towards the parameters near the \noptimum when the optimum is found.  Thus, the fixed point of equation (19) must be \nsufficiently close to 0 or 1.  Finally, a  sample of size N  sampled from the fixed  point \ndistribution near the hypercube corner containing the optimum should contain the \noptimum  with  a  reasonable  probability  (say  greater  than  1 - e- 1 ).  Putting these \nconsiderations together yields, \n\nlogN \n(): \n- - \u00bb  - \u00bb -. \n4 \n\nm \n(): \n\nL \n\n(20) \n\n\f5.1  Results \n\nTo  satisfy  the  conditions  in  equation  20,  the  mutation  rate  was  set  to  m  ex:  a 2 , \nand a  was  constrained to be smaller than log (N)/L.  For the needle-in-a-haystack, \nthe algorithm behaved like  Detailed Balance PElL. It never failed  to find  the opti(cid:173)\nmum for  the needle-in-a-haystack problems for  the  sizes  given  previously.  For  the \nfunctions  of unitation,  no  improvement over  standard PBIL is  expected,  since  the \nscaling  using  mutation  is  worse,  requiring  a  < 1/ L  rather than a  < 1/..fL.  
How(cid:173)\never,  with  tuning  of the  mutation  rate,  the  range of a's  with  which  the  optimum \nwas  always found  could be increased over standard PBIL. \n\n6  Conclusions \n\nThe  learning  rate  of  PBIL  has  to  be  very  small  for  the  algorithm  to  work,  and \nunpredictably so  as it depends upon the problem size in a  problem dependent  way. \nThis  was  shown  in  two  very  simple  examples.  Detailed  balance fixed  the  problem \ndramatically in the two cases studied.  Using detailed balance, the algorithm consis(cid:173)\ntently finds the optimum over the entire range of learning rates.  Mutation also fixed \nthe problem when the parameters were chosen to satisfy a  problem-independent set \nof inequalities. \n\nThe  phenomenon  studied  here  could  hold  in  any  EDA,  because  for  any  type  of \nmodel, the probability is  high of generating a  population which reinforces the move \njust  made.  On  the  other  hand,  more  complex  models  have  many  more  parame(cid:173)\nters,  and also  have more  sources of variability,  so  the issue may be less  important. \nIt would  be  interesting  to  learn  how  important  this  sensitivity  is  in  EDAs  using \ncomplex graphical models. \n\nOf the proposed correctives, detailed balance will  be more difficult  to generalize to \nmodels in which the structure is learned.  It requires an understanding of algorithm's \ndynamics  on  a  flat  space,  which  may  be  very  difficult  to  find  in those  cases.  The \nmutation-type  operator  will  easier  to  generalize,  because  it  only  requires  a  bias \ntowards a random distribution.  However, the appropriate setting of the parameters \nmay be difficult  to ascertain. \n\nReferences \n[1]  S.  Baluja.  Population-based incremental  learning:  A  method for  integrating  genetic \n\nsearch  based function  optimization  and competive learning.  
Technical  Report  CMU(cid:173)\nCS-94-163, Computer Science Department, Carnegie  Mellon  University, 1994. \n\n[2]  A.  Johnson and J.  L.  Shapiro.  The importance of selection mechanisms in  distribution \nestimation algorithms.  In Proceedings  of the  5th  International  Conference  on  Artificial \nEvolution  AE01,  2001. \n\n[3]  P.  Larraiiaga  and J.  A.  Lozano.  Estimation  of Distribution  Algorithms,  A  New  Tool \n\nfor  Evolutionary  Computation.  Kluwer  Academic Publishers,  2001. \n\n[4]  Eckhard  Limpert,  Werner  A.  Stahel,  and  Markus  Abbt.  Log-normal  distributions \n\nacross  the sciences:  Keys and clues.  BioScience,  51(5):341-352,  2001. \n\n[5]  H.  Miihlenbein.  The  equation  for  response  to  selection  and  its  use  for  prediction. \n\nEvolutionary  Computation,  5(3):303- 346,  1997. \n\n[6]  M.  Pelikan,  D.  E .  Goldberg,  and  F.  Lobo.  A  survey  of  optimization  by  building \n\nand  using  probabilistic  models.  Technical  report,  University  of  Illinois  at  Urbana(cid:173)\nChampaign, Illinois  Genetic  Algorithms Laboratory,  1999. \n\n[7]  Jonathan L. Shapiro and Adam Priigel-Bennett.  Maximum entropy analysis of genetic \n\nalgorithm operators.  Lecture  Notes  in  Computer Science, 993:14- 24,  1995. \n\n\f", "award": [], "sourceid": 2138, "authors": [{"given_name": "J.", "family_name": "Shapiro", "institution": null}]}