{"title": "Automatic Local Annealing", "book": "Advances in Neural Information Processing Systems", "page_first": 602, "page_last": 609, "abstract": null, "full_text": "602 \n\nAUTOMATIC LOCAL ANNEALING \n\nJared Leinbach \n\nDeparunent of Psychology \nCarnegie-Mellon University \n\nPittsburgh, PA 15213 \n\nABSTRACT \n\nThis research involves a method for finding global maxima \nin  constraint  satisfaction  networks. \nIt  is  an  annealing \nprocess  butt  unlike  most  otherst  requires  no  annealing \nschedule.  Temperature  is  instead  determined  locally  by \nunits at each updatet and thus all processing is done at the \nunit  level.  There  are  two  major  practical  benefits  to \nprocessing  this  way:  1)  processing  can continue  in  'bad t \nareas of the networkt while 'good t areas remain stablet and \n2)  processing  continues  in  the  'bad t  areast as  long  as  the \nconstraints  remain  poorly  satisfied  (i.e.  it  does  not  stop \nafter  some  predetermined  number of cycles).  As a  resultt \nthis  method  not  only  avoids  the  kludge  of requiring  an \nexternally determined annealing schedulet but it also finds \nglobal  maxima  more  quickly  and  consistently \nthan \nexternally  scheduled  systems \nthe \nto \nBoltzmann machine (Ackley et alt 1985) is made).  FinallYt \nimplementation of this method is computationally trivial. \n\n(a  comparison \n\nINTRODUCTION \n\nA  constraint  satisfaction  network,  is  a  network  whose  units  represent  hypotheses, \nbetween  which  there are  various  constraints.  These  constraints  are  represented by  bi(cid:173)\ndirectional connections between the units.  A positive connection weight suggests that if \none  hypothesis  is  accepted  or rejected,  the  other  one  should  be  also,  and  a  negative \nconnection weight suggests that if one hypothesis is accepted or rejected. the other one \nshould not be.  The relative importance of satisfying each constraint is indicated by the \nabsolute size of the corresponding weight.  The acceptance or rejection of a hypothesis is \nindicated by the activation of the corresponding unit  Thus every point in  the activation \nspace  corresponds  to  a  possible  solution  to  the  constraint  problem  represented  by  the \nnetwork.  The quality of any solution can be calculated by summing the  'satisfiedn~ss' of \nall the constraints.  The goal is to find a point in the activation space for which the quality \nis at a maximum. \n\n\fAutomatic Local Annealing \n\n603 \n\nUnfortunately, if units update dettnninistically (i.e. if they always move toward the state \nthat best satisfies their constraints) there is no means of avoiding local quality maxima in \nthe  activation  space.  This  is  simply  a  fimdamental  problem  of all  gradient  decent \nprocedures.  Annealing  systems  attempt  to  avoid  this  problem  by  always  giving  units \nsome probability of not moving towards the state Ihat best satisfaes their constraints.  This \nprobability  is  called  the  'temperature'  of the  network.  When  the  temperature  is high, \nsolutions are generally not good, but the network moves easily throughout the activation \nspace.  When  the  temperature  is  low,  the  network  is  committed  to  one  area  of the \nactivation space, but it is very good at improving its solution within that area.  Thus the \nannealing analogy is born.  
Unfortunately, if units update deterministically (i.e., if they always move toward the state that best satisfies their constraints), there is no means of avoiding local quality maxima in the activation space. This is simply a fundamental problem of all gradient descent procedures. Annealing systems attempt to avoid this problem by always giving units some probability of not moving toward the state that best satisfies their constraints. This probability is called the 'temperature' of the network. When the temperature is high, solutions are generally not good, but the network moves easily throughout the activation space. When the temperature is low, the network is committed to one area of the activation space, but it is very good at improving its solution within that area. Thus the annealing analogy is born. The notion is that if you start with the temperature high and lower it slowly enough, the network will gradually replace its 'state mobility' with 'state improvement ability', in such a way as to guide itself into a globally maximal state (much as the atoms in slowly annealed metals find optimal bonding structures).

To search for solutions this way requires some means of determining a temperature for the network at every update. Annealing systems simply use a predetermined schedule to provide this information. However, there are both practical and theoretical problems with this approach. The main practical problems are the following: 1) once an annealing schedule comes to an end, all processing is finished regardless of the quality of the current solution, and 2) temperature must be uniform across the network, even though different parts of the network may merit different temperatures (this is the case any time one part of the network is in a 'better' area of the activation space than another, which is a natural condition). The theoretical problem with this approach involves the selection of annealing schedules. In order to pick an appropriate schedule for a network, one must use some knowledge about what a good solution for that network is. Thus in order to get the system to find a solution, you must already know something about the solution you want it to find. The problem is that one of the most critical elements of the process, the way that the temperature is decreased, is handled by something other than the network itself. Thus the quality of the final solution must depend, at least in part, on that system's understanding of the problem.

By allowing each unit to control its own temperature during processing, Automatic Local Annealing avoids this serious kludge. In addition, by resolving the main practical problems, it also ends up finding global maxima more quickly and reliably than externally controlled systems.

MECHANICS

All units take on continuous activations between a uniform minimum and maximum value. There is also a uniform resting activation for all units (between the minimum and maximum). Units start at random activations, and are updated synchronously at each cycle in one of two possible ways. Either they are updated via any ordinary update rule for which a positive net input (as defined below) increases activation and a negative net input decreases activation, or they are simply reset to their resting activation. There is an update probability function that determines the probability of normal update for a unit based on its temperature (as defined below). It should be noted that once the net input for a unit has been calculated, finding its temperature is trivial (the quantity (a_i - rest) in the equation for goodness_i can come outside the summation).

Definitions:

    netinput_i = SUM_j (a_j - rest) x w_ij

    temperature_i = -goodness_i / maxposgdnss_i    if goodness_i >= 0
    temperature_i =  goodness_i / maxneggdnss_i    otherwise

    goodness_i = SUM_j (a_i - rest) x w_ij x (a_j - rest)

    maxposgdnss_i = the largest positive value that goodness_i could be
    maxneggdnss_i = the largest negative value that goodness_i could be

Maxposgdnss and maxneggdnss are constants that can be calculated once for each unit at the beginning of simulation. They depend only on the weights into the unit, and the constant maximum, minimum, and resting activation values. Temperature is always a value between 1 and -1, with 1 representing high temperature and -1 low.
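These definitions transcribe almost directly into code. The following minimal sketch assumes that the resting activation lies midway between the minimum and maximum activations (as in the simulations below), which makes the extreme goodness values easy to compute, and that every unit has at least one nonzero incoming weight.

import numpy as np

def unit_temperatures(a, W, rest=0.5, a_min=0.0, a_max=1.0):
    # W[i, j] is the (symmetric) weight between units i and j.
    d = a - rest
    netinput = W @ d            # netinput_i = SUM_j (a_j - rest) * w_ij
    goodness = d * netinput     # (a_i - rest) comes outside the summation

    # maxposgdnss_i / maxneggdnss_i: the extreme values goodness_i could
    # take given the weights into unit i and the activation bounds.
    # Assumption: rest is midway between a_min and a_max, so each term
    # (a_i - rest) * w_ij * (a_j - rest) ranges over +/- dev^2 * |w_ij|.
    dev = (a_max - a_min) / 2.0
    maxpos = dev * dev * np.abs(W).sum(axis=1)   # assumes no isolated units
    maxneg = -maxpos

    # temperature_i in [-1, 1]: -1 is cold (constraints well satisfied),
    # +1 is hot (constraints badly violated).
    return np.where(goodness >= 0, -goodness / maxpos, goodness / maxneg)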
SIMULATIONS

The parameters below were used in processing both of the networks that were tested. The first network processed (Figure 1a) has two local maxima that are extremely close to its two global maxima. This is a very 'difficult' network in the sense that the search for a global maximum must be extremely sensitive to the minute difference between the global maxima and the next-best local maxima. The other network processed (Figure 1b) has many local maxima, but none of them are especially close to the global maxima. This is an 'easy' network in the sense that the slow and cautious process that was used was not really necessary. A more appropriate set of parameters would have improved performance on this second network, but it was not used, in order to illustrate the relative generality of the algorithm.

Parameters:

    maximum activation = 1
    minimum activation = 0
    resting activation = 0.5

    normal update rule:

        delta activation_i = netinput_i x (maxactivation - activation_i) x k    if netinput_i >= 0
        delta activation_i = netinput_i x (activation_i - minactivation) x k    otherwise

        with k = 0.6

    update probability function:

    [Plot omitted: the update probability function used in the simulations, with the probability of normal update plotted against temperature; the temperature axis is marked at -1, -0.79, and 0.]

This function defines a process that moves slowly towards a global maximum, moves away from even good solutions easily, and 'freezes' units that are colder than -0.79.
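One full cycle of processing, combining the normal update rule with temperature-dependent resetting, can be sketched as follows (reusing unit_temperatures from the sketch above). The exact shape of the update probability function is only partly recoverable from the plot, so the piecewise-linear form below is an assumption that merely respects the stated properties: the probability of normal update rises as temperature falls, reaching 1 at and below the freezing point of -0.79.

import numpy as np

rng = np.random.default_rng(0)

def update_probability(temp, freeze=-0.79):
    # Assumed shape: probability of a normal update grows linearly as
    # temperature drops, and is 1 at or below the freezing point.  Only
    # these qualitative properties are specified, not this exact form.
    p = (1.0 - temp) / (1.0 - freeze)
    return np.clip(p, 0.0, 1.0)

def ala_cycle(a, W, rest=0.5, a_min=0.0, a_max=1.0, k=0.6):
    # One synchronous update of all units.
    d = a - rest
    netinput = W @ d
    temp = unit_temperatures(a, W, rest, a_min, a_max)

    # Normal update rule: move toward the nearer activation bound in
    # proportion to the net input and the remaining room to move.
    delta = np.where(netinput >= 0,
                     netinput * (a_max - a) * k,
                     netinput * (a - a_min) * k)

    # Each unit updates normally with a probability given by its own
    # temperature; otherwise it is reset to its resting activation.
    normal = rng.random(len(a)) < update_probability(temp)
    return np.where(normal, a + delta, rest)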
RESULTS

The results of running the Automatic Local Annealing process on these two networks (in comparison to a standard Boltzmann Machine's performance) are summarized in Figures 2a and 2b. With Automatic Local Annealing (ALA), the probability of having found a stable global maximum departs from zero fairly soon after processing begins, and increases smoothly up to one. The Boltzmann Machine, instead, makes little 'useful' progress until the end of the annealing schedule, and then quickly moves into a solution which may or may not be a global maximum. In order to get its reliability near that of ALA, the Boltzmann Machine's schedule must be so slow that solutions are found much more slowly than with ALA. Conversely, in order to start finding solutions as quickly as ALA, such a short schedule is necessary that the reliability becomes much worse than ALA's. Finally, if one makes a more reasonable comparison to the Boltzmann Machine (either by changing the parameters of the ALA process to maximize its performance on each network, or by using a single annealing schedule with the Boltzmann Machine for both networks), the overall performance advantage for ALA increases substantially.

[Network diagram omitted.]

Figure 1a. A 'Difficult' Network.

Global maxima are: 1) all eight upper units on, with the remaining units off, 2) all eight lower units on, with the remaining units off. Next-best local maxima are: 1) four upper left and four lower right units on, with the remaining units off, 2) four upper right and four lower left units on, with the remaining units off.

[Network diagram omitted.]

Figure 1b. An 'Easy' Network.

Necker cube network (McClelland & Rumelhart, 1988). Each set of four corresponding units is connected as shown. Connections for the other three such sets were omitted for clarity. The global maxima have all units in one cube on with all units in the other off.

[Plot omitted: probability of having found a stable global maximum vs. cycles of processing (0 to 250), for ALA and for a Boltzmann Machine with a 125-cycle schedule.]

Figure 2a. Performance On A 'Difficult' Network (Figure 1a).

[Plot omitted: probability of having found a stable global maximum vs. cycles of processing (0 to 100), for ALA and for Boltzmann Machines with schedules of 30, 20, and 10 cycles, among others.]

Figure 2b. Performance On An 'Easy' Network (Figure 1b).

(1) Each line is based on 100 trials. A stable global maximum is one that the network remained in for the rest of the trial.
(2) All annealing schedules were the best-performing three-leg schedules found.

DISCUSSION

HOW IT WORKS

The characteristics of the approach to a global maximum are determined by the shape of the update probability function. By modifying this shape, one can control such things as: how quickly/steadily the network moves towards a global maximum, how easily it moves away from local maxima, how good a solution must be in order for it to become completely stable, and so on. The only critical feature of the function is that as temperature decreases, the probability of normal update increases. In this way, the colder a unit gets, the more steadily it progresses towards an extreme activation value, and the hotter a unit gets, the more time it spends near resting activation. From this you get hot units that have little effect on movement in the activation space (since they contribute little to any unit's net input), and cold units that compete to control this critical movement.

The cold units 'cool' connected units that are in agreement with them, and 'heat' connected units that are in disagreement (see the temperature equation). As the connected agreeing units are cooled, they too begin to cool their connected agreeing units. In this way coldness spreads out, stabilizing sets of units whose hypotheses agree. This spreading is what makes the ALA algorithm work. A unit's decision about its hypothesis can now be felt by units that are only distantly connected, as must be the case if units are to act in accordance with any global criterion (e.g., the overall quality of the states of these networks).

In order to see why global maxima are found, one must consider the network as a whole. In general, the amount of time spent in any state is inversely related to the amount of heat in that state (since heat works against stability). The state(s) containing the least possible heat for a given network will be the most stable. These state(s) will also represent the global maxima (since they have the least total 'dissatisfaction' of constraints). Therefore, given infinite processing time, the most commonly visited states will be the global maxima. More importantly, the 'visitedness' of every state will be proportional to its overall quality (a mathematical description of this has not yet been developed).
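A toy driver, reusing the sketches above with an arbitrary random network, illustrates the typical course of processing: quality climbs unevenly at first, then mutually agreeing sets of units cool and lock in.

import numpy as np

# Illustrative only: a random symmetric network of 16 units with
# weights in {-1, 0, 1} and no self-connections.
n = 16
rng2 = np.random.default_rng(1)
W = rng2.choice([-1.0, 0.0, 1.0], size=(n, n))
W = np.triu(W, 1)
W = W + W.T

a = rng2.uniform(0.0, 1.0, size=n)   # units start at random activations
for cycle in range(500):
    a = ala_cycle(a, W)
    if cycle % 100 == 0:
        print(cycle, round(state_quality(a, W), 3))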
This latter characteristic provides good practical benefits when one employs a notion of solution satisficing. This is done by using an update probability function that allows units to 'freeze' (i.e., have normal update probabilities of 1) at temperatures higher than -1 (as was done with the simulations described above). In this condition, states can become completely stable without perfectly satisfying all constraints. As the time of simulation increases, the probability of being in any given state approaches a value proportional to its quality. Thus, if there are any states good enough to be frozen, the chances of not having hit one will decrease with time. The amount of time necessary to satisfice is directly related to the freezing point used. Times as small as 0 (for freezing points > 1) and as large as infinity (for freezing points < -1) can be achieved. This type of time/quality trade-off is extremely useful in many practical applications.

MEASURING PERFORMANCE

While ALA finds global maxima faster and more reliably than Boltzmann Machine annealing, these are not the only benefits to ALA processing. A number of other elements make it preferable to externally scheduled annealing processes: 1) Various solutions to subparts of problems are found and, at least temporarily, maintained during processing. If one considers constraint satisfaction networks in terms of schema processors, this corresponds nicely to the simultaneous processing of all levels of schemas and subschemas. Subschemas with obvious solutions get filled in quickly, even when the higher-level schemas have still not found real solutions. While these initial sub-solutions may not end up as part of the final solution, their appearance during processing can still be quite useful in some settings. 2) ALA is much more biologically feasible than externally scheduled systems. Not only can units function on their own (without the use of an intelligent external processor), but the paths traversed through the activation space (as described by the schema example above) also parallel human processing more closely. 3) ALA processing may lend itself to simple learning algorithms. During processing, units are always acting in close accord with the constraints that are present. At first, distant constraints are ignored in favor of more immediate ones, but regardless, the units rarely actually defy any constraints in the network. Thus basic approaches to making weight adjustments, such as continuously increasing weights between units that are in agreement about their hypotheses, and decreasing weights between units that are in disagreement about their hypotheses (Minsky & Papert, 1968), may have new power. This is an area of current research, which would represent an enormous time savings over Boltzmann Machine type learning (Ackley et al., 1985) if it were to be found feasible.
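A minimal sketch of one such weight-adjustment rule is given below. Only the direction of the adjustments is specified above, so the particular agreement measure (the product of the two units' deviations from rest) and the learning rate are illustrative assumptions.

import numpy as np

def hebbian_step(a, W, rest=0.5, lr=0.01):
    # Strengthen weights between units whose hypotheses agree (both
    # above or both below rest) and weaken weights between units that
    # disagree, in the spirit of Minsky & Papert (1968).
    d = a - rest
    agreement = np.outer(d, d)      # positive where units agree
    W = W + lr * agreement
    np.fill_diagonal(W, 0.0)        # keep units free of self-connections
    return W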
L.,  &  Rumelhart.  D.  E.  (1988).  Explorations  in  Parallel  Distributed \n\nProcessing.  Cambridge, MA: MIT Press. \n\nMinsky, M., & Papert, S. (1968).  Perceptrons.  Cambridge, MA: MIT Press. \n\n\f", "award": [], "sourceid": 136, "authors": [{"given_name": "Jared", "family_name": "Leinbach", "institution": null}]}