{"title": "Automatic Local Annealing", "book": "Advances in Neural Information Processing Systems", "page_first": 602, "page_last": 609, "abstract": null, "full_text": "602 \n\nAUTOMATIC LOCAL ANNEALING \n\nJared Leinbach \n\nDepartment of Psychology \nCarnegie-Mellon University \nPittsburgh, PA 15213 \n\nABSTRACT \n\nThis research involves a method for finding global maxima in constraint satisfaction networks. It is an annealing process but, unlike most others, requires no annealing schedule. Temperature is instead determined locally by units at each update, and thus all processing is done at the unit level. There are two major practical benefits to processing this way: 1) processing can continue in 'bad' areas of the network, while 'good' areas remain stable, and 2) processing continues in the 'bad' areas as long as the constraints remain poorly satisfied (i.e. it does not stop after some predetermined number of cycles). As a result, this method not only avoids the kludge of requiring an externally determined annealing schedule, but it also finds global maxima more quickly and consistently than externally scheduled systems (a comparison to the Boltzmann machine (Ackley et al., 1985) is made). Finally, implementation of this method is computationally trivial. \n\nINTRODUCTION \n\nA constraint satisfaction network is a network whose units represent hypotheses, between which there are various constraints. These constraints are represented by bidirectional connections between the units. A positive connection weight suggests that if one hypothesis is accepted or rejected, the other one should be also, and a negative connection weight suggests that if one hypothesis is accepted or rejected, the other one should not be. The relative importance of satisfying each constraint is indicated by the absolute size of the corresponding weight. 
The acceptance or rejection of a hypothesis is indicated by the activation of the corresponding unit. Thus every point in the activation space corresponds to a possible solution to the constraint problem represented by the network. The quality of any solution can be calculated by summing the 'satisfiedness' of all the constraints. The goal is to find a point in the activation space for which the quality is at a maximum. \n\nUnfortunately, if units update deterministically (i.e. if they always move toward the state that best satisfies their constraints) there is no means of avoiding local quality maxima in the activation space. This is simply a fundamental problem of all gradient descent procedures. Annealing systems attempt to avoid this problem by always giving units some probability of not moving towards the state that best satisfies their constraints. This probability is called the 'temperature' of the network. When the temperature is high, solutions are generally not good, but the network moves easily throughout the activation space. When the temperature is low, the network is committed to one area of the activation space, but it is very good at improving its solution within that area. Thus the annealing analogy is born. The notion is that if you start with the temperature high, and lower it slowly enough, the network will gradually replace its 'state mobility' with 'state improvement ability', in such a way as to guide itself into a globally maximal state (much as the atoms in slowly annealed metals find optimal bonding structures). \n\nTo search for solutions this way requires some means of determining a temperature for the network at every update. Annealing systems simply use a predetermined schedule to provide this information. However, there are both practical and theoretical problems with this approach. 
The main practical problems are the following: 1) once an annealing schedule comes to an end, all processing is finished regardless of the quality of the current solution, and 2) temperature must be uniform across the network, even though different parts of the network may merit different temperatures (this is the case any time one part of the network is in a 'better' area of the activation space than another, which is a natural condition). The theoretical problem with this approach involves the selection of annealing schedules. In order to pick an appropriate schedule for a network, one must use some knowledge about what a good solution for that network is. Thus in order to get the system to find a solution, you must already know something about the solution you want it to find. The problem is that one of the most critical elements of the process, the way that the temperature is decreased, is handled by something other than the network itself. Thus the quality of the final solution must depend, at least in part, on that system's understanding of the problem. \n\nBy allowing each unit to control its own temperature during processing, Automatic Local Annealing avoids this serious kludge. In addition, by resolving the main practical problems, it also ends up finding global maxima more quickly and reliably than externally controlled systems. \n\nMECHANICS \n\nAll units take on continuous activations between a uniform minimum and maximum value. There is also a uniform resting activation for all units (between the minimum and maximum). Units start at random activations, and are updated synchronously at each cycle in one of two possible ways. Either they are updated via any ordinary update rule for which a positive net input (as defined below) increases activation and a negative net input decreases activation, or they are simply reset to their resting activation. 
There is an update probability function that determines the probability of normal update for a unit based on its temperature (as defined below). It should be noted that once the net input for a unit has been calculated, finding its temperature is trivial (the quantity (a_i - rest) in the equation for goodness_i can come outside the summation). \n\nDefinitions: \n\nnetinput_i = sum_j (a_j - rest) x w_ij \n\ntemperature_i = -goodness_i / maxposgdnss_i   if goodness_i >= 0 \ntemperature_i = goodness_i / maxneggdnss_i    otherwise \n\ngoodness_i = sum_j (a_i - rest) x w_ij x (a_j - rest) \nmaxposgdnss_i = the largest positive value that goodness_i could be \nmaxneggdnss_i = the largest negative value that goodness_i could be \n\nMaxposgdnss and maxneggdnss are constants that can be calculated once for each unit at the beginning of simulation. They depend only on the weights into the unit, and the constant maximum, minimum and resting activation values. Temperature is always a value between 1 and -1, with 1 representing high temperature and -1 low. \n\nSIMULATIONS \n\nThe parameters below were used in processing both of the networks that were tested. The first network processed (Figure 1a) has two local maxima that are extremely close to its two global maxima. This is a very 'difficult' network in the sense that the search for a global maximum must be extremely sensitive to the minute difference between the global maxima and the next-best local maxima. The other network processed (Figure 1b) has many local maxima, but none of them are especially close to the global maxima. This is an 'easy' network in the sense that the slow and cautious process that was used was not really necessary. A more appropriate set of parameters would have improved performance on this second network, but it was not used, in order to illustrate the relative generality of the algorithm. 
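The temperature computation described under MECHANICS reduces to a few vector operations: net input is one matrix-vector product, goodness is an elementwise product with (a_i - rest), and the two normalizing constants follow from the weights and the activation bounds. A minimal NumPy sketch of those definitions (the function name and the symmetric-matrix representation are our own, not the paper's):

```python
import numpy as np

def unit_temperatures(a, W, rest=0.5, amin=0.0, amax=1.0):
    """Per-unit temperature from the ALA definitions (a sketch).

    a : activation vector; W : symmetric weight matrix with W[i, j] = w_ij
    (zero diagonal); rest/amin/amax : resting, minimum, maximum activations.
    """
    d = a - rest
    netinput = W @ d                 # netinput_i = sum_j (a_j - rest) * w_ij
    goodness = d * netinput          # (a_i - rest) comes outside the summation

    # Each term (a_i - rest) * w_ij * (a_j - rest) is at most |w_ij| * span^2
    # in magnitude, where span is the larger distance from rest to a bound.
    span = max(amax - rest, rest - amin)
    maxposgdnss = span ** 2 * np.abs(W).sum(axis=1)   # largest possible goodness_i
    maxneggdnss = -maxposgdnss                        # largest negative value

    return np.where(goodness >= 0,
                    -goodness / maxposgdnss,          # cold: in [-1, 0]
                    goodness / maxneggdnss)           # hot:  in (0, 1]
```

For example, two mutually excitatory units that are both fully on have maximal goodness, so each comes out at temperature -1 (fully cold); if one is on and the other off, both come out at +1 (fully hot).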
\n\nParameters: \n\nmaximum activation = 1 \nminimum activation = 0 \nresting activation = 0.5 \n\nnormal update rule: \n\nΔactivation_i = netinput_i x (maxactivation - activation_i) x k   if netinput_i >= 0 \nΔactivation_i = netinput_i x (activation_i - minactivation) x k   otherwise \n\nwith k = 0.6 \n\nupdate probability function: \n\n[Figure: plot of the probability of normal update as a function of temperature, with marked points at temperatures -1, -0.79, and 0.] \n\nThis function defines a process that moves slowly towards a global maximum, moves away from even good solutions easily, and 'freezes' units that are colder than -0.79. \n\nRESULTS \n\nThe results of running the Automatic Local Annealing process on these two networks (in comparison to a standard Boltzmann Machine's performance) are summarized in Figures 2a and 2b. With Automatic Local Annealing (ALA), the probability of having found a stable global maximum departs from zero fairly soon after processing begins, and increases smoothly up to one. The Boltzmann Machine, instead, makes little 'useful' progress until the end of the annealing schedule, and then quickly moves into a solution which may or may not be a global maximum. In order to get its reliability near that of ALA, the Boltzmann Machine's schedule must be so slow that solutions are found much more slowly than with ALA. Conversely, in order to start finding solutions as quickly as ALA, such a short schedule is necessary that the reliability becomes much worse than ALA's. Finally, if one makes a more reasonable comparison to the Boltzmann Machine (either by changing the parameters of the ALA process to maximize its performance on each network, or by using a single annealing schedule with the Boltzmann Machine for both networks), the overall performance advantage for ALA increases substantially. \n\nDISCUSSION \n\nHOW IT WORKS \n\nThe characteristics of the approach to a global maximum are determined by the shape of the update probability function. 
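The paper gives the update probability function only as a plot, so any formula is a guess; the sketch below uses a hypothetical piecewise-linear shape that is 0 at temperature 1 and saturates at probability 1 at the -0.79 freezing point, combined with the normal update rule and parameters just listed, to run one synchronous cycle:

```python
import numpy as np

rng = np.random.default_rng(0)

def ala_cycle(a, W, rest=0.5, amin=0.0, amax=1.0, k=0.6, freeze=-0.79):
    """One synchronous Automatic Local Annealing cycle (sketch).

    The update probability function here is a hypothetical stand-in:
    linear in temperature, reaching probability 1 at the freezing point.
    """
    d = a - rest
    netinput = W @ d
    goodness = d * netinput
    span = max(amax - rest, rest - amin)
    temperature = -goodness / (span ** 2 * np.abs(W).sum(axis=1))  # in [-1, 1]

    # P(normal update): 0 at T = 1, rising to 1 for T <= freeze (unit 'freezes').
    p_normal = np.clip((1.0 - temperature) / (1.0 - freeze), 0.0, 1.0)
    normal = rng.random(a.size) < p_normal

    # Normal update rule from the parameters above; non-updated units reset to rest.
    delta = np.where(netinput >= 0,
                     netinput * (amax - a) * k,
                     netinput * (a - amin) * k)
    return np.where(normal, np.clip(a + delta, amin, amax), rest)
```

A state at a global maximum is then completely stable: every unit is fully cold, updates normally with probability 1, and its net input drives no further change away from the extreme activation.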
By modifying this shape, one can control such things as: how quickly/steadily the network moves towards a global maximum, how easily it moves away from local maxima, how good a solution must be in order for it to become completely stable, and so on. The only critical feature of the function is that as temperature decreases the probability of normal update increases. In this way, the colder a unit gets the more steadily it progresses towards an extreme activation value, and the hotter a unit gets the more time it spends near resting activation. From this you get hot \n\nFigure 1a. A 'Difficult' Network. [Diagram omitted; visible connection weights are 1 and -2.5.] Global maxima are: 1) all eight upper units on, with the remaining units off, 2) all eight lower units on, with the remaining units off. Next best local maxima are: 1) four upper left and four lower right units on, with the remaining units off, 2) four upper right and four lower left units on, with the remaining units off. \n\nFigure 1b. An 'Easy' Network. [Diagram omitted; visible connection weights are 1 and -1.5.] Necker cube network (McClelland & Rumelhart, 1988). Each set of four corresponding units are connected as shown above. Connections for the other three such sets were omitted for clarity. The global maxima have all units in one cube on with all units in the other off. \n\nFigure 2a. Performance On A 'Difficult' Network (Figure 1a). [Plot: probability of being in a stable global maximum over 0-250 cycles of processing, for Automatic Local Annealing and for the Boltzmann Machine (B.M.) with a 125 cycle schedule.] 
Figure 2b. Performance On An 'Easy' Network (Figure 1b). [Plot: probability of being in a stable global maximum over 0-100 cycles of processing, for Automatic Local Annealing and for the Boltzmann Machine (B.M.) with 30, 20, and 10 cycle schedules.] \n\n1 Each line is based on 100 trials. A stable global maximum is one that the network remained in for the rest of the trial. \n2 All annealing schedules were the best performing three-leg schedules found. \n\nunits that have little effect on movement in the activation space (since they contribute little to any unit's net input), and cold units that compete to control this critical movement. \n\nThe cold units 'cool' connected units that are in agreement with them, and 'heat' connected units that are in disagreement (see temperature equation). As the connected agreeing units are cooled, they too begin to cool their connected agreeing units. In this way coldness spreads out, stabilizing sets of units whose hypotheses agree. This spreading is what makes the ALA algorithm work. A unit's decision about its hypothesis can now be felt by units that are only distantly connected, as must be the case if units are to act in accordance with any global criterion (e.g. the overall quality of the states of these networks). \n\nIn order to see why global maxima are found, one must consider the network as a whole. In general, the amount of time spent in any state is inversely related to the amount of heat in that state (since heat is directly related to instability). The state(s) containing the least possible heat for a given network will be the most stable. These state(s) will also represent the global maxima (since they have the least total 'dissatisfaction' of constraints). Therefore, given infinite processing time, the most commonly visited states will be the global maxima. More importantly, the 'visitedness' of every state will be proportional to its overall quality (a mathematical description of this has not yet been developed). 
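The 'overall quality' invoked here is just the summed satisfaction of all constraints, i.e. half the sum of the unit goodnesses, since each pairwise term is counted from both ends. A small check on a hypothetical four-unit Necker-style fragment (the weights and helper name are ours, not the paper's):

```python
import numpy as np

def state_quality(a, W, rest=0.5):
    """Summed 'satisfiedness' of all constraints in state a:
    sum over pairs of (a_i - rest) * w_ij * (a_j - rest),
    which equals half of sum_i goodness_i."""
    d = a - rest
    return 0.5 * d @ W @ d

# Two rival pairs: units 0 and 1 support each other, units 2 and 3 support
# each other, and each unit inhibits its counterpart in the rival pair.
W = np.array([[ 0.,  1., -1.,  0.],
              [ 1.,  0.,  0., -1.],
              [-1.,  0.,  0.,  1.],
              [ 0., -1.,  1.,  0.]])

consistent = np.array([1., 1., 0., 0.])   # one interpretation fully on
mixed      = np.array([1., 0., 1., 0.])   # hypotheses in conflict
```

Under these weights the consistent state scores 1.0 and the mixed state -1.0, so the coldest (least-heat) stable states coincide with the quality maxima, as the paragraph above argues.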
\n\nThis latter characteristic provides good practical benefits when one employs a notion of solution satisficing. This is done by using an update probability function that allows units to 'freeze' (i.e. have normal update probabilities of 1) at temperatures higher than -1 (as was done with the simulations described above). In this condition, states can become completely stable without perfectly satisfying all constraints. As the time of simulation increases, the probability of being in any given state approaches a value proportional to its quality. Thus, if there are any states good enough to be frozen, the chances of not having hit one will decrease with time. The amount of time necessary to satisfice is directly related to the freezing point used. Times as small as 0 (for freezing points > 1) and as large as infinity (for freezing points < -1) can be achieved. This type of time/quality trade-off is extremely useful in many practical applications. \n\nMEASURING PERFORMANCE \n\nWhile ALA finds global maxima faster and more reliably than Boltzmann Machine annealing, these are not the only benefits to ALA processing. A number of other elements make it preferable to externally scheduled annealing processes: 1) Various solutions to subparts of problems are found and, at least temporarily, maintained during processing. If one considers constraint satisfaction networks in terms of schema processors, this corresponds nicely to the simultaneous processing of all levels of schemas and subschemas. Subschemas with obvious solutions get filled in quickly, even when the higher level schemas have still not found real solutions. While these initial sub-solutions may not end up as part of the final solution, their appearance during processing can still be quite useful in some settings. 2) ALA is much more biologically feasible than externally scheduled systems. 
Not only can units function on their own (without the use of an intelligent external processor), but the paths traversed through the activation space (as described by the schema example above) also parallel human processing more closely. 3) ALA processing may lend itself to simple learning algorithms. During processing, units are always acting in close accord with the constraints that are present. At first, distant constraints are ignored in favor of more immediate ones, but regardless the units rarely actually defy any constraints in the network. Thus basic approaches to making weight adjustments, such as continuously increasing weights between units that are in agreement about their hypotheses, and decreasing weights between units that are in disagreement about their hypotheses (Minsky & Papert, 1968), may have new power. This is an area of current research, which would represent an enormous time savings over Boltzmann Machine type learning (Ackley et al., 1985) if it were to be found feasible. \n\nREFERENCES \n\nAckley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A Learning Algorithm for Boltzmann Machines. Cognitive Science, 9, 147-169. \n\nMcClelland, J. L., & Rumelhart, D. E. (1988). Explorations in Parallel Distributed Processing. Cambridge, MA: MIT Press. \n\nMinsky, M., & Papert, S. (1968). Perceptrons. Cambridge, MA: MIT Press. \n", "award": [], "sourceid": 136, "authors": [{"given_name": "Jared", "family_name": "Leinbach", "institution": null}]}