{"title": "LEARNING BY STATE RECURRENCE DETECTION", "book": "Neural Information Processing Systems", "page_first": 642, "page_last": 651, "abstract": "", "full_text": "LEARNING BY STATE RECURRENCE DETECTION \n\nBruce E. Rosen, James M. Goodwin†, and Jacques J. Vidal \n\nUniversity of California, Los Angeles, Ca. 90024 \n\nABSTRACT \n\nThis research investigates a new technique for unsupervised learning of nonlinear control problems. The approach is applied both to Michie and Chambers' BOXES algorithm and to Barto, Sutton and Anderson's extension, the ASE/ACE system, and has significantly improved the convergence rate of stochastically based learning automata. \n\nRecurrence learning is a new nonlinear reward-penalty algorithm. It exploits information found during learning trials to reinforce decisions resulting in the recurrence of nonfailing states. Recurrence learning applies positive reinforcement during the exploration of the search space, whereas in the BOXES or ASE algorithms, only negative weight reinforcement is applied, and then only on failure. Simulation results show that the added information from recurrence learning increases the learning rate. \n\nOur empirical results show that recurrence learning is faster than both basic failure-driven learning and failure prediction methods. Although recurrence learning has only been tested in failure-driven experiments, there are goal-directed learning applications where detection of recurring oscillations may provide useful information that reduces the learning time by applying negative, instead of positive, reinforcement. \n\nDetection of cycles provides a heuristic to improve the balance between evidence gathering and goal-directed search. \n\nINTRODUCTION \n\nThis research investigates a new technique for unsupervised learning of nonlinear control problems with delayed feedback. 
Our approach is compared both to Michie and Chambers' BOXES algorithm1 and to the extension by Barto et al., the ASE (Associative Search Element) and their ASE/ACE (Adaptive Critic Element) system2, and shows an improved learning time for stochastically based learning automata in failure-driven tasks. \n\nWe consider adaptively controlling the behavior of a system which passes through a sequence of states due to its internal dynamics (which are not assumed to be known a priori) and due to choices of actions made in visited states. Such an adaptive controller is often referred to as a learning automaton. The decisions can be deterministic or can be made according to a stochastic rule. A learning automaton has to discover which action is best in each circumstance by producing actions and observing the resulting information. \n\nThis paper was motivated by the previous work of Barto et al. to investigate neuronlike adaptive elements that affect and learn from their environment. We were inspired by their current work and the recent attention to neural networks and connectionist systems, and have chosen to use the cart-pole control problem2 to enable a comparison of our results with theirs. \n\n† Permanent address: California State University, Stanislaus; Turlock, California. \n\n© American Institute of Physics 1988 \n\nTHE CART-POLE PROBLEM \n\nIn their work on the cart-pole problem, Barto, Sutton and Anderson considered a learning system composed of an automaton interacting with an environment. The problem requires the automaton to balance a pole acting as an inverted pendulum hinged on a movable cart. The cart travels left or right along a bounded one-dimensional track; the pole may swing to the left or right about a pivot attached to the cart. The automaton must learn to keep the pole balanced on the cart, and to keep the cart within the bounds of the track. 
The parameters of the cart/pole system are the cart position and velocity, and the pole angle and angular velocity. The only actions available to the automaton are the applications of a fixed impulsive force to the cart in either the right or left direction; one of these actions must be taken. \n\nThis balancing is an extremely difficult problem if there is no a priori knowledge of the system dynamics, if these dynamics change with time, or if there is no preexisting controller that can be imitated (e.g. Widrow and Smith's3 ADALINE controller). We assumed no a priori knowledge of the dynamics nor any preexisting controller, and anticipate that the system will be able to deal with any changing dynamics. \n\nNumerical simulations of the cart-pole solution via recurrence learning show substantial improvement over the results of Barto et al., and of Michie and Chambers, as is shown in figure 1. The algorithms used, and the results shown in figure 1, will be discussed in detail below. \n\nFigure 1: Performance of the ASE, ASE/ACE, Constant Recurrence (H1) and Short Recurrence (H2) algorithms. (Plot: time until failure versus trial number.) \n\nTHE GENERAL PROBLEM: ASSIGNMENT OF CREDIT \n\nThe cart-pole problem is one of a class of problems known as \"credit assignment\"4, and in particular temporal credit assignment. The recurrence learning algorithm is an approach to the general temporal credit assignment problem. It is characterized by seeking to improve learning by making decisions about early actions. The goal is to find actions responsible for improved or degraded performance at a much later time. \n\nAn example is the bucket brigade algorithm5. This is designed to assign credit to rules in the system according to their overall usefulness in attaining their goals. This is done by adjusting the strength value (weight) of each rule. 
The problem in modifying these strengths is to permit rules activated early in the sequence to result in successful actions later. \n\nSamuel considered the credit assignment problem for his checkers-playing program6. He noted that it is easy enough to credit the rules that combine to produce a triple jump at some point in the game; it is much harder to decide which rules active earlier were responsible for changes that made the later jump possible. \n\nState recurrence learning assigns a strength to an individual rule or action and modifies that action's strength (while the system accumulates experience) on the basis of the action's overall usefulness in the situations in which it has been invoked. In this it follows the bucket brigade paradigm of Holland. \n\nPREVIOUS WORK \n\nThe problems of learning to control dynamical systems have been studied in the past by Widrow and Smith3, Michie and Chambers1, Barto, Sutton, and Anderson2, and Connell7. Although different approaches have been taken and have achieved varying degrees of success, each investigator used the cart/pole problem as the basis for empirically measuring how well their algorithms work. \n\nMichie and Chambers1 built BOXES, a program that learned to balance a pole on a cart. The BOXES algorithm chose the action that had the highest average time until failure. After 600 trials (a trial is a run ending in eventual failure or by some time limit expiration), the program was able to balance the pole for 72,000 time steps. Figure 2a describes the BOXES learning algorithm. States are penalized (after a system failure) according to recency: active states immediately preceding a system failure are punished most. \n\nBarto, Sutton and Anderson2 used two neuronlike adaptive elements to solve the control problem. 
Their ASE/ACE algorithm chose the action with the highest probability of keeping the pole balanced in the region, and was able to balance the pole for over 60,000 time steps before completion of the 100th trial. \n\nFigures 2a and 2b: The BOXES and ASE/ACE (Associative Search Element - Adaptive Critic Element) algorithms. \n\nFigure 2a shows the BOXES (and ASE) learning algorithm paradigm. When the automaton enters a failure state (C), all states that it has traversed (shaded rectangles) are punished, although state B is punished more than state A. (Failure states are those at the edges of the diagram.) Figure 2b describes the ASE/ACE learning algorithm. If a system failure occurs before a state's expected failure time, the state is penalized. If a system failure occurs after its expected failure time, the state is rewarded. State A is penalized because a failure occurred at B sooner than expected. State A's expected failure time is the time for the automaton to traverse from state A to failure point C. When leaving state A, the weights are updated if the new state's expected failure time differs from that of state A. \n\nAnderson8 used a connectionist system to learn to balance the pole. Unlike the previous experiments, the system did not provide well-chosen states a priori. On the average, 10,000 trials were necessary to learn to balance the pole for 7000 time steps. \n\nConnell and Utgoff7 developed an approach that did not depend on partitioning the state space into discrete regions. They used Shepard's function9,10 to interpolate the degree of desirability of a cart-pole state. The system learned the control task after 16 trials. However, their system used a knowledge representation that had a priori information about the system. \n\nOTHER RELATED WORK \n\nKlopf11 proposed a more neurological class of differential learning mechanisms that correlates earlier changes of inputs with later changes of outputs. 
The adaptation formula multiplies the change in outputs by the weighted sum, over the τ previous time steps, of the absolute values of the previous input weights (wj), the previous differences in inputs (Δxj), and the previous time coefficients (cj). \n\nSutton's temporal differences (TD)12 approach is one of a class of adaptive prediction methods. Elements of this class use the sum of previously predicted output values multiplied by the gradient and an exponentially decaying coefficient to modify the weights. Barto and Sutton13 used temporal differences as the underlying learning procedure for classical conditioning. \n\nTHE RECURRENCE LEARNING METHOD \n\nDEFINITIONS \n\nA state is the set of values (or ranges) of parameters sufficient to specify the instantaneous condition of the system. \n\nThe input decoder groups the environmental states into equivalence classes: elements of one class have identical system responses. Every environmental input is mapped into one of n input states. (All further references to \"states\" assume that the input values fall into the discrete ranges determined by the decoder, unless otherwise specified.) \n\nStates returned to after visiting one or more alternate states recur. \n\nAn action causes the modification of system parameters, which may change the system state. However, no change of state need occur, since the altered parameter values may be decoded within the same ranges. \n\nA weight, w(t), is associated with each action for each state, with the probability of an allowed action dependent on the current value of its weight. \n\nA rule determines which of the allowable actions is taken. The rule is not deterministic. It chooses an action stochastically, based on the weights. \n\nWeight changes, Δw(t), are made to reduce the likelihood of choosing an action which will cause an eventual failure. 
These changes are made based on the idea that the previous action of an element, when presented with input x(t), had some influence in causing a similar pattern to occur again. Thus, weight changes are made to increase the likelihood that an element produces the same action f(t) when patterns similar to x(t) occur in the future. \n\nFor example, consider the classic problem of balancing a pole on a moving cart. The state is specified by the positions and velocities of both the cart and the pole. The allowable actions are fixed velocity increments to the right or to the left, and the rule determines which action to take, based on the current weights. \n\nTHE ALGORITHM \n\nThe recurrence learning algorithm presented here is a nonlinear reward-penalty method14. Empirical results show that it is successful for stationary environments. In contrast to other methods, it also may be applicable to nonstationary environments. Our efforts have been to develop algorithms that reward decision choices that lead the controller/environment to quasi-stable cycles that avoid failure (such as limit cycles, converging oscillations and absorbing points). \n\nOur technique exploits recurrence information obtained during learning trials. The system is rewarded upon return to a previous state; however, weight changes are only permitted when a state transition occurs. If the system returns to a state, it has avoided failure. A recurring state is rewarded. A sequence of recurring states can be viewed as evidence for a (possibly unstable) cycle. The algorithm forms temporal \"cause and effect\" associations. \n\nTo optimize performance, dynamic search techniques must balance between choosing a search path with known solution costs, and exploring new areas of the search space to find better or cheaper solutions. This is known as the two-armed bandit problem15, i.e. 
given a two-armed slot machine with one arm's observed reward probabilities higher than the other's, one should not exclude playing the arm with the lesser payoff. Like the ASE/ACE system, recurrence learning learns while searching, in contrast to the BOXES and ASE algorithms, which learn only upon failure. \n\nRANGE DECODING \n\nIn our work, as in that of Barto and others, the real-valued input parameters are analyzed as members of ranges. This reduces computing resource demands. Only a limited number of ranges are allowed for each parameter. It is possible for these ranges to overlap, although this aspect of range decoding is not discussed in this paper, and the ranges were considered nonoverlapping. When the parameter value falls into one of the ranges, that range is active. The specification of a state consists of one of the active ranges for each of the parameters. If the ranges do not overlap, then the set of parameter values specifies one unique state; otherwise the set of parameter values may specify several states. Thus, the parameter values at any time determine one or several active states Si from the set of n possible states. \n\nThe value of each environmental parameter falls into one of a number of ranges, which may be different for different parameters. A state is specified by the active range for each parameter. \n\nThe set of input parameter values is decoded into one (or more) of n states Si, 0 <= i <= n. For this problem, boolean values are used to describe the activity level of a state Si: the activity value of a state is 1 if the state is active, or 0 if it is inactive. \n\nACTION DECISIONS \n\nOur model is the same as that of the BOXES and ASE/ACE systems, where only one input (and state) is active at any given time. All states were nonoverlapping and mutually exclusive, although there was no reason to preclude them from overlapping other than for consistency with the two previous models. 
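The range decoding above can be sketched in a few lines of code: each real-valued parameter falls into one of a small number of nonoverlapping ranges, and the combination of active ranges selects one of the n discrete states. This is only an illustrative sketch; the bin boundaries below are hypothetical, since the paper does not give the partition it used for the cart-pole parameters.

```python
# Sketch of BOXES-style range decoding. Each parameter is binned into one of a
# few nonoverlapping ranges; the tuple of active ranges indexes one discrete
# state. Boundary values here are hypothetical, not taken from the paper.
import bisect

# Interior boundaries per parameter: cart position, cart velocity,
# pole angle, pole angular velocity. k boundaries -> k+1 ranges.
BOUNDARIES = [
    [-0.8, 0.8],          # cart position
    [-0.5, 0.5],          # cart velocity
    [-0.1, 0.0, 0.1],     # pole angle
    [-0.5, 0.5],          # pole angular velocity
]

def decode_state(params):
    """Map a tuple of real-valued parameters to a single state index."""
    index = 0
    for value, edges in zip(params, BOUNDARIES):
        # bisect finds which of the len(edges)+1 ranges the value falls into.
        index = index * (len(edges) + 1) + bisect.bisect(edges, value)
    return index

n_states = 1
for edges in BOUNDARIES:
    n_states *= len(edges) + 1   # 3 * 3 * 4 * 3 = 108 states
```

With nonoverlapping ranges, as in the experiments here, each parameter tuple maps to exactly one active state, whose activity value xi(t) is 1 while all others are 0.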
In the ASE/ACE system, and in ours as well, the output decision rule for the controller is based on the weighted sum of its inputs plus some stochastic noise. The action (output) decision of the controller is either +1 or -1, as given by y(t) = f(sum over i of wi(t) xi(t) + noise(t)), where f(z) = +1 if z >= 0, and -1 if z < 0. \n\nUpon a state transition, there is a weight change by the amount α2 multiplied by the reward value, r2(t), and the current eligibility, e2,i(t). For simplicity, the reward value, r2(t), may be taken to be some positive constant, although it need not be; any environmental feedback yielding a reinforcement value as a function of time could be used instead. The second eligibility function e2,i(t) yields one of three constant values for H1: -ρ2, 0, or ρ2, according to formula (7) below: \n\ne2,i(t) = 0, if t - ti,last = 1 or ti,last = 0; e2,i(t) = ρ2 xi(t) y(ti,last), otherwise (7) \n\nwhere ti,last is the last time that the state was active. If a state has not previously been active (i.e. xi(t) = 0 for all t), then ti,last = 0. As the formula shows, e2,i(t) = 0 if the state has not been previously visited or if no state transition occurred in the last time step; otherwise, e2,i(t) = ρ2 xi(t) y(ti,last). \n\nThe direction (increase or decrease) of the weight change due to the final term in (6) is that of the last action taken, y(ti,last). \n\nHeuristic H1 is called constant recurrence learning because the eligibility function is designed to reinforce any cycle. \n\nHEURISTIC H2: Reward a short cycle more than a longer one. \n\nHeuristic H2 is called short recurrence learning because the eligibility function is designed to reinforce shorter cycles more than longer cycles. \n\nREINFORCEMENT OF SHORTER CYCLES \n\nThe basis of the second heuristic is the conjecture that short cycles converge more easily to absorbing points than long ones, and that long cycles diverge more easily than shorter ones, although any cycle can \"grow\" or diverge to a larger cycle. 
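The decision rule and the H1 eligibility of formula (7) can be sketched as follows. The function and parameter names, the Gaussian noise model, and the value of ρ2 are our assumptions for illustration; the paper states only that the action is the sign of the weighted input sum plus stochastic noise, and that e2,i(t) takes the values -ρ2, 0, or ρ2.

```python
# Minimal sketch of the stochastic action rule and the H1 (constant
# recurrence) eligibility. The noise model and the constant P2 are
# hypothetical; only the structure follows the formulas in the text.
import random

P2 = 0.5  # recurrence eligibility constant rho_2 (hypothetical value)

def action(weights, x, sigma=0.1):
    """y(t) = f(sum_i w_i x_i + noise), with f(z) = +1 if z >= 0 else -1."""
    z = sum(w * xi for w, xi in zip(weights, x)) + random.gauss(0.0, sigma)
    return 1 if z >= 0 else -1

def e2_constant(t, t_last, x_i, y_last):
    """Constant recurrence eligibility, formula (7).

    Returns 0 if the state was never visited (t_last == 0) or if no state
    transition occurred in the last time step (t - t_last == 1); otherwise
    +/- P2, with the sign of the last action y(t_last).
    """
    if t_last == 0 or t - t_last == 1:
        return 0.0
    return P2 * x_i * y_last
```

A weight update for state i would then add α2 · r2(t) · e2_constant(...) on each state transition, so that the change carries the sign of the last action taken in that state.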
The following extension to our basic heuristic is proposed. The formula for the recurrence eligibility function is: \n\ne2,i(t) = 0, if t - ti,last = 1 or ti,last = 0; e2,i(t) = (ρ2 / (ρ2 + t - ti,last)) xi(t) y(ti,last), otherwise (8) \n\nThe current eligibility function e2,i(t) is similar to the previous eligibility function in (7); however, e2,i(t) reinforces shorter cycles more, because the eligibility decays with time. The value returned from e2,i(t) is inversely proportional to the period of the cycle from ti,last to t. H2 reinforces converging oscillations; the term α2 r2(t) e2,i(t) in (6) ensures weight reinforcement for returning to an already visited state. \n\nFigures 3a and 3b: The Constant Recurrence and Short Recurrence algorithms. \n\nFigure 3a shows the Constant Recurrence algorithm (H1). A state is rewarded when it is reactivated by a transition from another state. In the example, state A is rewarded by a constant regardless of whether the cycle traversed states B or C. Figure 3b describes the Short Recurrence algorithm (H2). A state is rewarded according to the difference between the current time and its last activation time. Small differences are rewarded more than large differences. In the example, state A is rewarded more when the cycle (through state C) traverses the states shown by the dark heavy line rather than when the cycle (through state B) traverses the lighter line, since state A recurs sooner when traversing the darker line. \n\nSIMULATION RESULTS \n\nWe simulated four algorithms: ASE, ASE/ACE, and the two recurrence algorithms. Each experiment consisted of ten runs of the cart-pole balancing task, each consisting of 100 trials. Each trial lasted for 500,000 time steps or until the cart-pole system failed (i.e. the pole fell or the cart went beyond the track boundaries). 
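The two eligibility functions compared in these simulations, formulas (7) and (8), differ only in the decay term. A minimal sketch of the H2 (short recurrence) eligibility, with a hypothetical value for ρ2, makes the decay concrete:

```python
# Sketch of the H2 (short recurrence) eligibility of formula (8): the
# eligibility decays with the time since the state was last active, so
# shorter cycles are reinforced more. P2 (rho_2) is a hypothetical constant.
P2 = 0.5

def e2_short(t, t_last, x_i, y_last):
    """Short recurrence eligibility, formula (8)."""
    if t_last == 0 or t - t_last == 1:   # never visited, or no transition
        return 0.0
    return (P2 / (P2 + (t - t_last))) * x_i * y_last
```

For example, a state revisited after 2 steps receives eligibility 0.5/2.5 = 0.2, while one revisited after 10 steps receives only 0.5/10.5, so quicker recurrence earns the larger reward.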
In an effort to conserve CPU time, simulations were also terminated when the system achieved two consecutive trials each lasting for over 500,000 time steps; all remaining trials were assumed to also last 500,000 time steps. This assumption was reasonable: the resulting weight space causes the controller to become deterministic regardless of the influence of stochastic noise. Because of the long time required to run simulations, no attempts were made to optimize the parameters of the algorithm. \n\nAs in Barto2, each trial began with the cart centered and the pole upright. No assumptions were made as to the state space configuration, the desirability of the initial states, or the continuity of the state space. \n\nThe first experiment consisted of failure and recurrence reward learning. The ASE failure learning runs averaged 1,578 time steps until failure after 100 trials*. Next, the predictive ASE/ACE system was run as a comparative metric, and it was found that this method caused the controller to average 131,297 time steps until failure; the results are comparable to those described by Barto, Sutton and Anderson. \n\nIn the next experiment, the short recurrence learning system was added to the ASE system. Again, ten 100-trial learning sessions were executed. On the average, the short recurrence learning algorithm ran for over 400,736 time steps after the 100th trial, bettering the ASE/ACE system by 205%. \n\nIn the final experiment, constant recurrence learning with the ASE system was simulated. Constant recurrence learning eliminated failure after only 207,562 time steps. \n\nFigure 1 shows the ASE, ASE/ACE, Constant recurrence learning (H1) and Short recurrence learning (H2) failure rates averaged over 10 simulation runs. \n\nDISCUSSION \n\nDetection of cycles provides a heuristic for the \"two-armed bandit\" problem to decide between evidence gathering and goal-directed search. 
The algorithm allows the automaton to search outward from the cycle states (states with high probability of revisitation) to the more unexplored search space. The rate of exploration is proportional to the recurrence learning parameter; as it is decreased, the influence of the cycles governing the decision process also decreases, and the algorithm explores more of the search space that is not part of any cycle or oscillation path. \n\n* However, there was a relatively large degree of variance in the final trials. The last 10 trials (averaged over each of the 10 simulations) ranged from 607 to 15,459 time steps until failure. \n\nTHE FUTURE \n\nOur future experiments will study the effects of rewarding predictions of cycle lengths in a manner similar to the prediction of failure used by the ASE/ACE system. The effort will be to minimize the differences of predicted times of cycles in order to predict their period. Results of this experiment will be shown in future reports. We hope to show that this recurrence prediction system is generally superior to either the ASE/ACE predictive system or the short recurrence system operating alone. \n\nCONCLUSION \n\nThis paper presented an extension to the failure-driven learning algorithm based on reinforcing decisions that cause an automaton to enter environmental states more than once. The controller learns to synthesize the best values by reinforcing areas of the search space that produce recurring state visitation. Cycle states, which under normal failure-driven learning algorithms do not learn, achieve weight alteration from success. Simulations show that recurrence reward algorithms provide improved overall learning of the cart-pole task, with a substantial decrease in learning time. \n\nREFERENCES \n\n1. D. Michie and R. Chambers, in Machine Intelligence, E. Dale and D. Michie, Eds. (Oliver and Boyd, Edinburgh, 1968), p. 137. \n2. A. Barto, R. Sutton, and C. Anderson, COINS Tech. Rept. No. 82-20, 1982. \n3. B. Widrow and F. Smith, in Computer and Information Sciences, J. Tou and R. Wilcox, Eds. (Cleaver-Hume Press, 1964). \n4. M. Minsky, in Proc. IRE 49, 8 (1961). \n5. J. Holland, in Proc. Int. Conf. on Genetic Algorithms and their Applications, 1985, p. 1. \n6. A. Samuel, IBM Journ. Res. and Dev. 3, 211 (1959). \n7. M. Connell and P. Utgoff, in Proc. AAAI-87 (Seattle, 1987), p. 456. \n8. C. Anderson, COINS Tech. Rept. No. 86-50, Amherst, MA, 1986. \n9. R. Barnhill, in Mathematical Software III (Academic Press, 1977). \n10. L. Schumaker, in Approximation Theory II (Academic Press, 1976). \n11. A. H. Klopf, in IEEE Int. Conf. on Neural Networks, June 1987. \n12. R. Sutton, GTE Tech. Rept. TR87-509.1, GTE Labs Inc., Jan. 1987. \n13. R. Sutton and A. G. Barto, Tech. Rept. TR87-5902.2, March 1987. \n14. A. Barto and P. Anandan, IEEE Trans. SMC 15, 360 (1985). \n15. M. Sato, K. Abe, and H. Takeda, IEEE Trans. SMC 14, 528 (1984). \n", "award": [], "sourceid": 33, "authors": [{"given_name": "Bruce", "family_name": "Rosen", "institution": null}, {"given_name": "James", "family_name": "Goodwin", "institution": null}, {"given_name": "Jacques", "family_name": "Vidal", "institution": null}]}