{"title": "Q-Learning with Hidden-Unit Restarting", "book": "Advances in Neural Information Processing Systems", "page_first": 81, "page_last": 88, "abstract": null, "full_text": "Q-Learning with Hidden-Unit Restarting \n\nCharles W. Anderson \n\nDepartment of Computer Science \n\nColorado State University \n\nFort Collins, CO 80523 \n\nAbstract \n\nPlatt's resource-allocation network (RAN) (Platt, 1991a, 1991b) \nis modified for a reinforcement-learning paradigm and to \"restart\" \nexisting hidden units rather than adding new units. After restart(cid:173)\ning, units continue to learn via back-propagation. The resulting \nrestart algorithm is tested in a Q-Iearning network that learns to \nsolve an inverted pendulum problem. Solutions are found faster on \naverage with the restart algorithm than without it. \n\n1 \n\nIntroduction \n\nThe goal of supervised learning is the discovery of a compact representation that \ngeneralizes well . Such representations are typically found by incremental, gradient(cid:173)\nbased search, such as error back-propagation. However, in the early stages of learn(cid:173)\ning a control task, we are more concerned with fast learning than a compact rep(cid:173)\nresentation. This implies a local representation with the extreme being the mem(cid:173)\norization of each experience. An initially local representation is also advantageous \nwhen the learning component is operating in parallel with a conventional, fixed \ncontroller. A learning experience should not generalize widely; the conventional \ncontroller should be preferred for inputs that have not yet been experienced. \n\nPlatt's resource-allocation network (RAN) (Platt, 1991a, 1991b) combines gradient \nsearch and memorization. RAN uses locally tuned (gaussian) units in the hidden \nlayer. The weight vector of a gaussian unit is equal to the input vector for which the \nunit produces its maximal response. 
A new unit is added when the network's error magnitude is large and the new unit's radial domain would not significantly overlap the domains of existing units. Platt demonstrated RAN on the supervised learning task of predicting values in the Mackey-Glass time series.

We have integrated Platt's ideas with the reinforcement-learning algorithm called Q-learning (Watkins, 1989). One major modification is that the network has a fixed number of hidden units, all in a single layer, all of which are trained on every step. Rather than adding units, the least useful hidden unit is selected and its weights are set to new values, after which the gradient-based search continues. Thus, the unit's search is restarted. The temporal-difference errors control restart events in a fashion similar to the way supervised errors control RAN's addition of new units.

The motivation for starting with all units present is that in a parallel implementation, the computation time for a layer of one unit is roughly the same as that for a layer with all of the units. All units are trained from the start. Any that fail to learn anything useful are re-allocated when needed.

Here the Q-learning algorithm with restarts is applied to the problem of learning to balance a simulated inverted pendulum. In the following sections, the inverted pendulum problem and Watkins's Q-learning algorithm are described. Then the details of the restart algorithm are given and results of applying the algorithm to the inverted pendulum problem are summarized.

2 Inverted Pendulum

The inverted pendulum is a classic example of an inherently unstable system. The problem can be used to study the difficult credit-assignment problem that arises when performance feedback is provided only by a failure signal.
This problem has often been used to test new approaches to learning control (from early work by Widrow and Smith, 1964, to recent studies such as Jordan and Jacobs, 1990, and Whitley, Dominic, Das, and Anderson, 1993). It involves a pendulum hinged to the top of a wheeled cart that travels along a track of limited length. The pendulum is constrained to move within the vertical plane. The state is specified by the position and velocity of the cart, the angle between the pendulum and vertical, and the angular velocity of the pendulum.

The only information regarding the goal of the task is provided by the failure signal, or reinforcement, r_t, which signals either the pendulum falling past \u00b112\u00b0 or the cart hitting the bounds of the track at \u00b11 m. The state at time t of the pendulum is presented to the network as a vector, x_t, of the four state variables scaled to be between 0 and 1.

For further details of this problem and other reinforcement learning approaches to this problem, see Barto, Sutton, and Anderson (1983) and Anderson (1987).

3 Q-Learning

The objective of many control problems is to optimize a performance measure over time. For the inverted pendulum problem, we define a reinforcement signal to be -1 when the pendulum angle or the cart position exceeds its bounds, and 0 otherwise. The objective is to maximize the sum of this reinforcement signal over time.

If we had complete knowledge of state transition probabilities, we could apply dynamic programming to find the sequence of pushes that maximizes the sum of reinforcements. Reinforcement learning algorithms have been devised to learn control strategies when such knowledge is not available. In fact, Watkins has shown that one form of his Q-learning algorithm converges to the dynamic programming solution (Watkins, 1989; Watkins and Dayan, 1992).
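The failure-only reinforcement signal described above is simple enough to state in code. The following is a minimal sketch (the function name and argument names are ours, not from the paper); it returns -1 when either bound is violated and 0 otherwise.

```python
import math

def reinforcement(theta, cart_pos):
    """Failure signal: -1 when the pendulum falls past +/-12 degrees or the
    cart hits the track bounds at +/-1 m, and 0 otherwise."""
    failed = abs(theta) > math.radians(12.0) or abs(cart_pos) > 1.0
    return -1.0 if failed else 0.0
```

Note that this signal carries no gradient information about how close the system is to failure; that is exactly what makes the credit-assignment problem hard.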
The essence of Q-learning is the learning and use of a Q function, Q(x, a), that is a prediction of a weighted sum of future reinforcement given that action a is taken when the controlled system is in a state represented by x. This is analogous to the value function in dynamic programming. Specifically, the objective of Q-learning is to form the following approximation:

    Q(x_t, a_t) ≈ Σ_{k=0}^∞ γ^k r_{t+k+1}

where 0 < γ < 1 is a discount rate and r_t is the reinforcement received at time t. Watkins (1989) presents a number of algorithms for adjusting the parameters of Q. Here we focus on using error back-propagation to train a neural network to learn the Q function. For Q-learning, the following temporal-difference error (Sutton, 1988)

    e_t = r_{t+1} + γ max_{a_{t+1}} Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t)

is derived by using max_{a_{t+1}} Q(x_{t+1}, a_{t+1}) as an approximation to Σ_{k=0}^∞ γ^k r_{t+k+2}. See Barto, Bradtke, and Singh (1991) for further discussion of the relationships between reinforcement learning and dynamic programming.

4 Q-Learning Network

For the inverted pendulum experiments reported here, a neural network with a single hidden layer was used to learn the Q(x, a) function. As shown in Figure 1, the network has four inputs for the four state variables of the inverted pendulum and two outputs corresponding to the two possible actions for this problem, similar to Lin (1992). In addition to the weights shown, w and v, the two units in the output layer each have a single weight with a constant input of 0.5.

The activation function of the hidden units is the approximate gaussian function used by Platt. Let d_j be the squared distance between the current input vector, x, and the weights in hidden unit j:

    d_j = Σ_{i=1}^4 (x_i - w_{j,i})²

Here x_i is the ith component of x at the current time.
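The squared distance d_j above can be computed per hidden unit as follows; a minimal sketch with our own function name, where x is the scaled state vector and w is the list of hidden-unit input weight vectors.

```python
def squared_distances(x, w):
    """d_j = sum over i of (x_i - w_{j,i})^2: the squared distance between
    the current input vector x and hidden unit j's input weight vector."""
    return [sum((xi - wji) ** 2 for xi, wji in zip(x, wj)) for wj in w]
```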
The output, y_j, of hidden unit j is

    y_j = exp(-d_j / ρ),  if d_j < ρ;
    y_j = 0,              otherwise,

where ρ controls the radius of the region in which the unit's output is nonzero. Unlike Platt, ρ is constant and equal for all units.

[Figure 1: Q-Learning Network. The four state variables x_1, ..., x_4 feed a single hidden layer, whose outputs feed two output units computing Q(x, -10) and Q(x, +10).]

The output units calculate weighted sums of the hidden unit outputs and the constant input. The output values are the current estimates of Q(x_t, -10) and Q(x_t, 10), which are predictions of future reinforcement given the current observed state of the inverted pendulum and assuming a particular action will be applied in that state.

The action applied at each step is selected as the one corresponding to the larger of Q(x_t, -10) and Q(x_t, 10). To explore the effects of each action, the action with the lower Q value is applied with a probability that decreases with time:

    p = 1 - 0.5λ^t,  if Q(x_t, 10) > Q(x_t, -10);
    p = 0.5λ^t,      otherwise,

    a_t = 10 with probability p;  a_t = -10 with probability 1 - p.

To update all weights, error back-propagation is applied at each step using the following temporal-difference error:

    e_t = γ max_{a_{t+1}} Q(x_{t+1}, a_{t+1}) - Q(x_t, a_t),  if failure does not occur on step t + 1;
    e_t = r_{t+1} - Q(x_t, a_t),                              if failure occurs on step t + 1.

Note that r_t = 0 for all non-failure steps and drops out of the first expression. Weights are updated by the following equations, assuming Unit j is the output unit corresponding to the action taken, and all variables are for the current time t:

    Δw_{k,i} = (β_h / ρ) e_t y_k v_{j,k} (x_i - w_{k,i}),
    Δv_{j,i} = β e_t y_i.

In all experiments, ρ = 2, λ = 0.99999, and γ = 0.9.
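The action-selection rule, the two-case temporal-difference error, and the weight updates just given can be condensed into a short sketch. This is our reading of the equations, not the author's code; all function and variable names are ours, and the update loop takes the hidden outputs y and the index k of the output unit whose action was taken.

```python
import random

# Constants from the text: discount, exploration decay, learning rates, radius.
GAMMA, LAMBDA, BETA, BETA_H, RHO = 0.9, 0.99999, 0.05, 1.0, 2.0

def select_action(q_minus, q_plus, t, rng=random):
    """Take the action with the larger Q value, except that the other
    action is taken with a probability 0.5 * LAMBDA**t that decays with t."""
    p = 1.0 - 0.5 * LAMBDA ** t if q_plus > q_minus else 0.5 * LAMBDA ** t
    return 10 if rng.random() < p else -10

def td_error(q_next_minus, q_next_plus, q_taken, failed):
    """Temporal-difference error: discounted best next Q on non-failure
    steps; on failure, the reinforcement r = -1 and there is no next state."""
    if failed:
        return -1.0 - q_taken
    return GAMMA * max(q_next_minus, q_next_plus) - q_taken

def update_weights(w, v, y, x, e, k):
    """One back-propagation step; k indexes the output unit that produced
    the action actually taken."""
    for j in range(len(w)):
        for i in range(len(x)):
            w[j][i] += (BETA_H / RHO) * e * y[j] * v[k][j] * (x[i] - w[j][i])
        v[k][j] += BETA * e * y[j]
```

At t = 0 the exploration probability is 0.5, so early actions are essentially random; as λ^t decays, the greedy action dominates.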
Values of β and β_h are discussed in Section 6.

5 Restart Algorithm

After weights are modified by back-propagation, conditions for a restart are checked. If conditions are met, a unit is restarted, and processing continues with the next time step. Conditions and primary steps of the restart algorithm appear below as the numbered equations.

5.1 When to Restart

Several conditions must be met before a restart is performed. First, the magnitude of the error, e_t, must be larger than usual. To detect this, exponentially weighted averages of the mean, μ, and variance, σ², of e_t are maintained and used to calculate a normalized error, e′_t:

    e′_t = e_t - μ_t,
    μ_{t+1} = κ μ_t + (1 - κ) e_t,
    σ²_{t+1} = κ σ²_t + (1 - κ) e_t².

For our experiments, κ = 0.99.

Now we can state the first restart condition. A restart is considered on steps for which the magnitude of the error is greater than 0.01 and greater than a constant factor of the error's standard deviation, i.e., whenever

    |e′_t| > 0.01  and  |e′_t| > α √(σ²_t / (1 - κ^t)).    (1)

Of a small number of tested values, α = 0.2 resulted in the best performance. Before choosing a unit to restart for this step, we determine whether or not the current input vector is already \"covered\" by a unit. Assuming y_j is the output of Unit j for the current input vector, the restart procedure is continued only if

    y_j < 0.5, for j = 1, ..., 20.    (2)

5.2 Which Unit to Restart

As stated by Mozer and Smolensky (1989), ideally we would choose the least useful unit as the one that results in the largest error when removed from the network. For the Q-network, this requires the removal of one unit at a time, making multiple attempts to balance the pendulum, and determining which unit, when removed, results in the shortest balancing times.
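The running mean/variance bookkeeping and restart conditions (1) and (2) can be sketched as follows. This is a minimal illustration under our reading of the equations (class and method names are ours); the bias-correction term 1 - κ^t compensates for the zero-initialized statistics on early steps.

```python
import math

ALPHA, KAPPA = 0.2, 0.99

class ErrorTracker:
    """Exponentially weighted mean and variance of the TD error e_t,
    used for restart condition (1)."""
    def __init__(self):
        self.mu, self.var, self.t = 0.0, 0.0, 0

    def update(self, e):
        self.mu = KAPPA * self.mu + (1 - KAPPA) * e
        self.var = KAPPA * self.var + (1 - KAPPA) * e * e
        self.t += 1

    def restart_worthy(self, e):
        """True when |e - mu| exceeds both 0.01 and ALPHA bias-corrected
        standard deviations of the recent errors."""
        if self.t == 0:
            return False
        e_norm = e - self.mu
        return (abs(e_norm) > 0.01 and
                abs(e_norm) > ALPHA * math.sqrt(self.var / (1 - KAPPA ** self.t)))

def uncovered(y):
    """Condition (2): no hidden unit already covers the current input."""
    return all(yj < 0.5 for yj in y)
```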
Rather than following this computationally expensive procedure, we simply took the sum of the magnitudes of a hidden unit's output weights as a measure of its utility. This is one of several utility measures suggested by Mozer and Smolensky and others (e.g., Klopf and Gose, 1969).

After a unit is restarted, it may require further learning experience to acquire a useful function in the network. The amount of learning experience is defined as a sum of magnitudes of the error e_t. The sum of error magnitudes since Unit j was restarted is given by c_j. Once this sum surpasses a maximum, c_max, the unit is again eligible for restarting. Thus, Unit j is restarted when

    |v_{1,j}| + |v_{2,j}| = min_{j′ ∈ {1, ..., 20}} (|v_{1,j′}| + |v_{2,j′}|)    (3)

and

    c_j > c_max.    (4)

Without a detailed search, a value of c_max = 10 was found to result in good performance.

5.3 New Weights for Restarted Unit

Say Unit j is restarted. Its input weights are set equal to the current input vector, x, the one for which the output of the network was in error. One of the two output weights of Unit j is also modified. The output weight through which Unit j modifies the output of the unit corresponding to the action actually taken is set equal to the error, e_t. The other output weight is not modified:

    w_{j,i} = x_i, for i = 1, ..., 4,    (5)

    v_{k,j} = e_t, where k = 1 if a_t = -10; k = 2 if a_t = 10.    (6)

6 Results

The pendulum is said to be balanced when 90,000 steps (1/2 hour of simulated time) have elapsed without failure. After every failure, the pendulum is reset to the center of the track with a zero angle (straight up) and zero velocities. Performance is judged by the average number of failures before the pendulum is balanced. Averages were taken over 30 runs.
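The unit-selection and reseeding steps of Section 5 (equations 3 through 6) can be collected into one sketch. The function name and the convention of resetting c_j after a restart are our assumptions; the paper specifies only that c_j accumulates error magnitudes since the unit's last restart.

```python
def restart_unit(w, v, c, x, e_t, action, c_max=10.0):
    """Restart the least useful hidden unit: smallest |v_1j| + |v_2j| (eq. 3),
    provided its accumulated error c_j exceeds c_max (eq. 4). Its input
    weights are reseeded on the current input (eq. 5) and the output weight
    for the action taken is set to the error (eq. 6). Returns the restarted
    unit's index, or None if the least useful unit is not yet eligible."""
    j = min(range(len(w)), key=lambda u: abs(v[0][u]) + abs(v[1][u]))
    if c[j] <= c_max:
        return None
    w[j] = list(x)                 # eq. 5: copy the current input vector
    k = 0 if action == -10 else 1  # eq. 6: output weight for the action taken
    v[k][j] = e_t
    c[j] = 0.0                     # assumed: error sum restarts for this unit
    return j
```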
Each run consists of choosing initial values for the hidden units' weights from a uniform distribution from 0 to 1, then training the net until the pendulum is balanced for 90,000 steps or a maximum number of 50,000 failures is reached.

To determine the effect of restarting, we compare the performance of the Q-learning algorithm with and without restarts. Back-propagation learning rates are given by β for the output units and β_h for the hidden units. β and β_h were optimized for the algorithm without restarts by testing a large number of values. The best values of those tried are β = 0.05 and β_h = 1.0. These values were used for both algorithms. A small number of values for the additional restart parameters were tested, so the restart algorithm is not optimized for this problem.

Figure 2 is a graph of the number of steps between failures versus the number of failures. Each algorithm was initialized with the same hidden unit weights. Without restarts the pendulum is balanced for this run after 6,879 failures. With restarts it is balanced after 3,415 failures.

[Figure 2: Learning Curves of Balancing Time Versus Failures (averaged over bins of 100 failures). Steps between failures, on a log scale from 10 to 100,000, are plotted against failures, from 0 to 6,000, for the algorithm with and without restarts.]

The performances of the algorithms were averaged over 30 runs, giving the following results. The restart algorithm balanced the pendulum in all 30 runs, within an average of 3,303 failures. The algorithm without restarts was unsuccessful within 50,000 failures for two of the 30 runs.
Not counting the unsuccessful runs, this algorithm balanced the pendulum within an average of 4,923 failures. Considering the unsuccessful runs, this average is 7,928 failures.

In studying the timing of restarts, we observe that initially the number of restarts is small, due to the high variance of e_t in the early stages of learning. During later stages, we see that a single unit might be restarted many times (15 to 20) before it becomes more useful (at least according to our measure) than some other unit.

7 Conclusion

This first test of an algorithm for restarting hidden units in a reinforcement-learning paradigm led to a decrease in learning time for this task. However, much work remains in studying the effects of each step of the restart procedure. Many alternatives exist, most significantly in the method for determining the utility of hidden units. A significant extension of this algorithm would be to consider units with variable-width domains, as in Platt's RAN algorithm.

Acknowledgements

The work was supported in part by the National Science Foundation through Grant IRI-9212191 and by Colorado State University through Faculty Research Grant 1-38592.

References

C. W. Anderson. (1987). Strategy learning with multilayer connectionist representations. Technical Report TR87-509.3, GTE Laboratories, Waltham, MA. Corrected version of an article published in Proceedings of the Fourth International Workshop on Machine Learning, pp. 103-114, June 1987.

A. G. Barto, S. J. Bradtke, and S. P. Singh. (1991). Real-time learning and control using asynchronous dynamic programming. Technical Report 91-57, Department of Computer Science, University of Massachusetts, Amherst, MA, August.

A. G. Barto, R. S. Sutton, and C. W. Anderson. (1983). Neuronlike elements that can solve difficult learning control problems.
IEEE Transactions on Systems, Man, and Cybernetics, 13:835-846. Reprinted in J. A. Anderson and E. Rosenfeld, Neurocomputing: Foundations of Research, MIT Press, Cambridge, MA, 1988.

M. I. Jordan and R. A. Jacobs. (1990). Learning to control an unstable system with forward modeling. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2, pages 324-331. Morgan Kaufmann, San Mateo, CA.

A. H. Klopf and E. Gose. (1969). An evolutionary pattern recognition network. IEEE Transactions on Systems, Science, and Cybernetics, 15:247-250.

L.-J. Lin. (1992). Self-improving reactive agents based on reinforcement learning, planning, and teaching. Machine Learning, 8(3/4):293-321.

M. C. Mozer and P. Smolensky. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 1, pages 107-115. Morgan Kaufmann, San Mateo, CA.

J. C. Platt. (1991a). Learning by combining memorization and gradient descent. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 714-720. Morgan Kaufmann, San Mateo, CA.

J. C. Platt. (1991b). A resource-allocating network for function interpolation. Neural Computation, 3:213-225.

R. S. Sutton. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9-44.

C. J. C. H. Watkins. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University Psychology Department.

C. J. C. H. Watkins and P. Dayan. (1992). Q-learning. Machine Learning, 8(3/4):279-292.

D. Whitley, S. Dominic, R. Das, and C. Anderson. (1993). Genetic reinforcement learning for neurocontrol problems. Machine Learning, to appear.

B. Widrow and F. W. Smith. (1964). Pattern-recognizing control systems.
In Proceedings of the 1963 Computer and Information Sciences (COINS) Symposium, pages 288-317. Spartan, Washington, DC.", "award": [], "sourceid": 597, "authors": [{"given_name": "Charles", "family_name": "Anderson", "institution": null}]}