{"title": "Fast, Robust Adaptive Control by Learning only Forward Models", "book": "Advances in Neural Information Processing Systems", "page_first": 571, "page_last": 578, "abstract": null, "full_text": "Fast, Robust Adaptive Control by Learning only Forward Models \n\nAndrew W. Moore \n\nMIT Artificial Intelligence Laboratory \n545 Technology Square, Cambridge, MA 02139 \nawm@ai.mit.edu \n\nAbstract \n\nA large class of motor control tasks requires that on each cycle the controller is told its current state and must choose an action to achieve a specified, state-dependent, goal behaviour. This paper argues that the optimization of learning rate, the number of experimental control decisions before adequate performance is obtained, and robustness is of prime importance, if necessary at the expense of computation per control cycle and memory requirement. This is motivated by the observation that a robot which requires two thousand learning steps to achieve adequate performance, or a robot which occasionally gets stuck while learning, will always be undesirable, whereas moderate computational expense can be accommodated by increasingly powerful computer hardware. It is not unreasonable to assume the existence of inexpensive 100 Mflop controllers within a few years, and so even processes with control cycles in the low tens of milliseconds will have millions of machine instructions in which to make their decisions. This paper outlines a learning control scheme which aims to make effective use of such computational power. \n\n1 MEMORY BASED LEARNING \n\nMemory-based learning is an approach applicable to both classification and function learning in which all experiences presented to the learning box are explicitly remembered. The memory, Mem, is a set of input-output pairs, Mem = {(X1, Y1), (X2, Y2), ..., (Xk, Yk)}. 
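Such a memory, and the kind of local query described next, can be sketched in a few lines of Python. This is an illustrative sketch (one-dimensional inputs, made-up data), not the implementation used in the paper:

```python
import math

# Mem holds the remembered input-output experiences (X_i, Y_i).
Mem = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]

def nearest_neighbour(x_query, mem):
    """Predict with the Y of the experience whose X is closest to x_query."""
    return min(mem, key=lambda e: (e[0] - x_query) ** 2)[1]

def kernel_regression(x_query, mem, k_width):
    """Shepard's interpolation: a Gaussian-weighted average of all stored Y."""
    ws = [math.exp(-(x - x_query) ** 2 / k_width ** 2) for x, _ in mem]
    return sum(w * y for w, (_, y) in zip(ws, mem)) / sum(ws)
```

Both predictors do all of their work at query time; adding an experience is just appending a pair to Mem.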
When a prediction is required of the output of a novel input Xquery, the memory is searched to obtain experiences with inputs close to Xquery. These local neighbours are used to determine a locally consistent output for the query. Three memory-based techniques, Nearest Neighbour, Kernel Regression, and Local Weighted Regression, are shown in the accompanying figure. \n\n[Figure: three panels showing the same data fitted by Nearest Neighbour, Kernel Regression, and Local Weighted Regression.] \n\nNearest Neighbour: Ypredict(Xquery) = Yi where i minimizes {(Xi - Xquery)2 : (Xi, Yi) \u2208 Mem}. There is a general introduction in [5], some recent applications in [11], and recent robot learning work in [9, 3]. \n\nKernel Regression: Also known as Shepard's interpolation or Local Weighted Averages. Ypredict(Xquery) = (\u03a3 WiYi) / \u03a3 Wi where Wi = exp(-(Xi - Xquery)2 / Kwidth2). [6] describes some variants. \n\nLocal Weighted Regression: finds the linear mapping Y = AX to minimize the sum of weighted squares of residuals \u03a3 Wi(Yi - AXi)2. Ypredict is then AXquery. LWR was introduced for robot learning control by [1]. \n\n2 A MEMORY-BASED INVERSE MODEL \n\nAn inverse model maps State x Behaviour -> Action (s x b -> a). Behaviour is the output of the system, typically the next state or time derivative of state. The learned inverse model provides a conceptually simple controller: \n\n1. Observe s and bgoal. \n2. a := inverse-model(s, bgoal). \n3. Perform action a and observe actual behaviour bactual. \n4. Update MEM with (s, bactual -> a): if we are ever again in state s and require behaviour bactual we should apply action a. \n\nMemory-based versions of this simple algorithm have used nearest neighbour [9] and LWR [3]. bgoal is the goal behaviour: depending on the task it may be fixed or it may vary between control cycles, perhaps as a function of state or time. The algorithm provides aggressive learning: during repeated attempts to achieve the same goal behaviour, the action which is applied is not an incrementally adjusted version of the previous action, but is instead the action which the memory and the memory-based learner predict will directly achieve the required behaviour. If the function is locally linear then the sequence of actions which are chosen is closely related to the Secant method [4] for numerically finding the zero of a function by bisecting the line between the closest approximations that bracket the y = 0 axis. If learning begins with an initial error E0 in the action choice, and we wish to reduce this error to E0/K, the number of learning steps is O(log log K): subject to benign conditions, the learner jumps to actions close to the ideal action very quickly. \n\nA common objection to learning the inverse model is that it may be ill-defined. For a memory-based method the problems are particularly serious because of its update rule. It updates the inverse model near bactual and therefore in those cases in which bgoal and bactual differ greatly, the mapping near bgoal may not change. As a result, subsequent cycles will make identical mistakes. [10] discusses this further. \n\n3 A MEMORY-BASED FORWARD MODEL \n\nOne fix for the problem of inverses becoming stuck is the addition of random noise to actions prior to their application. 
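A toy sketch makes this noise fix concrete (the plant, the seed and all parameters are illustrative, not from the paper). Here a pure nearest-neighbour inverse would repeat a = 0.2 forever, since behaviours near bgoal = 1.0 are never observed and so the mapping near bgoal never updates; perturbing each chosen action lets the memory eventually cover behaviours near the goal:

```python
import random

def plant(s, a):
    """Toy linear plant b = s + a, unknown to the learner."""
    return s + a

MEM = [(0.0, 0.2, 0.2)]                      # (s, b_actual, a) experiences

def inverse_model(s, b_goal, mem):
    """Nearest-neighbour inverse: action of the closest stored (s, b) pair."""
    return min(mem, key=lambda e: (e[0] - s) ** 2 + (e[1] - b_goal) ** 2)[2]

random.seed(0)
for _ in range(20):
    a = inverse_model(0.0, 1.0, MEM)         # a pure lookup would repeat 0.2
    a += random.gauss(0.0, 0.3)              # exploration noise breaks the loop
    MEM.append((0.0, plant(0.0, a), a))      # update near b_actual

best = min(abs(b - 1.0) for _, b, _ in MEM)  # noise eventually nears the goal
```

Without the `random.gauss` line, every cycle stores another copy of the same experience and `best` stays at the initial error of 0.8.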
However, this can result in a large proportion of control cycles being wasted on experiments which the robot should have been able to predict as valueless, defeating the initial aim of learning as quickly as possible. An alternative technique using multilayer neural nets has been to learn a forward model, which is necessarily well defined, to train a partial inverse. Updates to the forward model are obtained by standard supervised training, but updates to the inverse model are more sophisticated. The local Jacobian of the forward model is obtained and this value is used to drive an incremental change to the inverse model [8]. In conjunction with memory-based methods such an approach has the disadvantage that incremental changes to the inverse model lose the one-shot learning behaviour, and introduce the danger of becoming trapped in a local minimum. \n\nInstead, this investigation relies only on learning the forward model. The inverse model is then implicitly obtained from it by online numerical inversion instead of direct lookup. This is illustrated by the following algorithm: \n\n1. Observe s and bgoal. \n2. Perform numerical inversion: search among a series of candidate actions a1, a2, ..., ak: \n   b1predict := forward-model(s, a1, MEM) \n   b2predict := forward-model(s, a2, MEM) \n   ... \n   bkpredict := forward-model(s, ak, MEM) \n   until TIME-OUT or bkpredict = bgoal. \n3. If TIME-OUT then perform experimental action else perform ak. \n4. Update MEM with (s, ak -> bactual). \n\nA nice feature of this method is the absence of a preliminary training phase such as random flailing or feedback control. A variety of search techniques for numerical inversion can be applied. 
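For instance, the simplest such search, exhaustive enumeration of candidate actions through a learned forward model, can be sketched as follows. The kernel-regression forward model and the toy plant b = 2a(1 - a) are illustrative assumptions, not the paper's implementation:

```python
import math

# Hypothetical experiences ((s, a), b) sampled from an unknown plant.
MEM = [((0.0, a), 2.0 * a * (1.0 - a)) for a in (0.0, 0.25, 0.5, 0.75, 1.0)]

def forward_model(s, a, mem, k_width=0.3):
    """Kernel-regression forward model over the stored ((s, a), b) pairs."""
    ws = [math.exp(-((si - s) ** 2 + (ai - a) ** 2) / k_width ** 2)
          for (si, ai), _ in mem]
    return sum(w * b for w, (_, b) in zip(ws, mem)) / sum(ws)

def invert(s, b_goal, mem, candidates):
    """Numerical inversion: try each candidate action through the forward
    model and return the one whose predicted behaviour is closest to bgoal."""
    return min(candidates,
               key=lambda a: (forward_model(s, a, mem) - b_goal) ** 2)

a_best = invert(0.0, 0.5, MEM, [i / 50.0 for i in range(51)])
```

Note that the well-defined forward model is queried many times per cycle, while the ill-defined inverse is never represented explicitly.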
Global random search avoids local minima but is very slow for obtaining accurate actions; hill climbing is a robust local procedure; and more aggressive procedures such as Newton's method can use partial derivative estimates from the forward model to make large second-order steps. The implementation used for subsequent results had a combination of global search and local hill climbing. \n\nIn very high speed applications in which there is only time to make a small number of forward model predictions, it is not difficult to regain much of the speed advantage of directly using an inverse model by commencing the action search with a0 as the action predicted by a learned inverse model. \n\n4 OTHER CONSIDERATIONS \n\nActions selected by a forward memory-based learner can be expected to converge very quickly to the correct action in benign cases, and will not become stuck in difficult cases, provided that the memory-based representation can fit the true forward model. This proviso is weak compared with incremental learning control techniques, which typically require stronger prior assumptions about the environment, such as near-linearity, or that an iterative function approximation procedure will avoid local minima. One-shot methods have an advantage in terms of the number of control cycles before adequate performance, whereas incremental methods have the advantage of only requiring trivial amounts of computation per cycle. However, the simple memory-based formalism described so far suffers from two major problems which some forms of adaptive and neural controllers may avoid: \n\n\u2022 Brittle behaviour in the presence of outliers. \n\u2022 Poor resistance to non-stationary environments. \n\nMany incremental methods implicitly forget all experiences beyond a certain horizon. 
For example, in the delta rule \u0394wij = \u03bd(Yjactual - Yjpredict)Xj, the age beyond which experiences have a negligible effect is determined by the learning rate \u03bd. As a result, the detrimental effect of misleading experiences is present for only a fixed amount of time and then fades away1. In contrast, memory-based methods remember everything for ever. Fortunately, two statistical techniques, robust regression and cross-validation, allow extensions to the numerical inversion method in which we can have our cake and eat it too. \n\n5 USING ROBUST REGRESSION \n\nWe can judge the quality of each experience (Xi, Yi) \u2208 Mem by how well it is predicted by the rest of the experiences. A simple measure of the ith error is the cross validation error, in which the experience is first removed from the memory before prediction: ecve_i = | Yi - Predict(Xi, Mem - {(Xi, Yi)}) |. With the memory-based formalism, in which all work takes place at prediction time, it is no more expensive to predict a value with one datapoint removed than with it included. Once we have the measure ecve_i of the quality of each experience, we can decide if it is worth keeping. Robust statistics [7] offers a wide range of methods: this implementation uses the Median Absolute Deviation (MAD) procedure. \n\n6 FULL CROSS VALIDATION \n\nThe value ecve_total = \u03a3 ecve_i, summed over all \"good\" experiences, provides a measure of how well the current representation fits the data. By optimizing this value with respect to internal learner parameters, such as the width Kwidth of the local weighting function used by kernel regression and LWR, the internal parameters can be found automatically. Another important set of parameters that can be optimized is the relative scaling of each input variable: an example of this procedure applied to a two-joint arm task may be found in Reference [2]. 
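The cross validation errors of Section 5 and the MAD outlier test can be sketched as follows, here with the nearest-neighbour predictor of Section 1 and an illustrative dataset containing one gross outlier (none of this data is from the paper):

```python
import statistics

def nn_predict(x, mem):
    """1-nearest-neighbour predictor, as in Section 1."""
    return min(mem, key=lambda p: (p[0] - x) ** 2)[1]

def loo_errors(mem, predict):
    """Cross validation error of each experience: how badly the rest of the
    memory predicts it when it is held out."""
    return [abs(y - predict(x, mem[:i] + mem[i + 1:]))
            for i, (x, y) in enumerate(mem)]

def mad_filter(mem, predict, threshold=3.0):
    """Discard experiences whose CV error deviates from the median error by
    more than `threshold` robust standard deviations (1.4826 * MAD)."""
    errs = loo_errors(mem, predict)
    med = statistics.median(errs)
    mad = 1.4826 * statistics.median([abs(e - med) for e in errs])
    return [exp for exp, e in zip(mem, errs)
            if mad == 0.0 or abs(e - med) <= threshold * mad]

Mem = [(0.0, 0.0), (0.1, 0.12), (1.0, 1.0), (1.1, 1.15),
       (2.0, 2.0), (2.1, 2.18), (3.0, 50.0)]   # last experience is an outlier
clean = mad_filter(Mem, nn_predict)
```

Because each held-out prediction is just another memory query, the whole filter costs one extra prediction per stored experience.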
A useful feature of this procedure is its quick discovery (and subsequent ignoring) of irrelevant input variables. \n\nCross-validation can also be used to selectively forget old inaccurate experiences caused by a slowly drifting or suddenly changing environment. We have already seen that adaptive control algorithms such as the LMS rule can avoid such problems because the effects of experiences decay with time. Memory-based methods can also forget things according to a forgetfulness parameter: all observations are weighted by not only the distance to Xquery but also by their age: \n\nWi = exp( -(Xi - Xquery)2 / Kwidth2 - (n - i) / Krecall )     (1) \n\nwhere we assume the ordering of the experiences' indices i is temporal, with experience n the most recent. We find the Krecall that minimizes the recent weighted average cross validation error \u03a3 (i = 0..n) ecve_i exp(-(n - i)/\u03b3), where \u03b3 is a human-assigned 'meta-forgetfulness' constant, reflecting how many experiences the learner would need in order to benefit from observation of an environmental change. It should be noted that \u03b3 is a substantially less task-dependent prescription of how far back to forget than would be a human-specified Krecall. Some initial tests of this technique are included among the experiments of Section 8. \n\n1This also has disadvantages: persistence of excitation is required and multiple tasks can often require relearning if they have not been practised recently. \n\nArchitecture selection is another use of cross validation. Given a family of learners, the member with the least cross validation error is used for subsequent predictions. \n\n7 COMPUTATIONAL CONSIDERATIONS \n\nUnless the real time between control cycles is longer than a few seconds, cross validation is too expensive to perform after every cycle. 
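Equation (1) and the selection of Krecall by the \u03b3-weighted cross validation error can be sketched as follows. The experience history, in which the environment flips sign halfway through, and all parameter values are illustrative, not the paper's implementation:

```python
import math

def recency_weight(x_i, i, x_query, n, k_width, k_recall):
    """Equation (1): weight decays with both distance to the query and age,
    with index i temporal and experience n the most recent."""
    return math.exp(-(x_i - x_query) ** 2 / k_width ** 2 - (n - i) / k_recall)

def weighted_predict(x_query, mem, k_width, k_recall):
    """Recency-weighted kernel regression over (x, y) experiences."""
    n = len(mem) - 1
    ws = [recency_weight(x, i, x_query, n, k_width, k_recall)
          for i, (x, _) in enumerate(mem)]
    return sum(w * y for w, (_, y) in zip(ws, mem)) / sum(ws)

def choose_k_recall(mem, k_width, gamma, candidates):
    """Pick the Krecall minimizing the gamma-weighted average of the
    leave-one-out CV errors, so recent experiences dominate the score."""
    n = len(mem) - 1
    def score(k_recall):
        errs = [abs(y - weighted_predict(x, mem[:i] + mem[i + 1:],
                                         k_width, k_recall))
                for i, (x, y) in enumerate(mem)]
        return sum(e * math.exp(-(n - i) / gamma) for i, e in enumerate(errs))
    return min(candidates, key=score)

# The mapping was y = x for the three oldest experiences, then became y = -x.
mem = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0),
       (0.5, -0.5), (1.5, -1.5), (2.5, -2.5)]
k_recall = choose_k_recall(mem, 1.0, 2.0, [0.5, 100.0])
```

After the change in the environment, the short recall horizon predicts the recent experiences far better, so the search prefers it over the effectively non-forgetful setting.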
Instead it can be performed as a separate parallel process, updating the best parameter values and removing outliers every few real control cycles. The usefulness of breaking a learning control task into an online real-time process and offline mental simulation was noted by [12]. Initially, the small number of experiences means that cross validation optimizes the parameters very frequently, but the time between updates increases with the memory size. The decreasing frequency of cross validation updates is little cause for concern, because as time progresses, the estimated optimal parameter values are expected to become decreasingly variable. \n\nIf there is no time to make more than one memory-based query per cycle, then memory-based learning can nevertheless proceed by pushing even more of the computation into the offline component. If the offline process can identify meaningful states relevant to the task, then it can compute, for each of them, what the optimal action would be. The resulting state-action pairs are then used as a policy. The online process then need only look up the recommended action in the policy, apply it and then insert (s, a, b) into the memory. \n\n8 COMPARATIVE TESTS \n\nThe ultimate goal of the investigation is to produce a learning control algorithm which can learn to control a fairly wide family of different tasks. Some basic, very different, tasks have been used for the initial tests. \n\nThe HARD task, graphed in Figure 1, is a one-dimensional direct relationship between action and behaviour which is both non-monotonic and discontinuous. The VARIER task (Figure 2) is a sinusoidal relation for which the phase continuously drifts, and occasionally alters catastrophically. \n\nLINEAR is a noisy linear relation between 4-d states, 4-d actions and 4-d behaviours. For these first three tasks, the goal behaviour is selected randomly on each control cycle. 
ARM (Figure 3) is a simulated noisy dynamic two-joint arm acting under gravity in which state is perceived in Cartesian coordinates and actions are produced in joint-torque coordinates. Its task is to follow the circular trajectory. BILLIARDS is a simulation of the real billiards robot described shortly in which 5% of experiences are entirely random outliers. \n\n[Figure 1: The HARD relation (behaviour against action). Figure 2: The VARIER relation. Figure 3: The ARM task, with the circular goal trajectory shown.] \n\nThe following learning methods were tested: nearest neighbour, kernel regression and LWR, all searching the forward model and using a form of uncertainty-based intelligent experimentation [10] when the forward search proved inadequate. Another method under test was sole use of the inverse, learned by LWR. Finally a \"best-possible\" value was obtained by numerically inverting the real simulated forward model instead of a learned model. \n\nAll tasks were run for only 200 control cycles. In each case the quality of the learner was measured by the number of successful actions in the final hundred cycles, where \"successful\" was defined as producing behaviour within a small tolerance of bgoal. Results are displayed in Table 1. There is little space to discuss them in detail, but they generally support the arguments of the previous sections. The inverse model on its own was generally inferior to the forward method, even in those cases in which the inverse is well-defined. Outlier removal improved performance on the BILLIARDS task over non-robustified versions. Interestingly, outlier removal also greatly benefited the inverse-only method. 
The selectively forgetful methods performed better than their non-forgetful counterparts on the VARIER task, but in the stationary environments they did not pay a great penalty. Cross validation for Kwidth was useful: for the HARD task, LWR found a very small Kwidth but in the LINEAR task it unsurprisingly preferred an enormous Kwidth. \n\nSome experiments were also performed with a real billiards robot shown in Figure 4. Sensing is visual: one camera looks along the cue stick and the other looks down at the table. The cue stick swivels around the cue ball, which starts each shot at the same position. At the start of each attempt the object ball is placed at a random position in the half of the table opposite the cue stick. The camera above the table obtains the (x, y) image coordinates of the object ball, which constitute the state. The action is the x-coordinate of the image of the object ball on the cue stick camera. A motor swivels the cue stick until the centroid of the actual image of the object ball coincides with the chosen x-coordinate value. The shot is then performed and observed by the overhead camera. The behaviour is defined as the cushion and position on the cushion with which the object ball first collides. \n\nController type (K = use MAD outlier removal, X = use cross-validation for Kwidth, R = use cross-validation for Krecall, IF = obtain initial candidate action from the inverse model then search the forward model): \n\nController type | VARIER | HARD | LINEAR | ARM | BIL'DS \nBest Possible (numerically inverting the simulated world) | 100\u00b10 | 100\u00b10 | 75\u00b13 | 94\u00b11 | 82\u00b14 \nInverse only, learned with LWR | 15\u00b19 | 24\u00b111 | 7\u00b16 | 76\u00b128 | 71\u00b15 \nInverse only, learned with LWR, KRX | 48\u00b116 | 72\u00b18 | 70\u00b14 | 89\u00b14 | 70\u00b110 \nLWR: IF | 14\u00b110 | 11\u00b15 | 58\u00b14 | 83\u00b14 | 55\u00b112 \nLWR: IF X | 19\u00b19 | 72\u00b14 | 70\u00b14 | 89\u00b13 | 61\u00b19 \nLWR: IF KX | 22\u00b115 | 51\u00b127 | 73\u00b13 | 90\u00b13 | 75\u00b17 \nLWR: IF KRX | 54\u00b18 | 65\u00b128 | 70\u00b15 | 89\u00b12 | 69\u00b17 \nLWR: Forward only, KRX | 56\u00b19 | 53\u00b117 | 73\u00b11 | 89\u00b11 | 69\u00b17 \nKernel Regression: IF | 8\u00b12 | 6\u00b12 | 13\u00b13 | 3\u00b12 | 1\u00b11 \nKernel Regression: IF KRX | 15\u00b18 | 42\u00b121 | 14\u00b12 | 23\u00b110 | 30\u00b15 \nNearest Neighbour: IF | 22\u00b14 | 92\u00b12 | 0\u00b10 | 44\u00b16 | 10\u00b12 \nNearest Neighbour: IF K | 26\u00b110 | 69\u00b14 | 0\u00b10 | 40\u00b16 | 9\u00b13 \nNearest Neighbour: IF KR | 44\u00b18 | 68\u00b13 | 0\u00b10 | 40\u00b17 | 11\u00b13 \nNearest Neighbour: Forward only, KR | 43\u00b18 | 66\u00b15 | 0\u00b10 | 37\u00b13 | 8\u00b11 \nGlobal Linear Regression: IF | 23\u00b16 | 8\u00b13 | 74\u00b15 | 60\u00b117 | 7\u00b13 \nGlobal Linear Regression: IF KR | 21\u00b14 | 20\u00b113 | 73\u00b14 | 72\u00b13 | 9\u00b12 \nGlobal Quadratic Regression: IF | 40\u00b111 | 14\u00b17 | 64\u00b12 | 70\u00b122 | 5\u00b13 \n\nTable 1: Relative performance of a family of learners on a family of tasks. Each combination of learner and task was run ten times to provide the mean number of successes and standard deviation shown in the table. \n\nThe controller uses the memory-based learner to choose the action to maximize the probability that the ball will enter the nearer of the two pockets at the end of the table. A histogram of the number of successes against trial number is shown in Figure 5. In this experiment, the learner was LWR using outlier removal and cross validation for Kwidth. After 100 experiences, control choice running on a Sun-4 was taking 0.8 seconds2. Sinking the ball requires better than 1% accuracy in the choice of action, the world contains discontinuities and there are random outliers in the data, and so it is encouraging that within less than 100 experiences the robot had reached a 70% success rate, substantially better than the author can achieve. \n\nFigure 4: The billiards robot. In the foreground is the cue stick which attempts to sink balls in the far pockets. \n\nFigure 5: Frequency of successes versus control cycle (trial number, in batches of 10) for the billiards task. \n\nACKNOWLEDGEMENTS \n\nSome of the work discussed in this paper is being performed in collaboration with Chris Atkeson. The robot cue stick was designed and built by Wes Huang with help from Gerrit van Zyl. Dan Hill also helped considerably with the billiards robot. The author is supported by a Postdoctoral Fellowship from SERC/NATO. Support was provided under Air Force Office of Scientific Research grant AFOSR-89-0500 and a National Science Foundation Presidential Young Investigator Award to Christopher G. Atkeson. \n\n2This could have been greatly improved with more appropriate hardware or better software techniques such as kd-trees for structuring data [11, 9]. 
\n\nTrial number (batches of 10) \n\nReferences \n\n[1] C. G. Atkeson. Using Local Models to Control Movement. In Proceedings of Neural \n\nInformation Processing Systems Conference, November 1989. \n\n[2] C. G. Atkeson. Memory-Based Approaches to Approximating Continuous Functions. \n\nTechnical report, M. I. T. Artificial Intelligence Laboratory, 1990. \n\n[3] C. G. Atkeson and D. J. Reinkensmeyer. Using Associative Content-Addressable \nMemories to Control Robots. In Miller, Sutton, and Werbos, editors, Neural Networks \nfor Control. MIT Press, 1989. \n\n[4] S. D. Conte and C. De Boor. Elementary Numerical Analysis. McGraw Hill, 1980. \n[5] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley \n\n& Sons, 1973. \n\n[6] R. Franke. Scattered Data Interpolation: Tests of Some Methods. Mathematics of \n\nComputation, 38(157), January 1982. \n\n[7] F. Hampbell, P. Rousseeuw, E. Ronchetti, and W. Stahel. Robust Statistics. Wiley \n\nInternational, 1985. \n\n[8] M. 1. Jordan and D. E. Rumelhart. Forward Models: Supervised Learning with a \n\nDistal Teacher. Technical report, M. I. T., July 1990. \n\n[9] A. W. Moore. Efficient Memory-based Learning for Robot Control. PhD. Thesis; \nTechnical Report No. 209, Computer Laboratory, University of Cambridge, October \n1990. \n\n[10] A. W. Moore. Knowledge of Knowledge and Intelligent Experimentation for Learning \nControl. In Proceedings of the 1991 Seattle International Joint Conference on Neural \nNetworks, July 1991. \n\n[11] S. M. Omohundro. Efficient Algorithms with Neural Network Behaviour. Journal of \n\nComplex Systems, 1(2):273-347, 1987. \n\n[12] R. S. Sutton. Integrated Architecture for Learning, Planning, and Reacting Based \non Approximating Dynamic Programming. In Proceedings of the 7th International \nConference on Machine Learning. Morgan Kaufman, June 1990. \n\n\f", "award": [], "sourceid": 585, "authors": [{"given_name": "Andrew", "family_name": "Moore", "institution": null}]}