{"title": "Assessing the Quality of Learned Local Models", "book": "Advances in Neural Information Processing Systems", "page_first": 160, "page_last": 167, "abstract": null, "full_text": "Assessing the Quality of Learned Local Models \n\nStefan Schaal \n\nChristopher G. Atkeson \n\nDepartment of Brain and Cognitive Sciences & The Artifical Intelligence Laboratory \n\nMassachusetts Institute of Technology \n\n545 Technology Square, Cambridge, MA 02139 \n\nemail: sschaal@ai.mit.edu, cga@ai.mit.edu \n\nAbstract \n\nAn approach is presented to learning high dimensional functions in the case \nwhere the learning algorithm can affect the generation of new data. A local \nmodeling algorithm, locally weighted regression, is used to represent the learned \nfunction. Architectural parameters of the approach, such as distance metrics, are \nalso localized and become a function of the query point instead of being global. \nStatistical tests are given for when a local model is good enough and sampling \nshould be moved to a new area. Our methods explicitly deal with the case where \nprediction accuracy requirements exist during exploration: By gradually shifting \na \"center of exploration\" and controlling the speed of the shift with local pre(cid:173)\ndiction accuracy, a goal-directed exploration of state space takes place along the \nfringes of the current data support until the task goal is achieved. We illustrate \nthis approach with simulation results and results from a real robot learning a \ncomplex juggling task. \n\nINTRODUCTION \n\n1 \nEvery learning algorithm faces the problem of sparse data if the task to be learned is suf(cid:173)\nficiently nonlinear and high dimensional. Generalization from a limited number of data \npoints in such spaces will usually be strongly biased. If, however, the learning algorithm \nhas the ability to affect the creation of new experiences, the need for such bias can be re(cid:173)\nduced. 
This raises the questions of (1) how to sample data most efficiently, and (2) how to assess the quality of the sampled data with respect to the task to be learned. To address these questions, we represent the task to be learned with local linear models. Instead of constraining the number of linear models as in other approaches, infinitely many local models are permitted. This corresponds to modeling the task with the help of (hyper-) tangent planes at every query point instead of representing it in a piecewise linear fashion. The algorithm applied for this purpose, locally weighted regression (LWR), stems from nonparametric regression analysis (Cleveland, 1979, Müller, 1988, Härdle, 1991, Hastie&Tibshirani, 1991). In Section 2, we will briefly outline LWR. Section 3 discusses several statistical tools for assessing the quality of a learned linear LWR model, how to optimize the architectural parameters of LWR, and also how to detect outliers in the data. In contrast to previous work, all of these statistical methods are local, i.e., they depend on the data in the proximity of the current query point and not on all the sampled data. A simple exploration algorithm, the shifting setpoint algorithm (SSA), is used in Section 4 to demonstrate how the properties of LWR can be exploited for learning control. The SSA explicitly controls prediction accuracy during learning and samples data with the help of optimal control techniques. Simulation results illustrate that this method works well in high dimensional spaces. As a final example, the methods are applied to a real robot learning a complex juggling task in Section 5. 
\n\n2 LOCALLY WEIGHTED REGRESSION \n\nLocally linear models constitute a good compromise between locally constant models such as nearest neighbors or moving average and locally higher order models; the former tend to introduce too much bias, while the latter require fitting many parameters, which is computationally expensive and needs a lot of data. The algorithm which we explore here, locally weighted regression (LWR) (Atkeson, 1992, Moore, 1991, Schaal&Atkeson, 1994), is closely related to versions suggested by Cleveland et al. (1979, 1988) and Farmer&Sidorowich (1987). A LWR model is trained by simply storing every experience as an input/output pair in memory. If an output y_q is to be generated from a given input x_q, it is computed by fitting a (hyper-) tangent plane at x_q by means of weighted regression: \n\nbeta(x_q) = (X^T W^T W X)^{-1} X^T W^T W y (1) \n\nwhere X is an m x (n+1) matrix of inputs to the regression, y the vector of corresponding outputs, beta(x_q) the vector of regression parameters, and W the diagonal m x m matrix of weights. The requested y_q results from evaluating the tangent plane at x_q, i.e., y_q = x_q^T beta. The elements of W give points which are close to the current query point x_q a larger influence than those which are far away. They are determined by a Gaussian kernel: \n\nw_i(x_q) = exp(-(x_i - x_q)^T D(x_q) (x_i - x_q) / (2 k(x_q)^2)) (2) \n\nwhere w_i is the weight for the i-th data point (x_i, y_i) in memory given query point x_q. The matrix D(x_q) weights the contribution of the individual input dimensions, and the factor k(x_q) determines how local the regression will be. D and k are architectural parameters of LWR and can be adjusted to optimize the fit of the local model. In the following we will just focus on optimizing k, assuming that D normalizes the inputs and needs no further adjustment; note that, with some additional complexity, our methods would also hold for locally tuning D. 
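As a concrete illustration, the LWR lookup of Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is a minimal sketch only (brute force over all stored points, with D and k passed in explicitly); it is not the implementation used in the experiments, and the function name and signature are illustrative.

```python
import numpy as np

def lwr_predict(X, y, x_q, D, k):
    # Locally weighted regression lookup at query point x_q.
    # X: (m, n) stored inputs, y: (m,) stored outputs,
    # D: (n, n) distance metric, k: bandwidth.
    diff = X - x_q
    # Gaussian kernel weights of Eq. (2)
    w = np.exp(-np.einsum('ij,jk,ik->i', diff, D, diff) / (2.0 * k ** 2))
    # augment inputs with a constant 1 for the offset of the tangent plane
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    WX = Xa * w[:, None]                     # rows of X scaled by weights (WX)
    # weighted least squares of Eq. (1): beta = (X^T W^T W X)^-1 X^T W^T W y
    beta = np.linalg.solve(WX.T @ WX, WX.T @ (w * y))
    # evaluate the tangent plane at the query point: y_q = x_q^T beta
    return np.append(x_q, 1.0) @ beta
```

For exactly linear data the local fit recovers the underlying plane regardless of the bandwidth, which is a convenient sanity check for an implementation.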
\n\n3 ASSESSING THE LOCAL FIT \n\nIn order to measure the goodness of the local model, several tests have been suggested. The most widely accepted one is leave-one-out cross validation (CV), which calculates the prediction error of every point in memory after recalculating (1) without this point (Wahba&Wold, 1975, Maron&Moore, 1994). As an alternative measure, Cleveland et al. (1988) suggested Mallows' C_p-test, originally developed as a way to select covariates in linear regression analysis (Mallows, 1966). Hastie&Tibshirani (1991) showed that CV and the C_p-test are closely related for certain classes of analyses. Hastie&Tibshirani (1991) also presented pointwise standard-error bands to assess the confidence in a fitted value, which correspond to confidence bands in the case of an unbiased fit. All these tests are essentially global by requiring statistical analysis over the entire range of data in memory. Such a global analysis is computationally costly, and it may also not give an adequate measure at the current query site x_q: the behavior of the function to be approximated may differ significantly in different places, and an averaging over all these behaviors is unlikely to be representative for all query sites (Fan&Gijbels, 1992). \n\nIt is possible to convert some of the above measures to be local. Global cross validation has a relative in linear regression analysis, the PRESS residual error (e.g., Myers, 1990), here formulated as a mean squared local cross validation error: \n\nMSE_cross(x_q) = (1/n') sum_{i=1}^{n} [ w_i (y_i - x_i^T beta) / (1 - w_i x_i^T (X^T W^T W X)^{-1} x_i w_i) ]^2, with n' = sum_{i=1}^{n} w_i^2, p' = n' p / n (3) \n\nwhere n is the number of data points in memory contributing with a weight w_i greater than some small constant (e.g., w_i > 0.01) to the regression, and p is the dimensionality of beta. The PRESS statistic performs leave-one-out cross validation computationally very efficiently by not requiring the recalculation of beta (Eq. (1)) for every excluded point. 
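A minimal NumPy sketch of this local cross validation measure follows. The effective counts are an assumption here (n' taken as the sum of squared weights, and the leverage term w_i x_i^T (X^T W^T W X)^{-1} x_i w_i as in the PRESS denominator); the sketch shows the structure of the computation, not the exact implementation.

```python
import numpy as np

def local_mse_cross(X, y, x_q, D, k):
    # Mean squared local leave-one-out (PRESS) error at query point x_q.
    diff = X - x_q
    w = np.exp(-np.einsum('ij,jk,ik->i', diff, D, diff) / (2.0 * k ** 2))
    Xa = np.hstack([X, np.ones((len(X), 1))])
    WX = Xa * w[:, None]
    P = np.linalg.inv(WX.T @ WX)              # (X^T W^T W X)^-1
    beta = P @ (WX.T @ (w * y))
    # leverage of each weighted point: w_i x_i^T P x_i w_i
    lev = np.einsum('ij,jk,ik->i', WX, P, WX)
    # PRESS residuals: weighted residual divided by (1 - leverage)
    press = w * (y - Xa @ beta) / (1.0 - lev)
    return np.sum(press ** 2) / np.sum(w ** 2)  # n' = sum of squared weights
```

On noiseless data from a linear function this error is numerically zero, and it grows with the local noise level, which is what makes it usable as a fitting criterion.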
\n\nAnalogously, prediction intervals from linear regression analysis (e.g., Myers, 1990) can be transformed to be a local measure too: \n\ny_q = x_q^T beta +/- t_{alpha/2, n'-p'} s sqrt(1 + x_q^T (X^T W^T W X)^{-1} x_q) (4) \n\nwhere s^2 is an estimate of the variance at x_q: \n\ns^2(x_q) = (X beta - y)^T W^T W (X beta - y) / (n' - p') (5) \n\nand t_{alpha/2, n'-p'} is Student's t-value of n'-p' degrees of freedom for a 100(1-alpha)% prediction bound. The direct interpretation of (4) as prediction bounds is only possible if y_q is an unbiased estimate, which is usually hard to determine. \n\nFinally, the PRESS statistic can also be used for local outlier detection. For this purpose it is reformulated as a standardized individual PRESS residual: \n\ne_{i,cross}(x_q) = w_i (y_i - x_i^T beta) / ( s sqrt(1 - w_i x_i^T (X^T W^T W X)^{-1} x_i w_i) ) (6) \n\nThis measure has zero mean and unit variance. If it exceeds a certain threshold for a point x_i, the point can be called an outlier. \n\nAn important ingredient in forming the measures (3)-(6) lies in the definition of n' and p' as given in (3). Imagine that the weighting function (2) is not Gaussian but rather a function that clips data points whose distance from the current query point exceeds a certain threshold, and that the remaining r data points all contribute with unit weight. This reduced data regression coincides correctly with an r-data regression since n' = r. In the case of the soft weighting (2), the definition of n' ensures the proper definition of the moments of the data. However, the definition of p', i.e., the degrees of freedom of the regression, is somewhat arbitrary since it is unclear how many degrees of freedom have actually been used. Defining p' as in (3) guarantees that p' < n' and renders all results more pessimistic when only a small number of data points contribute to the regression. 
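The outlier test of Eq. (6) can be sketched the same way. The variance estimate follows Eq. (5); the exact forms of n' and p' used here are assumptions chosen so that p' < n', and the function name and defaults are illustrative.

```python
import numpy as np

def local_outliers(X, y, x_q, D, k, threshold=2.5, w_min=0.01):
    # Flag points whose standardized local PRESS residual (Eq. (6))
    # exceeds the given threshold at query point x_q.
    diff = X - x_q
    w = np.exp(-np.einsum('ij,jk,ik->i', diff, D, diff) / (2.0 * k ** 2))
    Xa = np.hstack([X, np.ones((len(X), 1))])
    WX = Xa * w[:, None]
    P = np.linalg.inv(WX.T @ WX)
    beta = P @ (WX.T @ (w * y))
    res = y - Xa @ beta
    n_eff = np.sum(w ** 2)                                   # n' (assumed form)
    p_eff = n_eff * Xa.shape[1] / max(np.sum(w > w_min), 1)  # p' (assumed form)
    s = np.sqrt(np.sum((w * res) ** 2) / (n_eff - p_eff))    # Eq. (5)
    lev = np.einsum('ij,jk,ik->i', WX, P, WX)
    e_std = w * res / (s * np.sqrt(1.0 - lev))  # standardized PRESS residual
    return np.flatnonzero(np.abs(e_std) > threshold)
```

Points far from the query receive near-zero weight and therefore cannot be flagged, which is exactly the intended local behavior of the test.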
\n\nThe statistical tests (3) and (4) can not only be used as a diagnostic tool, but they can also serve to optimize the architectural parameters of LWR. This results in a function fitting technique which is called supersmoothing in statistics (Hastie&Tibshirani, 1991). Fan&Gijbels (1992) investigated a method for this purpose that required estimation of the second derivative of the function to be approximated and of the data density distribution. These two measures are not trivially obtained in high dimensions, and we would like to avoid using them. Figure 1 shows fits of noisy data from the function y = x - sin^3(2 pi x^3) cos(2 pi x^3) exp(x^4) with 95% prediction intervals around the fitted values. In Figure 1a, global leave-one-out cross validation was applied to optimize k (cf. Eq. (2)). In the left part of the graph the fit starts to follow noise. Such behavior is to be expected since the global optimization of k also took into account the quickly changing regions on the right side of the graph and thus chose a rather small k. In Figure 1b, minimization of the local leave-one-out cross validation error was applied to fit the data, and in Figure 1c, prediction intervals were minimized. These two fits cope nicely with both the high frequency and the low frequency regions of the data and recover the true function rather well. The extrapolation properties of local cross validation are the most appropriate given that we know the true function. \n\nFigure 1: Optimizing the LWR fit using: (a) global cross validation; (b) local cross validation; (c) local prediction intervals. \n\nInterestingly, at the right end of Figure 1c, the minimization of the prediction intervals suddenly detects that global regression has a lower prediction interval than local regression and jumps into the global mode by making k rather large. In both local methods there is always a competition between local and global regression. But sudden jumps take place only when the prediction interval is so large that the data is not trustworthy anyway. \n\nTo some extent, the statistical tests (3)-(6) implicitly measure the data density at the current query point and are thus sensitive towards little data support, characterized by a small n'. This property is desirable for a diagnostic tool, particularly if the data sampling process can be directed towards such regions. However, if a fixed data set is to be analyzed which has rather sparse and noisy data in several regions, a fit of the data with local optimization methods may result in too jagged an approximation since the local fitting mistakes the noise in such regions for a high frequency portion of the data. Global methods avoid this effect by biasing the function fitting in such unfavorable areas with knowledge from other data regions and will produce better results if this bias is appropriate. \n\n4 THE SHIFTING SETPOINT EXPLORATION ALGORITHM \n\nIn this section we want to give an example of how LWR and its statistical tools can be used for goal directed data sampling in learning control. 
If the task to be learned is high dimensional, it is not possible to leave data collection to random exploration; on the one hand this would take too much time, and on the other hand it may cause the system to enter unsafe or costly regions of operation. We want to develop an exploration algorithm which explicitly avoids such problems. The shifting setpoint algorithm (SSA) attempts to decompose the control problem into two separate control tasks on different time scales. At the fast time scale, it acts as a nonlinear regulator by trying to keep the controlled system at some chosen setpoints in order to increase the data density at these setpoints. On a slower time scale, the setpoints are shifted by controlling local prediction accuracy to accomplish a desired goal. In this way the SSA builds a narrow tube of data support in which it knows the world. This data can be used by more sophisticated control algorithms for planning or further exploration. \n\nThe algorithm is graphically illustrated in the example of a mountain car in Figure 2. The task of the car is to drive at a given constant horizontal speed x_desired from the left to the right of Figure 2a. x_desired need not be met precisely; the car should also minimize its fuel consumption. Initially, the car knows nothing about the world and cannot look ahead, but it has noisy feedback of its position and velocity. Commands, which correspond to the thrust F of the motor, can be generated at 5 Hz. The mountain car starts at its start point with one arbitrary initial action for the first time step; then it brakes and starts all over again, assuming the system can be reset somehow. The discrete one step dynamics of the car are modeled by an LWR forward model: \n\nx_next = f(x_current, F), where x = (x, xdot)^T (7) \n\nAfter a few trials, the SSA searches the data in memory for the point (x_current^T, F, x_next^T)_best whose outcome x_next can be predicted with the smallest local prediction interval. This best point is declared the setpoint of this stage: \n\n(x_S,in^T, F_S, x_S,out^T)^T = (x_current^T, F, x_next^T)_best^T (8) \n\nand its local linear model results from a corresponding LWR lookup: \n\nx_S,out = f(x_S,in, F_S) approx A x_S,in + B F_S + C (9) \n\nBased on this linear model, an optimal LQ controller (e.g., Dyer&McReynolds, 1970) can be constructed. This results in a control law of the form: \n\nF = F_S - K (x_current - x_S,in) (10) \n\nAfter these calculations, the mountain car has learned one controlled action for the first time step. However, since the initial action was chosen arbitrarily, x_S,out will be significantly away from the desired speed x_desired. A reduction of this error is achieved as follows. First, the SSA repeats one step actions with the LQ controller until sufficient data is collected to reduce the prediction intervals of LWR lookups for (x_S,in, F_S) (Eq. (9)) below a certain threshold. Then it shifts the setpoint towards the goal according to the procedure: \n\n1) calculate the error of the predicted output state: err_S,out = x_desired - x_S,out; \n\n2) take the derivative of the error with respect to the command F_S from a LWR lookup for (x_S,in, F_S) (cf. (9)): \n\nd err_S,out / d F_S = (d err_S,out / d x_S,out) (d x_S,out / d F_S) = -(d x_S,out / d F_S) = -B \n\nand calculate a correction Delta F_S from solving -B Delta F_S = alpha err_S,out; alpha in [0,1] determines how much of the error should be compensated for in one step; \n\n3) update F_S: F_S = F_S - Delta F_S and calculate the new x_S,out with LWR (Eq. (9)); \n\n4) assess the fit for the updated setpoint with prediction intervals. 
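Steps 1) to 3) of this shifting procedure amount to a damped correction of the setpoint command through the learned local model. A minimal sketch follows (names are illustrative; the least-squares solve covers the case where B is not square):

```python
import numpy as np

def shift_setpoint(F_s, x_s_out, x_desired, B, alpha=0.5):
    # One setpoint shift: reduce the predicted output error through
    # the action derivative B of the local model x_out ~ A x_in + B F + C.
    err = x_desired - x_s_out                   # step 1: output state error
    # step 2: solve -B dF = alpha * err for the command correction dF
    dF, *_ = np.linalg.lstsq(-B, alpha * err, rcond=None)
    return F_s - dF                             # step 3: updated command F_S
```

With alpha = 1 and an exact local model the predicted output reaches the desired state in a single shift; smaller alpha trades speed for staying inside the region of data support.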
If the quality is above a certain threshold, continue with 1); otherwise terminate shifting. \n\nFigure 2: The mountain car: (a) landscape across which the car has to drive at constant velocity of 0.8 m/s, (b) contour plot of data density in phase space as generated by using multistage SSA, (c) contour plot of data density in position-action space, (d) 2-dimensional mountain car. \n\nIn this way, the output state of the setpoint shifts towards the goal until the data support falls below a threshold. Now the mountain car performs several new trials with the new setpoint and the correspondingly updated LQ controller. After the quality of fit statistics rise above a threshold, the setpoint can be shifted again. As soon as the first stage's setpoint reduces the error x_desired - x_S,out sufficiently, a new stage is created and the mountain car tries to move one step further in its world. The entire procedure is repeated for each new stage until the car knows how to move across the landscape. Figure 2b and Figure 2c show the thin band of data which the algorithm collected in state space and position-action space. These two pictures together form a narrow tube of knowledge in the input space of the forward model. \n\nFigure 3: Mean prediction error of local models (position error [m] and velocity error [m/s]). \n\nThe example of the mountain car can easily be scaled up to arbitrarily high dimensions by making the mountain a multivariate function. We tried versions up to a 5-dimensional mountain corresponding to a R^15 -> R^10 forward model; Figure 2d shows the 2-dimensional version. The results of learning had the same quality as in the 1D example. Figure 3 shows the prediction errors of the local models after learning for the 1D, 2D, ..., and 5D mountain car. To obtain these errors, 
the car was started at random positions within its data support from where it drove along the desired trajectory. The difference between the predicted next state and the actual outcome at each time step was averaged. Position errors stayed within 2-4 cm on the 10 m long landscape, and velocity errors within 0.02-0.05 m/s. The dimensionality of the problem did not affect the outcome significantly. \n\n5 ROBOT JUGGLING \n\nTo test our algorithms in a real world experiment, we implemented them on a juggling robot. The juggling task to be performed, devil sticking, is illustrated in Figure 4a. For the robot, devil sticking was slightly simplified by attaching the devil stick to a boom, as illustrated in Figure 4b. The task state was encoded as a 5-dimensional state vector, taken at the moment when the devilstick hit one of the hand sticks; the throw action was parameterized as a 5-dimensional action vector. This resulted in a R^10 -> R^5 discrete forward model of the task. Initially the robot was given default actions for the left-hand and right-hand throws; the quality of these throws, however, was far away from achieving steady juggling. The robot started with no initial experiences and tried to build controllers to perform continuous juggling. The goal states for the SSA developed automatically from the requirement that the left hand had to learn to throw the devilstick to a place where the right hand had sufficient data support to control the devilstick, and vice versa. \n\nFigure 4: (a) Illustration of devil sticking, (b) a devil sticking robot, (c) learning curve of the robot. \n\nFigure 4c shows a typical learning curve for this task. It took about 40 trials before the left and the right hand learned to throw the devilstick such that both hands were able to cooperate. Then, performance quickly went up to long runs of up to 1200 consecutive hits. Humans usually need about one week of one hour of practice per day before they achieve decent juggling performance. In comparison to this, the learning algorithm performed very well. However, it has to be pointed out that the learned controllers were only local and could not cope with larger perturbations. A detailed description of this experiment can be found in Schaal&Atkeson (1994). \n\nCONCLUSIONS \n\nOne of the advantages of memory-based nonparametric learning methods lies in the least commitment strategy which is associated with them. Since all data is kept in memory, a lookup can be optimized with respect to the architectural parameters. Parametric approaches do not have this ability if they discard their training data; if they retain it, they essentially become memory-based. The origin of nonparametric modeling in traditional statistics provides many established statistical methods to inspect the quality of what has been learned by the system. Such statistics formed the backbone of the SSA exploration algorithm. So far we have only examined some of the most obvious statistical tools which directly relate to regression analysis. Many other methods from other statistical frameworks may be suitable as well and will be explored in our future work. \n\nAcknowledgements \n\nSupport was provided by the Air Force Office of Scientific Research, by the Siemens Corporation, the German Scholarship Foundation and the Alexander von Humboldt Foundation to Stefan Schaal, and a National Science Foundation Presidential Young Investigator Award to Christopher G. Atkeson. 
We thank Gideon Stein for implementing the first version of LWR on a DSP board, and Gerrie van Zyl for building the devil sticking robot and implementing the first version of learning of devil sticking. \n\nReferences \n\nAtkeson, C.G. (1992), \"Memory-Based Approaches to Approximating Continuous Functions\", in: Casdagli, M., Eubank, S. (eds.): Nonlinear Modeling and Forecasting, Redwood City, CA: Addison Wesley (1992). \n\nCleveland, W.S., Devlin, S.J., Grosse, E. (1988), \"Regression by Local Fitting: Methods, Properties, and Computational Algorithms\", Journal of Econometrics 37, 87-114, North-Holland (1988). \n\nCleveland, W.S. (1979), \"Robust Locally-Weighted Regression and Smoothing Scatterplots\", Journal of the American Statistical Association, no. 74, pp. 829-836 (1979). \n\nDyer, P., McReynolds, S.R. (1970), The Computation and Theory of Optimal Control, New York: Academic Press (1970). \n\nFan, J., Gijbels, I. (1992), \"Variable Bandwidth and Local Linear Regression Smoothers\", The Annals of Statistics, vol. 20, no. 4, pp. 2008-2036 (1992). \n\nFarmer, J.D., Sidorowich, J.J. (1987), \"Predicting Chaotic Dynamics\", in: Kelso, J.A.S., Mandell, A.J., Shlesinger, M.F. (eds.): Dynamic Patterns in Complex Systems, World Scientific Press (1987). \n\nHärdle, W. (1991), Smoothing Techniques with Implementation in S, New York, NY: Springer. \n\nHastie, T.J., Tibshirani, R.J. (1991), Generalized Additive Models, Chapman and Hall. \n\nMallows, C.L. (1966), \"Choosing a Subset Regression\", unpublished paper presented at the annual meeting of the American Statistical Association, Los Angeles (1966). \n\nMaron, O., Moore, A.W. (1994), \"Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation\", in: Cowan, J., Tesauro, G., Alspector, J. (eds.): Advances in Neural Information Processing Systems 6, Morgan Kaufmann (1994). \n\nMüller, H.-G. 
(1988), Nonparametric Regression Analysis of Longitudinal Data, Lecture Notes in Statistics Series, vol. 46, Berlin: Springer (1988). \n\nMyers, R.H. (1990), Classical and Modern Regression with Applications, PWS-KENT (1990). \n\nSchaal, S., Atkeson, C.G. (1994), \"Robot Juggling: An Implementation of Memory-based Learning\", to appear in: Control Systems Magazine, Feb. (1994). \n\nWahba, G., Wold, S. (1975), \"A Completely Automatic French Curve: Fitting Spline Functions by Cross-Validation\", Communications in Statistics, 4(1) (1975). \n", "award": [], "sourceid": 789, "authors": [{"given_name": "Stefan", "family_name": "Schaal", "institution": null}, {"given_name": "Christopher", "family_name": "Atkeson", "institution": null}]}