{"title": "Learning from Demonstration", "book": "Advances in Neural Information Processing Systems", "page_first": 1040, "page_last": 1046, "abstract": null, "full_text": "Learning From Demonstration \n\nsschaal @cc .gatech.edu; http://www.cc.gatech.edulfac/Stefan.Schaal \n\nStefan Schaal \n\nCollege of Computing, Georgia Tech, 801 Atlantic Drive, Atlanta, GA 30332-0280 \n\nATR Human Information Processing, 2-2 Hikaridai, Seiko-cho, Soraku-gun, 619-02 Kyoto \n\nAbstract \n\nBy now it is widely accepted that learning a task from scratch, i.e., without \nany prior knowledge, is a daunting undertaking. Humans, however, rarely at(cid:173)\ntempt to learn from scratch. They extract initial biases as well as strategies \nhow to approach a learning problem from instructions and/or demonstrations \nof other humans. For learning control, this paper investigates how learning \nfrom demonstration can be applied in the context of reinforcement learning. \nWe consider priming the Q-function, the value function, the policy, and the \nmodel of the task dynamics as possible areas where demonstrations can speed \nup learning. In general nonlinear learning problems, only model-based rein(cid:173)\nforcement learning shows significant speed-up after a demonstration, while in \nthe special case of linear quadratic regulator (LQR) problems, all methods \nprofit from the demonstration. In an implementation of pole balancing on a \ncomplex anthropomorphic robot arm, we demonstrate that, when facing the \ncomplexities of real signal processing, model-based reinforcement learning \noffers the most robustness for LQR problems. Using the suggested methods, \nthe robot learns pole balancing in just a single trial after a 30 second long \ndemonstration of the human instructor. \n\n1. INTRODUCTION \nInductive supervised learning methods have reached a high level of sophistication. 
Given a data set and some prior information about its nature, a host of algorithms exists that can extract structure from these data by minimizing an error criterion. In learning control, however, the learning task is often less well defined. Here, the goal is to learn a policy, i.e., the appropriate actions in response to a perceived state, in order to steer a dynamical system to accomplish a task. As the task is usually described in terms of optimizing an arbitrary performance index, no direct training data exist which could be used to learn a controller in a supervised way. Even worse, the performance index may be defined over the long-term behavior of the task, and a problem of temporal credit assignment arises: how to credit or blame actions in the past for the current performance. In such a setting, typical for reinforcement learning, learning a task from scratch can require a prohibitively time-consuming amount of exploration of the state-action space in order to find a good policy. On the other hand, learning without prior knowledge seems to be an approach that is rarely taken in human and animal learning. Knowledge of how to approach a new task can be transferred from previously learned tasks, and/or it can be extracted from the performance of a teacher. This opens the question of how learning control can profit from these kinds of information in order to accomplish a new task more quickly. In this paper we will focus on learning from demonstration. \n\nFigure 1: a) pendulum swing-up, b) cart-pole balancing. \na) ml^2 θ̈ = -μ θ̇ + mgl sin θ + τ,  θ ∈ [-π, π];  r(θ, τ) = (θ/π)^2 - (2/π)^2 log cos((π/2)(τ/τ_max));  m = l = 1, g = 9.81, μ = 0.05, τ_max = 5 Nm;  define x = (θ, θ̇)^T, u = τ. \nb) ml ẍ cos θ + ml^2 θ̈ - mgl sin θ = 0;  (m + m_c) ẍ + ml θ̈ cos θ - ml θ̇^2 sin θ = F;  define x = (x, ẋ, θ, θ̇)^T, u = F;  r(x, u) = x^T Q x + u^T R u;  l = 0.75 m, m = 0.15 kg, m_c = 1.0 kg, Q = diag(1.25, 1, 12, 0.25), R = 0.01. \n\nLearning from demonstration, also known as \"programming by demonstration\", \"imitation learning\", and \"teaching by showing\", received significant attention in automatic robot assembly over the last 20 years. The goal was to replace the time-consuming manual programming of a robot by an automatic programming process, solely driven by showing the robot the assembly task by an expert. In concert with the mainstream of Artificial Intelligence at the time, research was driven by symbolic approaches: the expert's demonstration was segmented into primitive assembly actions and spatial relationships between manipulator and environment, and subsequently submitted to symbolic reasoning processes (e.g., Lozano-Perez, 1982; Dufay & Latombe, 1984; Segre & DeJong, 1985). More recent approaches to programming by demonstration started to include more inductive learning components (e.g., Ikeuchi, 1993; Dillmann, Kaiser, & Ude, 1995). In the context of human skill learning, teaching by showing was investigated by Kawato, Gandolfo, Gomi, & Wada (1994) and Miyamoto et al. (1996) for a complex manipulation task to be learned by an anthropomorphic robot arm. An overview of several other projects can be found in Bakker & Kuniyoshi (1996). \nIn this paper, the focus lies on reinforcement learning and how learning from demonstration can be beneficial in this context. We divide reinforcement learning into two categories: reinforcement learning for nonlinear tasks (Section 2.1) and for (approximately) linear tasks (Section 2.2), and investigate how methods like Q-learning, value-function learning, and model-based reinforcement learning can profit from data from a demonstration.
In Section 2.3, one example task, pole balancing, is placed in the context of using an actual, anthropomorphic robot to learn it, and we reconsider the applicability of learning from demonstration in this more complex situation. \n\n2. REINFORCEMENT LEARNING FROM DEMONSTRATION \nTwo example tasks will be the basis of our investigation of learning from demonstration. The nonlinear task is the \"pendulum swing-up with limited torque\" (Atkeson, 1994; Doya, 1996), as shown in Figure 1a. The goal is to balance the pendulum in an upright position starting from hanging downward. As the maximal torque available is restricted such that the pendulum cannot be supported against gravity in all states, a \"pumping\" trajectory is necessary, similar to the mountain car example of Moore (1991), but more delicate in its timing, since building up too much momentum during pumping will overshoot the upright position. The (approximately) linear example, Figure 1b, is the well-known cart-pole balancing problem (Widrow & Smith, 1964; Barto, Sutton, & Anderson, 1983). For both tasks, the learner is given information about the one-step reward r (Figure 1), and both tasks are formulated as continuous state and continuous action problems. The goal of each task is to find a policy which minimizes the infinite horizon discounted reward: \n\nV(x(t)) = ∫_t^∞ e^(-(s-t)/τ) r(x(s), u(s)) ds   or   V(x(t)) = Σ_{i=t}^∞ γ^(i-t) r(x(i), u(i))   (1) \n\nwhere the left-hand equation is the continuous-time formulation, the right-hand equation is the corresponding discrete-time version, and x and u denote an n-dimensional state vector and an m-dimensional command vector, respectively. For the Swing-Up, we assume that a teacher provided us with 5 successful trials starting from different initial conditions. Each trial consists of a time series of data vectors (θ, θ̇, τ) sampled at 60 Hz. For the Cart-Pole, we have a 30 second demonstration of successful balancing, represented as a 60 Hz time series of data vectors (x, ẋ, θ, θ̇, F). How can these demonstrations be used to speed up reinforcement learning? \n\n2.1 THE NONLINEAR TASK: SWING-UP \nWe applied reinforcement learning based on learning a value function (V-function) (Dyer & McReynolds, 1970) for the Swing-Up task, as the alternative method, Q-learning (Watkins, 1989), has as yet received only very limited research for continuous state-action spaces. The V-function assigns a scalar value V(x(t)) to each state x such that the entire V-function fulfills the consistency equation: \n\nV(x(t)) = min_u(t) ( r(x(t), u(t)) + γ V(x(t+1)) )   (2) \n\nFor clarity, this equation is given for a discrete state-action system; the continuous formulation can be found, e.g., in Doya (1996). The optimal policy, u = π(x), chooses the action u in state x such that (2) is fulfilled. Note that this computation involves an optimization step that includes knowledge of the subsequent state x(t+1). Hence, it requires a model of the dynamics of the controlled system, x(t+1) = f(x(t), u(t)). From the viewpoint of learning from demonstration, V-function learning offers three candidates which can be primed from a demonstration: the value function V(x), the policy π(x), and the model f(x, u). \n\n2.1.1 V-Learning \nIn order to assess the benefits of a demonstration for the Swing-Up, we implemented V-learning as suggested in Doya's (1996) continuous TD (CTD) learning algorithm. The V-function and the dynamics model were incrementally learned by a nonlinear function approximator, Receptive Field Weighted Regression (RFWR) (Schaal & Atkeson, 1996). Differing from Doya's (1996) implementation, we used the optimal action suggested by CTD to learn a model of the policy π (an \"actor\" as in Barto et al. (1983)), again represented by RFWR. The following learning conditions were tested empirically: \na) Scratch: Trial by trial learning of value function V, model f, and actor π from scratch. \nb) Primed Actor: Initial training of π from the demonstration, then trial by trial learning. \nc) Primed Model: Initial training of f from the demonstration, then trial by trial learning. \nd) Primed Actor&Model: Priming of π and f as in b) and c), then trial by trial learning. \n\nFigure 2: Smoothed learning curves of the average of 10 learning trials for the learning conditions a) to d) (see text). Good performance is characterized by T_up > 45 s; below this value the system is usually able to swing up properly, but it does not know how to stop in the upright position. \n\nFigure 2 shows the results of learning the Swing-Up. Each trial lasted 60 seconds. The time T_up the pole spent in the interval θ ∈ [-π/2, π/2] during each trial was taken as the performance measure (Doya, 1996). Comparing conditions a) and c), the results demonstrate that learning the pole model from the demonstration did not speed up learning. This is not surprising, since learning the V-function is significantly more complicated than learning the model, such that the learning process is dominated by V-function learning. Interestingly, priming the actor from the demonstration had a significant effect on the initial performance (condition a) vs. b)).
The system knew right away how to pump up the pendulum, but, in order to learn how to balance the pendulum in the upright position, it finally took the same amount of time as learning from scratch. This behavior is due to the fact that, theoretically, the V-function can only be approximated correctly if the entire state-action space is explored densely. Only if the demonstration covered a large fraction of the entire state space would one expect that V-learning can profit from it. We also investigated using the demonstration to prime the V-function by itself or in combination with the other functions. The results were qualitatively the same as shown in Figure 2: if the policy was included in the priming, the learning traces were like b) and d), otherwise like a) and c). Again, this is not totally surprising. Approximating a V-function is not just supervised learning as for π and f; it requires an iterative procedure to ensure the validity of (2) and amounts to a complicated nonstationary function approximation process. Given the limited amount of data from the demonstration, it is generally very unlikely to approximate a good value function. \n\n2.1.2 Model-Based V-Learning \nIf learning a model f is required, one can make more powerful use of it. According to the certainty equivalence principle, f can substitute for the real world, and planning can be run in \"mental simulations\" instead of interactions with the real world. In reinforcement learning, this idea was originally pursued by Sutton's (1990) DYNA algorithms for discrete state-action spaces.
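Sutton's DYNA scheme, interleaving backups from real experience with "mental" backups on a learned model, can be sketched for a discrete toy problem. This is an illustrative sketch only: the five-state chain world, the constants, and all names below are invented here, and tabular Q-backups stand in for the continuous machinery used in this paper.

```python
import random

# Toy deterministic chain: states 0..4, action 0 = left, 1 = right;
# reaching state 4 ends the episode with reward 1 (illustrative only).
N, GAMMA, ALPHA, EPS = 5, 0.9, 0.5, 0.1
ACTIONS = (0, 1)
GOAL = N - 1

def step(s, a):
    s2 = max(0, min(GOAL, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0)

random.seed(0)
Q = {(s, a): 1.0 for s in range(N) for a in ACTIONS}  # optimistic init
model = {}                                            # (s, a) -> (s', r)

def backup(s, a, s2, r):
    target = r + (0.0 if s2 == GOAL else GAMMA * max(Q[(s2, b)] for b in ACTIONS))
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

for episode in range(30):
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS) if random.random() < EPS \
            else max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r = step(s, a)
        backup(s, a, s2, r)          # learn from real experience
        model[(s, a)] = (s2, r)      # remember the transition
        for _ in range(20):          # DYNA: extra "mental" backups on the model
            sm, am = random.choice(list(model))
            backup(sm, am, *model[(sm, am)])
        s = s2

greedy = [max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(GOAL)]
print(greedy)
```

The mental backups let the reward found at the goal propagate through the table far faster than real-experience backups alone would manage, which is the effect exploited below.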
\nHere we will explore to what extent a continuous version of DYNA, DYNA-CTD, can help in learning from demonstration. The only difference compared to CTD in Section 2.1.1 is that after every real trial, DYNA-CTD performs five \"mental trials\" in which the model of the dynamics acquired so far replaces the actual pole dynamics. Two learning conditions were explored: \na) Scratch: Trial by trial learning of V, model f, and policy π from scratch. \nb) Primed Model: Initial training of f from the demonstration, then trial by trial learning. \n\nFigure 3: Smoothed learning curves of the average of 10 learning trials for the learning conditions a) and b) (see text) of the Swing-Up problem using \"mental simulations\". See Figure 2 for explanations of how to interpret the graph. \n\nFigure 3 demonstrates that, in contrast to V-learning in the previous section, learning from demonstration can make a significant difference now: after the demonstration, it only takes about 2-3 trials to accomplish a good swing-up with stable balancing, indicated by T_up > 45 s. Note that learning from scratch is also significantly faster than in Figure 2. \n\n2.2 THE LINEAR TASK: CART-POLE BALANCING \nOne might argue that applying reinforcement learning from demonstration to the Swing-Up task is premature, since reinforcement learning with nonlinear function approximators has yet to obtain appropriate scientific understanding. Thus, in this section we turn to an easier task: the cart-pole balancer. The task is approximately linear if the pole is started in a close to upright position, and the problem has been well studied in the dynamic programming literature in the context of linear quadratic regulation (LQR) (Dyer & McReynolds, 1970).
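As background for the LQR discussion that follows: when the linear dynamics x(t+1) = A x(t) + B u(t) and the quadratic cost matrices are known, the optimal gain follows from iterating the discrete-time Riccati equation. A minimal sketch, using an illustrative double-integrator plant rather than the cart-pole model of Figure 1, and with the discount set to 1:

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=500):
    """Iterate the discrete-time Riccati equation; return the
    steady-state gain K of the policy u = -K x."""
    P = np.eye(A.shape[0])
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

# Illustrative plant: position/velocity double integrator, force command.
dt = 0.02
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Qc = np.diag([1.0, 0.1])
Rc = np.array([[0.01]])

K = lqr_gain(A, B, Qc, Rc)
rho = max(abs(np.linalg.eigvals(A - B @ K)))  # closed-loop spectral radius
print(rho < 1.0)
```

The methods below approximate exactly this solution, but without assuming that A and B (or the value function) are given in advance.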
\n\n2.2.1 Q-Learning \nIn contrast to V-learning, Q-learning (Watkins, 1989; Singh & Sutton, 1996) learns a more complicated value function, Q(x, u), which depends both on the state and the command. The analogue of the consistency equation (2) for Q-learning is: \n\nQ(x(t), u(t)) = r(x(t), u(t)) + γ min_u(t+1) Q(x(t+1), u(t+1))   (3) \n\nAt every state x, picking the action u which minimizes Q is the optimal action under the reward function (1). As an advantage, evaluating the Q-function to find the optimal policy does not require a model of the dynamical system f that is to be controlled; only the value of the one-step reward r is needed. For learning from demonstration, priming the Q-function and/or the policy are the two candidates to speed up learning. \nFor LQR problems, Bradtke (1993) suggested a Q-learning method that is ideally suited for learning from demonstration, based on extracting a policy. He observed that for LQR the Q-function is quadratic in the states and commands: \n\nQ(x, u) = [x^T, u^T] [H11 H12; H21 H22] [x^T, u^T]^T,  H11 = n×n, H22 = m×m, H12 = H21^T = n×m   (4) \n\nand that the (linear) policy, represented as a gain matrix K, can be extracted from (4) as: \n\nu_opt = -K x = -H22^(-1) H21 x   (5) \n\nConversely, given a stabilizing initial policy K_demo, the current Q-function can be approximated by a recursive least squares procedure, and it can be optimized by a policy iteration process with guaranteed convergence (Bradtke, 1993). (Gains shown in Figure 4: K_demo = [-0.59, -1.81, -18.71, -6.67]; K_final = [-5.76, -11.37, -83.05, -21.92].)
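Equation (5) can be verified numerically: for any symmetric positive-definite H partitioned as in (4), the command u = -H22^(-1) H21 x minimizes the quadratic form over u. A small sketch with a randomly generated H (purely illustrative, not gains from the experiments):

```python
import numpy as np

n, m = 4, 1
rng = np.random.default_rng(0)
M = rng.standard_normal((n + m, n + m))
H = M @ M.T + (n + m) * np.eye(n + m)   # symmetric positive definite
# Partition H as in eq. (4).
H11, H12 = H[:n, :n], H[:n, n:]
H21, H22 = H[n:, :n], H[n:, n:]

def Q_of(x, u):
    z = np.concatenate([x, u])
    return z @ H @ z

x = rng.standard_normal(n)
u_star = -np.linalg.solve(H22, H21 @ x)   # eq. (5) with K = H22^{-1} H21

# Brute-force check: perturbing u_star never decreases Q(x, u).
base = Q_of(x, u_star)
worse = all(Q_of(x, u_star + np.array([d])) >= base
            for d in (0.1, -0.1, 1.0, -1.0))
print(worse)
```

Setting the gradient of (4) with respect to u to zero gives 2 H22 u + 2 H21 x = 0, which is exactly (5); the check above confirms this without calculus.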
As a demonstration allows one to extract an initial policy K_demo by linearly regressing the observed command u against the corresponding observed states x, one-shot learning of pole balancing is achievable. As shown in Figure 4, after about 120 seconds (12 policy iteration steps), the policy is basically indistinguishable from the optimal policy. A caveat of this Q-learning, however, is that it cannot learn without a stabilizing initial policy. \n\nFigure 4: Typical learning curve of a noisy simulation of the cart-pole balancer using Q-learning. The graph shows the value of the one-step reward over time for the first learning trial. The pole is never dropped. \n\n2.2.2 Model-Based V-Learning \nLearning an LQR task by learning the V-function is one of the classic forms of dynamic programming (Dyer & McReynolds, 1970). Using a stabilizing initial policy K_demo, the current V-function can be approximated by recursive least squares in analogy with Bradtke (1993). Similarly as for K_demo, a (linear) model f_demo of the cart-pole dynamics can be extracted from a demonstration by linear regression of the cart-pole state x(t) vs. the previous state and command vector (x(t-1), u(t-1)), and the model can be refined with every new data point experienced during learning. The policy update becomes: \n\nK = γ (R + γ B^T H B)^(-1) B^T H A,  where V(x) = x^T H x,  f_demo = [A B],  A = n×n, B = n×m   (6) \n\nThus, a similar process as in Bradtke (1993) can be used to find the optimal policy K, and the system accomplishes one-shot learning, qualitatively indistinguishable from Figure 4. Again, as pointed out in Section 2.1.2, one can make more efficient use of the learned model by performing mental simulations.
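Both extraction steps just described are ordinary least squares on the demonstration time series: K_demo comes from regressing the observed commands u(t) against the states x(t), and f_demo = [A B] from regressing x(t) against (x(t-1), u(t-1)). A sketch on synthetic data, where the "true" plant, teacher gain, and noise levels are all fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, T = 4, 1, 2000

# Fabricated stable plant and a made-up "teacher" gain.
A_true = 0.7 * np.eye(n) + 0.05 * rng.standard_normal((n, n))
B_true = 0.1 * rng.standard_normal((n, m))
K_true = 0.2 * rng.standard_normal((m, n))

# Simulated noisy demonstration: u(t) = -K_true x(t) + exploration noise.
X = [rng.standard_normal(n)]
U = []
for _ in range(T):
    u = -K_true @ X[-1] + 0.1 * rng.standard_normal(m)
    U.append(u)
    X.append(A_true @ X[-1] + B_true @ u + 0.1 * rng.standard_normal(n))
X, U = np.array(X), np.array(U)

# K_demo: regress observed commands against observed states.
K_demo = -np.linalg.lstsq(X[:-1], U, rcond=None)[0].T

# f_demo = [A B]: regress x(t) against (x(t-1), u(t-1)).
Z = np.hstack([X[:-1], U])
AB = np.linalg.lstsq(Z, X[1:], rcond=None)[0].T
A_hat, B_hat = AB[:, :n], AB[:, n:]

print(np.allclose(K_demo, K_true, atol=0.05),
      np.allclose(A_hat, A_true, atol=0.05))
```

With enough excitation in the demonstration, both regressions recover the plant and the teacher's gain to within the noise level; with too little excitation, the regression matrices become singular and the extracted policy is unreliable, which is exactly the failure mode discussed for the real robot in Section 2.3.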
Given the model f_demo, the policy K can be calculated by off-line policy iteration from an initial estimate of H, e.g., taken to be the identity matrix (Dyer & McReynolds, 1970). Thus, no initial (stabilizing) policy is required, but rather an estimate of the task dynamics. This method also achieves one-shot learning. \n\n2.3 POLE BALANCING WITH AN ACTUAL ROBOT \nAs a result of the previous section, it seems that there are no real performance differences between V-learning, Q-learning, and model-based V-learning for LQR problems. To explore the usefulness of these methods in a more realistic framework, we implemented learning from demonstration of pole balancing on an anthropomorphic robot arm. The robot is equipped with 60 Hz video-based stereo vision. The pole is marked by two color blobs which can be tracked in real time. A 30 second long demonstration of pole balancing was provided by a human standing in front of the two robot cameras. \nThere are a few crucial differences in comparison with the simulations. First, as the demonstration is vision-based, only kinematic variables can be extracted from the demonstration. Second, visual signal processing has about 120 ms time delay. Third, a command given to the robot is not executed with very high accuracy due to unknown nonlinearities of the robot. And lastly, humans use internal state for pole balancing, i.e., their policy is partially based on non-observable variables. These issues have the following impact: \nKinematic Variables: In this implementation, the robot arm replaces the cart of the Cart-Pole problem. Since we have an estimate of the inverse dynamics and inverse kinematics of the arm, we can use the acceleration of the finger in Cartesian space as command input to the task.
The arm is also much heavier than the pole, which allows us to neglect the interaction forces the pole exerts on the arm. Thus, the pole balancing dynamics of Figure 1b can be reformulated as: \n\nu ml cos θ + ml^2 θ̈ - mgl sin θ = 0,  ẍ = u   (7) \n\nFigure 5: Sketch of the SARCOS anthropomorphic robot arm. \n\nAll variables in this equation can be extracted from a demonstration. We omit the 3D extension of these equations. \nDelayed Visual Information: There are two possibilities for dealing with delayed variables. Either the state of the system is augmented by delayed commands corresponding to the 7 × 1/60 s ≈ 120 ms delay time, x^T = (x, ẋ, θ, θ̇, u_{t-1}, u_{t-2}, ..., u_{t-7}), or a state-predictive controller is employed. The former method increases the complexity of a policy significantly, while the latter method requires a model f. \nInaccuracies of Command Execution: Given an acceleration command u, the robot will execute something close to u, but not u exactly. Thus, learning a function which includes u, e.g., the dynamics model (7), can be dangerous, since the mapping (x, ẋ, θ, θ̇, u) → (ẍ, θ̈) is contaminated by the nonlinear dynamics of the robot arm. Indeed, it turned out that we could not learn such a model reliably. This could be remedied by \"observing\" the command u, i.e., by extracting u = ẍ from visual feedback. \nInternal State in Demonstrated Policy: Investigations with human subjects have shown that humans use internal state in pole balancing. Thus, a policy cannot be observed as easily as claimed in Section 2.2: a regression analysis for extracting the policy of a teacher must find the appropriate time-alignment of observed current state and command(s) in the past. This can become a numerically involved process, as regressing a policy based on delayed commands is endangered by singular regression matrices. Consequently, it easily happens that one extracts a nonstabilizing policy from the demonstration, which prevents the application of Q-learning and V-learning as described in Section 2.2. \nAs a result of these considerations, the most trustworthy item to extract from a demonstration is the model of the pole dynamics. In our implementation it was used in two ways: for calculating the policy as in (6), and in state-predictive control with a Kalman filter to overcome the delays in visual information processing. The model was learned incrementally in real time by an implementation of RFWR (Schaal & Atkeson, 1996). Figure 6 shows the results of learning from scratch and learning from demonstration on the actual robot. Without a demonstration, it took about 10-20 trials before learning succeeded in reliable performance longer than one minute. With a 30 second long demonstration, learning was reliably accomplished in one single trial, using a large variety of physically different poles and using demonstrations from arbitrary people in the laboratory. \n\nFigure 6: Smoothed average of 10 learning curves of the robot for pole balancing (conditions: a) scratch, b) primed model). The trials were aborted after successful balancing of 60 seconds. We also tested the long-term performance of the learning system by running pole balancing for over an hour; the pole was never dropped. \n\n3. CONCLUSION \nWe discussed learning from demonstration in the context of reinforcement learning, focusing on Q-learning, value function learning, and model-based reinforcement learning. Q-learning and value function learning can theoretically profit from a demonstration by extracting a policy, by using the demonstration data to prime the Q/value function, or, in the case of value function learning, by extracting a predictive model of the world. Only in the special case of LQR problems, however, could we find a significant benefit of priming the learner from the demonstration. In contrast, model-based reinforcement learning was able to profit greatly from the demonstration by using the predictive model of the world for \"mental simulations\". In an implementation with an anthropomorphic robot arm, we illustrated that even in LQR problems, model-based reinforcement learning offers larger robustness towards the complexity of real learning systems than Q-learning and value function learning. Using a model-based strategy, our robot learned pole balancing from a demonstration in a single trial with great reliability. The important message of this work is that not every learning approach is equally suited to allow knowledge transfer and/or the incorporation of biases. This issue may serve as a critical additional constraint to evaluate artificial and biological models of learning. \n\nAcknowledgments \nSupport was provided by the ATR Human Information Processing Labs, the German Research Association, the Alexander v. Humboldt Foundation, and the German Scholarship Foundation. \n\nReferences \nAtkeson, C. G. (1994). \"Using local trajectory optimizers to speed up global optimization in dynamic programming.\" In: Moody, Hanson, & Lippmann (Eds.), Advances in Neural Information Processing Systems 6. Morgan Kaufmann. \nBakker, P., & Kuniyoshi, Y. (1996). \"Robot see, robot do: An overview of robot imitation.\" Autonomous Systems Section, Electrotechnical Laboratory, Tsukuba Science City, Japan. \nBarto, A. G., Sutton, R. S., & Anderson, C. W. (1983). \"Neuronlike adaptive elements that can solve difficult learning control problems.\" IEEE Transactions on Systems, Man, and Cybernetics, SMC-13, 5. \nBradtke, S. J. (1993). \"Reinforcement learning applied to linear quadratic regulation.\" In: Hanson, S. J., Cowan, J. D., & Giles, C. L. (Eds.), Advances in Neural Information Processing Systems 5, pp. 295-302. Morgan Kaufmann. \nDillmann, R., Kaiser, M., & Ude, A. (1995). \"Acquisition of elementary robot skills from human demonstration.\" In: International Symposium on Intelligent Robotic Systems (SIRS'95), Pisa, Italy. \nDoya, K. (1996). \"Temporal difference learning in continuous time and space.\" In: Touretzky, D. S., Mozer, M. C., & Hasselmo, M. E. (Eds.), Advances in Neural Information Processing Systems 8. MIT Press. \nDufay, B., & Latombe, J.-C. (1984). \"An approach to automatic robot programming based on inductive learning.\" In: Brady, M., & Paul, R. (Eds.), Robotics Research, pp. 97-115. Cambridge, MA: MIT Press. \nDyer, P., & McReynolds, S. R. (1970). The computation and theory of optimal control. New York: Academic Press. \nIkeuchi, K. (1993). \"Assembly plan from observation.\" School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. \nKawato, M., Gandolfo, F., Gomi, H., & Wada, Y. (1994). \"Teaching by showing in kendama based on optimization principle.\" In: Proceedings of the International Conference on Artificial Neural Networks (ICANN'94), 1, pp. 601-606. \nLozano-Perez, T. (1982). \"Task-planning.\" In: Brady, M., Hollerbach, J. M., Johnson, T. L., Lozano-Perez, T., & Mason, M. T. (Eds.), pp. 473-498. MIT Press. \nMiyamoto, H., Schaal, S., Gandolfo, F., Koike, Y., Osu, R., Nakano, E., Wada, Y., & Kawato, M. (in press). \"A Kendama learning robot based on bi-directional theory.\" Neural Networks. \nMoore, A. (1991). \"Fast, robust adaptive control by learning only forward models.\" In: Moody, J. E., Hanson, S. J., & Lippmann, R. P. (Eds.), Advances in Neural Information Processing Systems 4. Morgan Kaufmann. \nSchaal, S., & Atkeson, C. G. (1996). \"From isolation to cooperation: An alternative of a system of experts.\" In: Touretzky, D. S., Mozer, M. C., & Hasselmo, M. E. (Eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press. \nSegre, A. B., & DeJong, G. (1985). \"Explanation-based manipulator learning: Acquisition of planning ability through observation.\" In: Conference on Robotics and Automation, pp. 555-560. \nSingh, S. P., & Sutton, R. S. (1996). \"Reinforcement learning with eligibility traces.\" Machine Learning. \nSutton, R. S. (1990). \"Integrated architectures for learning, planning, and reacting based on approximating dynamic programming.\" In: Proceedings of the International Machine Learning Conference. \nWatkins, C. J. C. H. (1989). \"Learning with delayed rewards.\" Ph.D. thesis, Cambridge University (UK). \nWidrow, B., & Smith, F. W. (1964). \"Pattern recognizing control systems.\" In: 1963 Computer and Information Sciences (COINS) Symposium Proceedings, pp. 288-317. Washington: Spartan. \n\f", "award": [], "sourceid": 1224, "authors": [{"given_name": "Stefan", "family_name": "Schaal", "institution": null}]}