{"title": "Explanation-Based Neural Network Learning for Robot Control", "book": "Advances in Neural Information Processing Systems", "page_first": 287, "page_last": 294, "abstract": null, "full_text": "Explanation-Based Neural Network Learning \n\nfor Robot Control \n\nTom M. Mitchell \n\nSchool of Computer Science \nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nE-mail: mitchell@cs.cmu.edu \n\nSebastian B. Thrun \nUniversity of Bonn \n\nInstitut für Informatik III \n\nRömerstr. 164, D-5300 Bonn, Germany \n\nthrun@uran.informatik.uni-bonn.de \n\nAbstract \n\nHow can artificial neural nets generalize better from fewer examples? In order \nto generalize successfully, neural network learning methods typically require \nlarge training data sets. We introduce a neural network learning method that \ngeneralizes rationally from many fewer data points, relying instead on prior \nknowledge encoded in previously learned neural networks. For example, in robot \ncontrol learning tasks reported here, previously learned networks that model the \neffects of robot actions are used to guide subsequent learning of robot control \nfunctions. For each observed training example of the target function (e.g. the \nrobot control policy), the learner explains the observed example in terms of its \nprior knowledge, then analyzes this explanation to infer additional information \nabout the shape, or slope, of the target function. This shape knowledge is used \nto bias generalization when learning the target function. Results are presented \napplying this approach to a simulated robot task based on reinforcement learning. \n\n1 Introduction \n\nNeural network learning methods generalize from observed training data to new cases based \non an inductive bias that is similar to smoothly interpolating between observed training \npoints. 
Theoretical results [Valiant, 1984], [Baum and Haussler, 1989] on learnability, as \nwell as practical experience, show that such purely inductive methods require significantly \nlarger training data sets to learn functions of increasing complexity. This paper introduces \nexplanation-based neural network learning (EBNN), a method that generalizes successfully \nfrom fewer training examples, relying instead on prior knowledge encoded in previously \nlearned neural networks. \nEBNN is a neural network analogue to symbolic explanation-based learning methods (EBL) \n[DeJong and Mooney, 1986], [Mitchell et al., 1986]. Symbolic EBL methods generalize \nbased upon pre-specified domain knowledge represented by collections of symbolic rules. \n\n\fFor example, in the task of learning general rules for robot control, EBL can use prior \nknowledge about the effects of robot actions to analytically generalize from specific training \nexamples of successful control actions. This is achieved by a. observing a sequence of states \nand actions leading to some goal, b. explaining (i.e., post-facto predicting) the outcome \nof this sequence using the domain theory, then c. analyzing this explanation in order \nto determine which features of the initial state are relevant to achieving the goal of the \nsequence, and which are not. In previous approaches to EBL, the initial domain knowledge \nhas been represented symbolically, typically by propositional rules or Horn clauses, and has \ntypically been assumed to be complete and correct. \n\n2 EBNN: Integrating inductive and analytical learning \n\nEBNN extends explanation-based learning to cover situations in which prior knowledge \n(also called the domain theory) is approximate and is itself learned from scratch. In EBNN, \nthis domain theory is represented by real-valued neural networks. By using neural network \nrepresentations, 
it becomes possible to learn the domain theory using training algorithms \nsuch as the Backpropagation algorithm [Rumelhart et al., 1986]. In the robot domains \naddressed in this paper, such domain theory networks correspond to action models, i.e., \nnetworks that model the effect of actions on the state of the world, M: s × a → s' (here \na denotes an action, s a state, and s' the successor state). This domain theory is used by \nEBNN to bias the learning of the robot control function. Because the action models may be \nonly approximately correct, we require that EBNN be robust with respect to severe errors \nin the domain theory. \nThe remainder of this section describes the EBNN learning algorithm. Assume that the \nrobot agent's action space is discrete, and that its domain knowledge is represented by a \ncollection of pre-trained action models Mi: s → s', one for each discrete action i. The \nlearning task of the robot is to learn a policy for action selection that maximizes the reward, \ndenoted by R, which defines the task. More specifically, the agent has to learn an evaluation \nfunction Q(s, a), which measures the cumulative future expected reward when action a is \nexecuted at state s. Once learned, the function Q(s, a) allows the agent to select actions \nthat maximize the reward R (greedy policy). Hence learning control reduces to learning \nthe evaluation function Q.1 \nHow can the agent use its previously learned action models to focus its learning of Q? \nTo illustrate, consider the episode shown in Figure 1. The EBNN learning algorithm for \nlearning the target function Q consists of two components, an inductive learning component \nand an analytical learning component. 
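The greedy action selection described above can be sketched in a few lines. This is an illustrative assumption, not the paper's implementation: the Q-function is represented here as a dictionary mapping each discrete action to an estimator callable, and the toy Q-estimates are made up.

```python
# Hypothetical sketch: greedy action selection from a learned evaluation
# function Q(s, a) over a discrete action set. The dict-of-estimators
# representation and the toy Q-values below are assumptions for illustration.

def greedy_action(state, q_functions):
    """Return the discrete action whose estimated Q-value is highest in `state`.

    q_functions: dict mapping each action to a callable estimating Q(state, action).
    """
    return max(q_functions, key=lambda a: q_functions[a](state))

# Toy usage: two actions with hand-made Q estimates over a 1-D state.
q = {"left": lambda s: -abs(s - 1.0), "right": lambda s: -abs(s - 3.0)}
print(greedy_action(2.6, q))  # prints "right"
```

Selecting actions this way is what reduces control learning to learning the evaluation function Q, as noted in the text.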
\n\n2.1 The inductive component of EBNN \nThe observed episode is used by the agent to construct training examples, denoted by Q̂, \nfor the evaluation function Q: \n\nQ̂(s1, a1) := R     Q̂(s2, a2) := R     Q̂(s3, a3) := R \n\nQ could for example be realized by a monolithic neural network, or by a collection of networks \ntrained with the Backpropagation training procedure. As observed training episodes \nare accumulated, Q will become increasingly accurate. Such pure inductive learning typically \nrequires large amounts of training data (which will be costly in the case of robot \nlearning). \n\n1This approach to learning a policy is adopted from recent research on reinforcement learning \n[Barto et al., 1991]. \n\n\fFigure 1: Episode: Starting with the initial state s1, the action sequence a1, a2, a3 was observed to \nproduce the final reward R. The domain knowledge represented by neural network action models is \nused to post-facto predict and analyze each step of the observed episode. \n\n2.2 The analytical component of EBNN \n\nIn EBNN, the agent exploits its domain knowledge to extract additional shape knowledge \nabout the target function Q, to speed convergence and reduce the number of training \nexamples required. This shape knowledge, represented by the estimated slope of the target \nfunction Q, is then used to guide the generalization process. More specifically, EBNN \ncombines the above inductive learning component with an analytical learning component \nthat performs the following three steps for each observed training episode: \n\n1. Explain: Post-facto predict the observed episode (states and final reward), using the \naction models Mi (cf. Fig. 1). Note that there may be a deviation between predicted \nand observed states, since the domain knowledge is only approximately correct. \n\n2. 
Analyze: Analyze the explanation to estimate the slope of the target function for \neach observed state-action pair (sk, ak) (k = 1..3), i.e., extract the derivative of the \nfinal reward R with respect to the features of the states sk, according to the action \nmodels Mi. For instance, consider the explanation of the episode shown in Fig. 1. \nThe domain theory networks Mi represent differentiable functions. Therefore it is \npossible to extract the derivative of the final reward R with respect to the preceding \nstate s3, denoted by ∇s3 R. Using the chain rule of differentiation, the derivatives of \nthe final reward R with respect to all states sk can be extracted. These derivatives \n∇sk R describe the dependence of the final reward upon features of the previous states. \nThey provide the target slopes, denoted by ∇sk Q̂, for the target function Q: \n\n∇s3 Q̂(s3, a3) = ∇s3 R = ∂Ma3(s3)/∂s3 \n∇s2 Q̂(s2, a2) = ∇s2 R = ∂Ma3(s3)/∂s3 · ∂Ma2(s2)/∂s2 \n∇s1 Q̂(s1, a1) = ∇s1 R = ∂Ma3(s3)/∂s3 · ∂Ma2(s2)/∂s2 · ∂Ma1(s1)/∂s1 \n\n3. Learn: Update the learned target function to better fit both the target values and target \nslopes. Fig. 2 illustrates training information extracted by both the inductive (values) \nand the analytical (slopes) components of EBNN. Assume that the \"true\" Q-function \n\n\fFigure 2: Fitting slopes: Let f be a target function for which three examples (x1, f(x1)), (x2, f(x2)), \nand (x3, f(x3)) are known. Based on these points the learner might generate the hypothesis g. If the \nslopes are also known, the learner can do much better: h. \n\nis shown in Fig. 2a, and that three training instances at x1, x2 and x3 are given. When \nonly values are used for learning, i.e., as in standard inductive learning, the learner \nmight conclude the hypothesis g depicted in Fig. 2b. If the slopes are known as well, \nthe learner can better estimate the target function (Fig. 2c). 
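The chain-rule extraction of target slopes in the Analyze step can be sketched as follows. The linear "action models" (their Jacobians J1, J2) and the given final-reward gradient are hypothetical stand-ins for the learned networks Mi; the paper obtains these derivatives from the trained networks themselves.

```python
# Hypothetical sketch of the Analyze step: for an episode
# s1 -a1-> s2 -a2-> s3 -> R, chain the Jacobians of the action models
# backwards to obtain the target slopes grad_{s_k} R. Plain lists stand
# in for states; linear models stand in for the neural action models.

def vec_mat(v, M):
    """Row vector v times matrix M (plain Python lists)."""
    return [sum(v[i] * M[i][j] for i in range(len(v)))
            for j in range(len(M[0]))]

def extract_slopes(jacobians, grad_final_reward):
    """jacobians[k] = Jacobian of the k-th action model, dM_{a_k}(s_k)/ds_k;
    grad_final_reward = gradient of R at the final state.
    Returns [grad_{s_1} R, ..., grad_{s_n} R] by multiplying Jacobians
    backwards from the end of the episode (chain rule)."""
    slopes, grad = [], grad_final_reward
    for J in reversed(jacobians):
        grad = vec_mat(grad, J)   # one chain-rule step backwards
        slopes.append(grad)
    return list(reversed(slopes))

# Toy episode with 2-D states and made-up linear model Jacobians.
J1 = [[1.0, 0.5], [0.0, 1.0]]     # ds2/ds1
J2 = [[2.0, 0.0], [0.0, 0.5]]     # ds3/ds2
gR = [1.0, -1.0]                  # assumed gradient of R at the last state
print(extract_slopes([J1, J2], gR))
```

Each returned vector is the analytically derived training slope for Q at the corresponding state, exactly the product-of-derivatives structure of the equations above.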
From this example it is \nclear that the analysis in EBNN may reduce the need for training data, provided that \nthe estimated slopes extracted from the explanations are sufficiently accurate. \nIn EBNN, the function Q is learned by a real-valued function approximator that fits both \nthe target values and target slopes. If this approximator is a neural network, an extended \nversion of the Backpropagation algorithm can be employed to fit these slope constraints \nas well, as originally shown by [Simard et al., 1992]. Their algorithm \"Tangent Prop\" \nextends the Backpropagation error function by a second term measuring the mean \nsquare error of the slopes. Gradient descent in slope space is then combined with \nBackpropagation to minimize both error functions. In the experiments reported here, \nhowever, we used an instance-based function approximation technique described in \nSect. 3. \n\n2.3 Accommodating imperfect domain theories \n\nNotice that the slopes extracted from explanations will be only approximately correct, since \nthey are derived from the approximate action models Mi. If this domain knowledge is \nweak, the slopes can be arbitrarily poor, which may mislead generalization. \nEBNN reduces this undesired effect by estimating the accuracy of the extracted slopes \nand weighting the analytical component of learning by these estimated slope accuracies. \nGenerally speaking, the accuracy of slopes is estimated by the prediction accuracy of the \nexplanation (this heuristic has been named LOB*). More specifically, each time the domain \ntheory is used to post-facto predict a state sk+1, its prediction sk+1(predicted) may deviate from the \nobserved state sk+1(observed). Hence the 1-step prediction accuracy at state sk, denoted by c1(k), \nis defined as 1 minus the normalized prediction error: 
\n\nc1(k) := 1 - ||sk+1(predicted) - sk+1(observed)|| / max_prediction_error \n\nFor a given episode we define the n-step accuracy cn(k) as the product of the 1-step \naccuracies in the next n steps. The n-step accuracy, which measures the accuracy of the \nderived slopes n steps away from the end of the episode, possesses three desirable properties: \na. It is 1 if the learned domain theory is perfectly correct, b. it decreases monotonically as \nthe length of the chain of inferences increases, and c. it is bounded below by 0. The n-step \naccuracy is used to determine the ratio with which the analytical and inductive components \n\n\fare weighted when learning the target concept. If an observation is n steps away from the \nend of the episode, the analytically derived training information (slopes) is weighted by \nthe n-step accuracy times the weight of the inductive component (values). Although the \nexperimental results reported in section 3 are promising, the generality of this approach is \nan open question, due to the heuristic nature of the assumption LOB*. \n\n2.4 EBNN and Reinforcement Learning \n\nTo make EBNN applicable to robot learning, we extend it here to a more sophisticated \nscheme for learning the evaluation function Q, namely Watkins' Q-Learning [Watkins, \n1989] combined with Sutton's temporal difference methods [Sutton, 1988]. The reason \nfor doing so is the problem of suboptimal action choices in robot learning: Robots must \nexplore their environment, i.e., they must select non-optimal actions. Such non-optimal \nactions can have a negative impact on the final reward of an episode, which results in both \nunderestimating target values and misleading slope estimates. \nWatkins' Q-Learning [Watkins, 1989] permits non-optimal actions during the course of \nlearning Q. In his algorithm, targets for Q are constructed recursively, based on the 
\nmaximum possible Q-value at the next state:2 \n\nQ̂(sk, ak) = { R                            if k is the final step and R the final reward \n             { γ · maxa Q(sk+1, a)         otherwise \n\nHere γ (0 ≤ γ ≤ 1) is a discount factor that discounts reward over time, which is commonly \nused for minimizing the number of actions. Sutton's TD(λ) [Sutton, 1988] can be used to \ncombine both Watkins' Q-Learning and the non-recursive Q-estimation scheme underlying \nthe previous section. Here the parameter λ (0 ≤ λ ≤ 1) determines the ratio between recursive \nand non-recursive components: \n\nQ̂(sk, ak) = { R                                                        if k is the final step \n             { (1-λ) · γ · maxa Q(sk+1, a) + λ · γ · Q(sk+1, ak+1)     otherwise       (1) \n\nEq. (1) describes the extended inductive component of the EBNN learning algorithm. The \nextension of the analytical component in EBNN is straightforward. Slopes are extracted \nvia the derivative of Eq. (1), which is computed via the derivative of both the models Mi \nand the derivative of Q. \n\n3 Experimental results \n\nEBNN has been evaluated in a simulated robot navigation domain. The world and the \naction space are depicted in Fig. 3a&b. The learning task is to find a Q function, for which \nthe greedy policy navigates the agent to its goal location (circle) from arbitrary starting \nlocations, while avoiding collisions with the walls or the obstacle (square). States are \n\n2In order to simplify the notation, we assume that reward is only received at the end of the episode, \nand is also modeled by the action models. The extension to more general cases is straightforward. \n\n\fFigure 3: a. The simulated robot world. b. Actions. c. The squared generalization error of the \ndomain theory networks decreases monotonically as the amount of training data increases. 
These \nnine alternative domain theories were used in the experiments. \n\ndescribed by the local view of the agent, in terms of distances and angles to the center of \nthe goal and to the center of the obstacle. Note that the world is deterministic in these \nexperiments, and that there is no sensor noise. \nWe applied Watkins' Q-Learning and TD(λ) as described in the previous section with \nλ = 0.7 and a discount factor γ = 0.8. Each of the five actions was modeled by a separate \nneural network (12 hidden units) and each had a separate Q evaluation function. The \nlatter functions were represented by an instance-based local approximation technique. In a \nnutshell, this technique memorizes all training instances and their slopes explicitly, and fits \na local quadratic model over the k = 3 nearest neighbors to the query point, fitting both target \nvalues and target slopes. We found empirically that this technique outperformed Tangent \nProp in the domain at hand.3 We also applied an experience replay technique proposed by \nLin [Lin, 1991] in order to optimally exploit the information given by the observed training \nepisodes. \nFig. 4 shows average performance curves for EBNN using nine different domain theories \n(action models) trained to different accuracies, with (Fig. 4a) and without (Fig. 4b) taking \nthe n-step accuracy of the slopes into account. Fig. 4a shows the main result. It shows \nclearly that (1) EBNN outperforms purely inductive learning, (2) more accurate domain \ntheories yield better performance than less accurate theories, and (3) EBNN learning \ndegrades gracefully as the accuracy of the domain theory decreases, eventually matching the \nperformance of purely inductive learning. In the limit, as the size of the training data set \ngrows, we expect all methods to converge to the same asymptotic performance. \n\n4 Conclusion \n\nExplanation-based neural network learning, 
compared to purely inductive learning, generalizes \nmore accurately from less training data. It replaces the need for large training \ndata sets by relying instead on a previously learned domain theory, represented by neural \nnetworks. In this paper, EBNN has been described and evaluated in terms of robot learning \ntasks. Because the learned action models Mi are independent of the particular control task \n(reward function), this knowledge acquired during one task transfers directly to other tasks. \n\n3Note that in a second experiment not reported here, we applied EBNN using a neural network \nrepresentation for Q and Tangent Prop successfully in a real robot domain. \n\n\fFigure 4: How does domain knowledge improve generalization? a. Averaged results for EBNN \ndomain theories of differing accuracies, pre-trained with from 5 to 8192 training examples for each \naction model network. In contrast, the bold grey line reflects the learning curve for pure inductive \nlearning, i.e., Q-Learning and TD(λ). b. Same experiments, but without weighting the analytical \ncomponent of EBNN by its accuracy, illustrating the importance of the LOB* heuristic. All curves \nare averaged over 3 runs and are also locally window-averaged. The performance (vertical axis) is \nmeasured on an independent test set of starting positions. \n\nEBNN differs from other approaches to knowledge-based neural network learning, such \nas Shavlik/Towell's KBANNs [Shavlik and Towell, 1989], in that the domain knowledge \nand the target function are strictly separated, and that both are learned from scratch. 
A \nmajor difference from other model-based approaches to robot learning, such as Sutton's \nDYNA architecture [Sutton, 1990] or Jordan/Rumelhart's distal teacher method [Jordan \nand Rumelhart, 1990], is the ability of EBNN to operate across the spectrum of strong to \nweak domain theories (using LOB*). EBNN has been found to degrade gracefully as the \naccuracy of the domain theory decreases. \nWe have demonstrated the ability of EBNN to transfer knowledge among robot learning \ntasks. However, there are several open questions which will drive future research, the \nmost significant of which are: a. Can EBNN be extended to real-valued, parameterized \naction spaces? So far we assume discrete actions. b. Can EBNN be extended to handle \nfirst-order predicate logic, which is common in symbolic approaches to EBL? c. How will \nEBNN perform in highly stochastic domains? d. Can knowledge other than slopes (such \nas higher order derivatives) be extracted via explanations? e. Is it feasible to automatically \npartition/modularize the domain theory as well as the target function, as is the case with \nsymbolic EBL methods? More research on these issues is warranted. \n\nAcknowledgments \nWe thank Ryusuke Masuoka, Long-Ji Lin, the CMU Robot Learning Group, Jude Shavlik, \nand Mike Jordan for invaluable discussions and suggestions. This research was sponsored in \npart by the Avionics Lab, Wright Research and Development Center, Aeronautical Systems \nDivision (AFSC), U.S. Air Force, Wright-Patterson AFB, OH 45433-6543, under Contract \nF33615-90-C-1465, ARPA Order No. 7597, and by a grant from Siemens Corporation. \n\nReferences \n\n[Barto et al., 1991] Andy G. Barto, Steven J. Bradtke, and Satinder P. Singh. Real-time learning and \ncontrol using asynchronous dynamic programming. Technical Report COINS 91-57, Department \nof Computer Science, University of Massachusetts, MA, August 1991. 
\n\n[Baum and Haussler, 1989] Eric Baum and David Haussler. What size net gives valid generalization? \nNeural Computation, 1(1):151-160, 1989. \n\n[DeJong and Mooney, 1986] Gerald DeJong and Raymond Mooney. Explanation-based learning: \nAn alternative view. Machine Learning, 1(2):145-176, 1986. \n\n[Jordan and Rumelhart, 1990] Michael I. Jordan and David E. Rumelhart. Forward models: Supervised \nlearning with a distal teacher. Submitted to Cognitive Science, 1990. \n\n[Lin, 1991] Long-Ji Lin. Programming robots using reinforcement learning and teaching. In Proceedings \nof AAAI-91, Menlo Park, CA, July 1991. AAAI Press / The MIT Press. \n\n[Mitchell et al., 1986] Tom M. Mitchell, Rich Keller, and Smadar Kedar-Cabelli. Explanation-based \ngeneralization: A unifying view. Machine Learning, 1(1):47-80, 1986. \n\n[Pratt, 1993] Lori Y. Pratt. Discriminability-based transfer between neural networks. Same volume. \n\n[Rumelhart et al., 1986] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning \ninternal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, \nParallel Distributed Processing, Vol. I + II. MIT Press, 1986. \n\n[Shavlik and Towell, 1989] Jude W. Shavlik and G. G. Towell. An approach to combining \nexplanation-based and neural learning algorithms. Connection Science, 1(3):231-253, 1989. \n\n[Simard et al., 1992] Patrice Simard, Bernard Victorri, Yann LeCun, and John Denker. Tangent prop \n- a formalism for specifying selected invariances in an adaptive network. In J. E. Moody, S. J. \nHanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, \npages 895-903, San Mateo, CA, 1992. Morgan Kaufmann. \n\n[Sutton, 1988] Richard S. Sutton. Learning to predict by the methods of temporal differences. \nMachine Learning, 3, 1988. \n\n[Sutton, 1990] Richard S. Sutton. 
Integrated architectures for learning, planning, and reacting based \non approximating dynamic programming. In Proceedings of the Seventh International Conference \non Machine Learning, pages 216-224, 1990. \n\n[Valiant, 1984] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27:1134-1142, 1984. \n\n[Watkins, 1989] Chris J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's \nCollege, Cambridge, England, 1989. \n\n\f", "award": [], "sourceid": 614, "authors": [{"given_name": "Tom", "family_name": "Mitchell", "institution": null}, {"given_name": "Sebastian", "family_name": "Thrun", "institution": null}]}