{"title": "Extracting Rules from Artificial Neural Networks with Distributed Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 505, "page_last": 512, "abstract": "", "full_text": "Extracting Rules from Artificial Neural Networks \n\nwith Distributed Representations \n\nSebastian Thrun \nUniversity of Bonn \n\nDepartment of Computer Science III \n\nRomerstr. 164, D-53117 Bonn, Germany \n\nE-mail: thrun@carbon.informatik.uni-bonn.de \n\nAbstract \n\nAlthough artificial neural networks have been applied in a variety of real-world scenarios \nwith remarkable success, they have often been criticized for exhibiting a low degree of \nhuman comprehensibility. Techniques that compile compact sets of symbolic rules out \nof artificial neural networks offer a promising perspective to overcome this obvious \ndeficiency of neural network representations. \nThis paper presents an approach to the extraction of if-then rules from artificial neu(cid:173)\nIts key mechanism is validity interval analysis, which is a generic \nral networks. \ntool for extracting symbolic knowledge by propagating rule-like knowledge through \nBackpropagation-style neural networks. Empirical studies in a robot arm domain illus(cid:173)\ntrate the appropriateness of the proposed method for extracting rules from networks with \nreal-valued and distributed representations. \n\nIntroduction \n\n1 \nIn the last few years artificial neural networks have been applied successfully to a variety \nof real-world problems. For example, neural networks have been successfully applied in \nthe area of speech generation [12] and recognition [18], vision and robotics [8], handwritten \ncharacter recognition [5], medical diagnostics [11], and game playing [13]. While in these \nand other approaches neural networks have frequently found to outperform more traditional \napproaches, one of their major shortcomings is their low degree of human comprehensibility. 
\nIn recent years, a variety of approaches for compiling rules out of networks have been \nproposed. Most approaches [1, 3,4,6, 7, 16, 17] compile networks into sets of rules with \nequivalent structure: Each processing unit is mapped into a separate rule-or a smal1 set \nof rules-, and the ingoing weights are interpreted as preconditions to this rule. Sparse \nconnectivity facilitates this type rule extraction, and so do binary activation values. In order \nto enforce such properties, which is a necessary prerequisite for these techniques to work \neffectively, some approaches rely on specialized training procedures, network initializations \n\n\f506 \n\nSebastian Thrun \n\nand/or architectures. \nWhile such a methodology is intriguing, as it draws a clear one-to-one correspondence \nbetween neural inference and rule-based inference, it is not universally applicable to arbitrary \nBackpropagation-style neural networks. This is because artificial neural networks might not \nmeet the strong representational and structural requirements necessary for these techniques \nto work successfully. When the internal representation of the network is distributed in nature, \nindividual hidden units typically do not represent clear, logical entities. One might argue that \nnetworks, if one is interested in extracting rules, should be constructed appropriately. But this \nwould outrule most existing network implementation~, as such considerations have barely \nplayed a role. In addition, such an argument would suppress the development of distributed, \nnon-discrete internal representations, which have often be attributed for the generalization \nproperties of neural networks. It is this more general class of networks that is at stake in this \npaper. \nThis paper presents a rule extraction method which finds rules by analyzing networks as a \nwhole. The rules are of the type \"if X then y,\" where both x and y are described by a linear set \nof constraints. 
The engine for proving the correspondence of rule and network classification is VI-Analysis. Rules extracted by VI-Analysis can be proven to exactly describe the network. \n\n2 Validity-Interval Analysis \n\nValidity Interval Analysis (in short: VI-Analysis) is a generic tool for analyzing the input-output behavior of Backpropagation-style neural networks. In short, the key idea of VI-Analysis is to attach intervals to the activation range of each unit (or a subset of all units, like input and output units only), such that the network's activations must lie within these intervals. These intervals are called validity intervals. VI-Analysis checks whether such a set of intervals is consistent, i.e., whether there exists a set of network activations inside the validity intervals. It does this by iteratively refining the validity intervals, excluding activations that are provably inconsistent with other intervals. In what follows we will present the general VI-Analysis algorithm, which can be found in more detail elsewhere [14]. \nLet n denote the total number of units in the network, and let x_i denote the (output) activation of unit i (i = 1, ..., n). If unit i is an input unit, its activation value will simply be the external input value. If not, i.e., if i refers to a hidden or an output unit, let P(i) denote the set of units that are connected to unit i through a link. The activation x_i is computed in two steps: \n\nnet_i = Σ_{k ∈ P(i)} w_ik x_k + θ_i    and    x_i = σ(net_i) \n\nThe auxiliary variable net_i is the net-input of unit i, and w_ik and θ_i are the weights and biases, respectively. σ denotes the transfer function (squashing function), which usually is given by \n\nσ(net_i) = 1 / (1 + e^{-net_i}) \n\nValidity intervals for activation values x_i are denoted by [a_i, b_i]. If necessary, validity intervals are projected into the net-input space of unit i, where they will be denoted by [a'_i, b'_i]. 
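The two-step activation computation above can be sketched in a few lines of Python (a minimal illustration with made-up helper names, not the implementation used in the paper):

```python
import math

def net_input(weights, bias, activations):
    # net_i = sum over predecessors k of w_ik * x_k, plus the bias theta_i
    return sum(w * x for w, x in zip(weights, activations)) + bias

def sigmoid(net):
    # squashing function: sigma(net) = 1 / (1 + e^(-net))
    return 1.0 / (1.0 + math.exp(-net))

# Example: a unit with two predecessor activations
x = sigmoid(net_input([0.5, -1.0], 0.25, [1.0, 0.5]))
```

The sigmoid is strictly monotonic, which is what later allows activation intervals to be mapped one-to-one into net-input intervals.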
Let \nT be a set of validity intervals for (a subset of) all units. An activation vector (XI, .. \" xn) \nis said to be admissible with respect to T, if all activations lie in T. A set of intervals T is \nconsistent, if there exists an admissible activation vector. Otherwise T is inconsistent. \nAssume an initial set of intervals, denoted by T, is given (in the next section we will present \na procedure for generating initial intervals). VI-Analysis refines T iteratively using linear \n\n\fExtracting Rules from Artificial Neural Networks with Distributed Representations \n\n507 \n\nnon-linear .<;quashing functioll' CJ \n\nlinear equations \n\nFigure 1: VI-Analysis in a single weight layer. Units in layer P are connected to the units \nin layer S. A validity interval [aj, bj ] is assigned to each unit j E PuS. By projecting the \nvalidity intervals for all i E S, intervals [a~, b~] for the net-inputs netj are created. These, plus \nthe validity intervals for all units k E P, form a set of linear constraints on the activations x k \nin layer P. Linear programming is now employed to refine all interval bounds one-by-one. \n\nprogramming [9], so that those activation values which are inconsistent with other intervals \nare excluded. In order to simplify the presentation, let us assume without loss of \u00a5enerality \n(a) that the network is layered and fully connected between two adjacent layers, and (b) \nthat there is an interval [aj, bj ] ~ [0,1] in I for every unit in P and S.2 Consider a single \nweight layer, connecting a layer of preceding units, denoted by p, to a layer of succeeding \nunits, denoted by S (cf Fig. 1). In order to make linear programming techniques applicable, \nthe non-linearity of the transfer function must be eliminated. This is achieved by projecting \n[ai, bi ] back to the corresponding net-input intervals3 [ai, biJ = {T-I([ai' biD E ~2 for all \ni E S. 
The resulting validity intervals in P and S form the following set of linear constraints on the activation values in P: \n\n∀k ∈ P:  x_k ≥ a_k  and  x_k ≤ b_k \n∀i ∈ S:  Σ_{k ∈ P} w_ik x_k + θ_i ≥ a'_i    [by substituting net_i = Σ_{k ∈ P} w_ik x_k + θ_i] \n∀i ∈ S:  Σ_{k ∈ P} w_ik x_k + θ_i ≤ b'_i    (1) \n\nNotice that all these constraints are linear in the activation values x_k (k ∈ P). Linear programming allows one to maximize or minimize arbitrary linear combinations of the variables x_k while not violating a set of linear constraints [9]. Hence, linear programming can be applied to refine lower and upper bounds for validity intervals one-by-one. \nIn VI-Analysis, constraints are propagated in two phases: \n\n1. Forward phase. To refine the bounds a_i and b_i for units i ∈ S, new bounds â_i and b̂_i are derived: \n\nâ'_i = min net_i = min Σ_{k ∈ P} w_ik x_k + θ_i    and    b̂'_i = max net_i = max Σ_{k ∈ P} w_ik x_k + θ_i \n\nProjecting these net-input bounds back through σ yields â_i = σ(â'_i) and b̂_i = σ(b̂'_i). If â_i > a_i, a tighter lower bound is found and a_i is updated by â_i. Likewise, b_i is set to b̂_i if b̂_i < b_i. Notice that the min/max operator is computed within the bounds imposed by Eq. 1, using the Simplex algorithm (linear programming) [9]. \n\n2. Backward phase. In the backward phase the bounds a_k and b_k of all units k ∈ P are refined: \n\nâ_k = min x_k    and    b̂_k = max x_k \n\nAs in the forward phase, a_k is updated by â_k if â_k > a_k, and b_k is updated by b̂_k if b̂_k < b_k. \n\nIf the network has multiple weight layers, this process is applied to all weight layers one-by-one. Repetitive refinement results in the propagation of interval constraints through multiple layers in both directions. The convergence of VI-Analysis follows from the fact that intervals are changed monotonically: they can only shrink or stay the same. \nRecall that the \"input\" of VI-Analysis is a set of intervals I ⊆ [0, 1]^n that constrain the activations of the network. VI-Analysis generates a refined set of intervals, I' ⊆ I, so that all admissible activation values in the original intervals I are also in the refined intervals I'. In other words, the difference between the original set of intervals and the refined set of intervals, I − I', is inconsistent. \nIn summary, VI-Analysis analyzes intervals I in order to detect inconsistencies. If I is found to be inconsistent, there is provably no admissible activation vector in I. Detecting inconsistencies is the driving mechanism for the verification and extraction of rules presented in turn. \n\n¹This assumption simplifies the description of VI-Analysis, although VI-Analysis can also be applied to arbitrary non-layered, partially connected network architectures, as well as recurrent networks not examined here. \n²The canonical interval [0, 1] corresponds to the state of maximum ignorance about the activation of a unit, and hence is the default interval if no more specific interval is known. \n³Here ℝ̄ denotes the set of real numbers extended by ±∞. Notice that this projection assumes that the transfer function is monotonic. \n\n3 Rule Extraction \n\nThe rules considered in this paper are propositional if-then rules. Although VI-Analysis is able to prove rules expressed by arbitrary linear constraints [14], for the sake of simplicity we will consider only rules where the precondition is given by a set of intervals for the individual input values, and the output is a single target category. Rules of this type can be written as: \n\nIf input ∈ some hypercube I then class is C    (or short: I → C) \n\nfor some target class C. \nThe compliance of a rule with the network can be verified through VI-Analysis. 
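Per weight layer, the refinement step reduces to small linear programs over the constraints of Eq. (1). The backward phase can be sketched with an off-the-shelf LP solver; here scipy.optimize.linprog stands in for the Simplex code of [9], and all names are illustrative, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import linprog

def refine_predecessor_bounds(W, theta, box, net_box):
    """Backward phase for one weight layer (illustrative sketch).

    W[i, k]  : weight from predecessor k to successor i
    theta[i] : bias of successor i
    box      : list of (a_k, b_k) activation intervals for predecessors
    net_box  : list of (a'_i, b'_i) net-input intervals for successors
    Returns refined predecessor intervals, or None if the intervals
    are inconsistent (no admissible activation vector exists).
    """
    n_s, n_p = W.shape
    # Eq. (1): a'_i <= sum_k W[i,k] x_k + theta[i] <= b'_i, as A x <= b
    A = np.vstack([W, -W])
    b = np.concatenate([np.array([hi for _, hi in net_box]) - theta,
                        theta - np.array([lo for lo, _ in net_box])])
    refined = []
    for k in range(n_p):
        c = np.zeros(n_p); c[k] = 1.0
        lo = linprog(c, A_ub=A, b_ub=b, bounds=box, method="highs")
        hi = linprog(-c, A_ub=A, b_ub=b, bounds=box, method="highs")
        if not (lo.success and hi.success):
            return None  # infeasible LP: inconsistency detected
        a_k, b_k = box[k]
        refined.append((max(a_k, lo.fun), min(b_k, -hi.fun)))
    return refined
```

An infeasible linear program is exactly the inconsistency signal that drives rule verification: attaching the negated output interval and obtaining `None` proves the rule.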
Assume, \nwithout loss of generality, the network has a single output unit, and input patterns are classified \nas members of class C if and only if the output activation, Xout, is larger than a threshold \ne (see [14] for networks with multiple output units). A rule conjecture I -- C is then \nverified by showing that there is no input vector i E I that falls into the opposite class, \n,C. This is done by including the (negated) condition Xout E [0, e] into the set of intervals: \nIneg = 1+ {xout E [0, e]}. If the rule is correct, Xout will never be in [0, e]. Hence, if \nVI-Analysis finds an inconsistency in Ineg, the rule I -- ,C is proven to be incorrect, and \nthus the original rule I -- C holds true for the network at hand. This illustrates how rules \nare verified using VI-Analysis. \nIt remains to be shown how such conjectures can be generated in a systematic way. Two \nmajor classes of approaches can be distinguished, specific-to-general and general-to-specific. \n\n\fExtracting Rules from Artificial Neural Networks with Distributed Representations \n\n509 \n\nFigure 2: Robot Ann. (a) Front view of two arm configurations. (b) Two-dimensional side \nview. The grey area indicates the workspace, which partially intersects with the table. \n\n1. Specific-to-general. A generic way to generate rules, which forms the basis for the \nexperimental results reported in the next section, is to start with rather specific rules which \nare easy to verify, and gradually generalize those rules by enlarging the corresponding \nvalidity intervals. Imagine one has a training instance that, without loss of generality, falls \ninto a class C. The input vector of the training instance already forms a (degenerate) set \nof validity intervals I. VI-Analysis will, applied to I, trivially confirm the membership in \nC, and hence the single-point rule I ~ C. Starting with I, a sequence of more general \nrule preconditions I C II C I2 C ... 
can be obtained by enlarging the precondition of the rule (i.e., the input intervals I) by small amounts, and using VI-Analysis to verify whether the new rule is still a member of its class. In this way randomly generated instances can be used as \"seeds\" for rules, which are then generalized via VI-Analysis. \n\n2. General-to-specific. An alternative way to extract rules, which has been studied in more detail elsewhere [14], works from general to specific. General-to-specific rule search maintains a list of non-proven conjectures, R. R is initialized with the most general rules (like \"everything is in C\" and \"nothing is in C\"). VI-Analysis is then applied to prove rules in R. If it successfully confirms a rule, the rule and its complement are removed from R. If not, the rule is removed, too, but new rules are added to R in its place. These new rules form a specialized version of the old rule, so that their disjunction is exactly the old rule. For example, new rules can be generated by splitting the hypercube spanned by the old rule into disjoint regions, one for each new rule. Then, the new set R is checked with VI-Analysis. The whole procedure continues until R is empty and the whole input domain is described by rules. In discrete domains, such a strategy amounts to searching directed acyclic graphs in breadth-first manner. \n\nObviously, there is a variety of alternative techniques for generating meaningful rule hypotheses. For example, one might apply a symbolic learning technique such as decision tree learning [10] to the same training data that was used for training the network. The rules that result from the symbolic approach constitute hypotheses that can be checked using VI-Analysis. \n\n4 Empirical Results \n\nIn this section we are interested in extracting rules in a real-valued robot arm domain. We trained a neural network to model the forward kinematics function of a 5 degree-of-freedom robot arm. 
The arm, a Mitsubishi RV-M1, is depicted in Fig. 2. Its kinematic function determines the position of the tip of the manipulator in (x, y, z) workspace coordinates and the angle h of the manipulator to the table, based on the angles of the five joints. As can be seen in Fig. 2, the workspace intersects with the table on which the arm is mounted. Hence, some configurations of the joints are safe, namely those for which z ≥ 0, while others can physically not be reached without a collision that would damage the robot (unsafe). When operating the robot arm one has to be able to tell safe from unsafe. Henceforth, we are interested in a set of rules that describes the subspace of safe and unsafe joint configurations. \n\nTable 1: Rule coverage in the robot arm domain. These numbers include rules for both concepts, SAFE and UNSAFE. \n\ncoverage             average (per rule)   cumulative \nfirst 10 rules       9.79%                30.2% \nfirst 100 rules      2.59%                47.8% \nfirst 1,000 rules    1.20%                61.6% \nfirst 10,000 rules   0.335%               84.4% \n\nA total of 8192 training examples was used for training the network (four input, five hidden and four output units), resulting in a fairly accurate model of the kinematics of the robot arm. Notice that the network operates in a continuous space. Obviously, compiling the network into logical rules node-by-node, as frequently done in other approaches to rule extraction, is difficult due to the real-valued and distributed nature of the internal representation. Instead, we applied VI-Analysis using a specific-to-general mechanism as described above. More specifically, we incrementally constructed a collection of rules that gradually covered the workspace of the robot arm. Rules were generated whenever a (random) joint configuration was not covered by a previously generated rule. Table 1 shows average results that characterize the extraction of rules. 
Initially, each rule covers a rather large fraction of the 5-dimensional joint configuration space. As few as 11 rules, on average, suffice to cover more than 50% (by volume) of the whole input space. However, these 50% are the easy half. As the domain gets increasingly covered by rules, gradually more specific rules are generated in regions closer to the class boundary. After extracting 10,000 rules, only 84.4% of the input space is covered. Since the decision boundary between the two classes is highly non-linear, finitely many rules will never cover the input space completely. \nHow general are the rules extracted by VI-Analysis? Generally speaking, for joint configurations close to the class boundary, i.e., where the tip of the manipulator is close to the table, we observed that the extracted rules were rather specific. If instead the initial configuration was closer to the center of a class, VI-Analysis was observed to produce more general rules that had a larger coverage in the workspace. Here VI-Analysis managed to extract surprisingly general rules. For example, the configuration α = (30°, 80°, 20°, 60°, −20°), which is depicted in Fig. 3, yields the rule \n\nIf α_2 ≤ 90.5° and α_3 ≤ 27.3° then SAFE. \n\nNotice that out of 10 initial constraints, 8 were successfully removed by VI-Analysis. The rule lacks both bounds on α_1, α_4 and α_5, and the lower bounds on α_2 and α_3. Fig. 3a shows the front view of the initial arm configuration and the generalized rule (grey area). Fig. 3b shows a side view of the arm, along with a slice of the rule (the base joint α_1 is kept fixed). Notice that this very rule covers 17.1% of the configuration space (by volume). Such general rules were frequently found in the robot arm domain. \nThis concludes the brief description of the experimental results. Not mentioned here are results with different size networks, and results obtained for the MONK's benchmark problems. 
For \nexample, in the MONK's problems [15], VI-Analysis successfully extracted compact target \n\n\fExtracting Rules from Artificial Neural Networks with Distributed Representations \n\n511 \n\nFigure 3: A single rule, extracted from the network. (a) Front view. (b) Two-dimensional \nside view. The grey area indicates safe positions for the tip of the manipulator. \n\nconcepts using the originally published weight sets. These results can be found in [14]. \n\n5 Discussion \n\nIn this paper we have presented a mechanism for the extraction of rules from Backpropagation(cid:173)\nstyle neural networks. There are several limitations of the current approach that warrant future \nresearch. (a) Speed. While the one-to-one compilation of networks into rules is fast, rule \nextraction via VI-Analysis requires mUltiple runs of linear programming, each of which can \nbe computationally expensive [9]. Searching the rule space without domain-specific search \nheuristics can thus be a most time-consuming undertaking. In all our experiments, however, \nwe observed reasonably fast convergence of the VI-Algorithm, and we successfully managed \nto extract rules from larger networks in reasonable amounts of time. Recently, Craven and \nShavlik proposed a more efficient search method which can be applied in conjunction with \nVI-Analysis [2]. (b) Language. Currently VI-Analysis is limited to the extraction of if-then \nrules with linear preconditions. While in [14] it has been shown how to generalize VI-Analysis \nto rules expressed by arbitrary linear constraints, a more powerful rule language is clearly \ndesirable. (c) Linear optimization. Linear programming analyzes multiple weight layers \nindependently, resulting in an overly careful refinement of intervals. This effect can prevent \nfrom detecting correct rules. If linear programming is replaced by a non-linear optimization \nmethod that considers multiple weight layers simultaneously, more powerful rules can be \ngenerated. 
On the other hand, efficient non-linear optimization techniques might find rules which do not describe the network accurately. Moreover, it is generally questionable whether there will ever exist techniques for mapping arbitrary networks accurately into compact rule sets. Neural networks are their own best description, and symbolic rules might not be appropriate for describing the input-output behavior of a complex neural network. \nA key feature of the approach presented in this paper is the particular way rules are extracted. Unlike other approaches to the extraction of rules, this mechanism does not compile networks into structurally equivalent sets of rules. Instead it analyzes the input-output relation of networks as a whole. As a consequence, rules can be extracted from unstructured networks with distributed and real-valued internal representations. In addition, the extracted rules describe the neural network accurately, regardless of the size of the network. We conjecture that such properties are important if meaningful rules are to be extracted in today's and tomorrow's successful Backpropagation applications. \n\nAcknowledgment \n\nThe author wishes to express his gratitude to Mark Craven, Tom Dietterich, Clayton McMillan, Tom Mitchell and Jude Shavlik for their invaluable feedback that has influenced this research. \n\nReferences \n\n[1] M. W. Craven and J. W. Shavlik. Learning symbolic rules using artificial neural networks. In Paul E. Utgoff, editor, Proceedings of the Tenth International Conference on Machine Learning, 1993. Morgan Kaufmann. \n[2] M. W. Craven and J. W. Shavlik. Using sampling and queries to extract rules from trained neural networks. 
In Proceedings of the Eleventh International Conference on Machine Learning, 1994. Morgan Kaufmann. \n[3] L.-M. Fu. Integration of neural heuristics into knowledge-based inference. Connection Science, 1(3):325-339, 1989. \n[4] C. L. Giles and C. W. Omlin. Rule refinement with recurrent neural networks. In Proceedings of the IEEE International Conference on Neural Networks, 1993. IEEE Neural Network Council. \n[5] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541-551, 1989. \n[6] J. J. Mahoney and R. J. Mooney. Combining neural and symbolic learning to revise probabilistic rule bases. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 5, 1993. Morgan Kaufmann. \n[7] C. McMillan, M. C. Mozer, and P. Smolensky. Rule induction through integrated symbolic and subsymbolic processing. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, 1992. Morgan Kaufmann. \n[8] D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. Technical Report CMU-CS-89-107, Computer Science Dept., Carnegie Mellon University, Pittsburgh, PA, 1989. \n[9] W. H. Press. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge, New York, 1988. \n[10] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986. \n[11] J. Rennie. Cancer catcher: Neural net catches errors that slip through pap tests. Scientific American, 262, May 1990. \n[12] T. J. Sejnowski and C. R. Rosenberg. NETtalk: A parallel network that learns to read aloud. Technical Report JHU/EECS-86/01, Johns Hopkins University, 1986. \n[13] G. J. Tesauro. Practical issues in temporal difference learning. Machine Learning, 8, 1992. \n[14] S. Thrun. 
Extracting provably correct rules from artificial neural networks. Technical Report IAI-TR-93-5, University of Bonn, Institut für Informatik III, D-53117 Bonn, May 1993. \n[15] S. Thrun, J. Bala, E. Bloedorn, I. Bratko, B. Cestnik, J. Cheng, K. De Jong, S. Dzeroski, D. Fisher, S. E. Fahlman, R. Hamann, K. Kaufman, S. Keller, I. Kononenko, J. Kreuziger, R. S. Michalski, T. M. Mitchell, P. Pachowicz, Y. Reich, H. Vafaie, W. Van de Welde, W. Wenzel, J. Wnek, and J. Zhang. The MONK's problems - a performance comparison of different learning algorithms. Technical Report CMU-CS-91-197, Carnegie Mellon University, Pittsburgh, PA, December 1991. \n[16] G. Towell and J. W. Shavlik. Interpretation of artificial neural networks: Mapping knowledge-based neural networks into rules. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, 1992. Morgan Kaufmann. \n[17] V. Tresp and J. Hollatz. Network structuring and training using rule-based knowledge. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 5, 1993. Morgan Kaufmann. \n[18] A. H. Waibel. Modular construction of time-delay neural networks for speech recognition. Neural Computation, 1:39-46, 1989. \n", "award": [], "sourceid": 924, "authors": [{"given_name": "Sebastian", "family_name": "Thrun", "institution": null}]}