{"title": "Template-Based Algorithms for Connectionist Rule Extraction", "book": "Advances in Neural Information Processing Systems", "page_first": 609, "page_last": 616, "abstract": null, "full_text": "Template-Based Algorithms for \nConnectionist Rule Extraction \n\nJay A. Alexander and Michael C. Mozer \n\nDepartment of Computer Science and \n\nInstitute for Cognitive Science \n\nUniversity of Colorado \n\nBoulder, CO 80309--0430 \n\nAbstract \n\nCasting neural network weights in symbolic terms is crucial for \ninterpreting and explaining the behavior of a network. Additionally, in \nsome domains, a symbolic description may lead to more robust \ngeneralization. We present a principled approach to symbolic rule \nextraction based on the notion of weight templates, parameterized \nregions of weight space corresponding to specific symbolic expressions. \nWith an appropriate choice of representation, we show how template \nparameters may be efficiently identified and instantiated to yield the \noptimal match to a unit's actual weights. Depending on the requirements \nof the application domain, our method can accommodate arbitrary \ndisjunctions and conjunctions with O(k) complexity, simple n-of-m \nexpressions with O( k!) complexity, or a more general class of recursive \nn-of-m expressions with O(k!) complexity, where k is the number of \ninputs to a unit. Our method of rule extraction offers several benefits \nover alternative approaches in the literature, and simulation results on a \nvariety of problems demonstrate its effectiveness. \n\nINTRODUCTION \n\n1 \nThe problem of understanding why a trained neural network makes a given decision has a \nlong history in the field of connectionist modeling. One promising approach to this prob(cid:173)\nlem is to convert each unit's weights and/or activities from continuous numerical quantities \ninto discrete, symbolic descriptions [2, 4, 8]. 
This type of reformulation, or rule extraction, can both explain network behavior and facilitate transfer of learning. Additionally, in intrinsically symbolic domains, there is evidence that a symbolic description can lead to more robust generalization [4].

We are interested in extracting symbolic rules on a unit-by-unit basis from connectionist nets that employ the conventional inner product activation and sigmoidal output functions. The basic language of description for our rules is that of n-of-m expressions. An n-of-m expression consists of a list of m subexpressions and a value n such that 1 ≤ n ≤ m. The overall expression is true when at least n of the m subexpressions are true. An example of an n-of-m expression stated using logical variables is the majority voter function X = 2 of (A, B, C). N-of-m expressions are interesting because they are able to model behaviors intermediate to standard Boolean OR (n = 1) and AND (n = m) functions. These intermediate behaviors reflect a limited form of two-level Boolean logic. (To see why this is true, note that the expression for X above is equivalent to AB + BC + AC.) In a later section we describe even more general behaviors that can be represented using recursive forms of these expressions. N-of-m expressions fit well with the activation behavior of sigmoidal units, and they are quite amenable to human comprehension.

To extract an n-of-m rule from a unit's weights, we follow a three-step process. First we generate a minimal set of candidate templates, where each template is parameterized to represent a given n-of-m expression. Next we instantiate each template's parameters with optimal values. Finally we choose the symbolic expression whose instantiated template is nearest to the actual weights. Details on each of these steps are given below.
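The threshold semantics of n-of-m expressions can be illustrated with a short script (a sketch of our own; the `n_of_m` helper is not from the paper). It checks exhaustively that the majority voter 2 of (A, B, C) equals the two-level form AB + BC + AC:

```python
from itertools import product

def n_of_m(n, subexprs):
    """True when at least n of the m subexpression values are true."""
    return sum(bool(s) for s in subexprs) >= n

# The majority voter X = 2 of (A, B, C) equals the two-level form AB + BC + AC.
for a, b, c in product([False, True], repeat=3):
    assert n_of_m(2, [a, b, c]) == ((a and b) or (b and c) or (a and c))
```

Note how the same helper covers OR and AND as the boundary cases n = 1 and n = m.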

2 TEMPLATE-BASED RULE EXTRACTION

2.1 Background

Following McMillan [4], we define a weight template as a parameterized region of weight space corresponding to a specific symbolic function. To see how weight templates can be used to represent symbolic functions, consider the weight vector for a sigmoidal unit with four inputs and a bias:

w = [ w1  w2  w3  w4  b ]

Now consider the following two template vectors:

t1 = [ -p   p   0  -p   1.5p ]
t2 = [  p  -p   p   p  -0.5p ]

These templates are parameterized by the variable p. Given a large positive value of p (say 5.0) and an input vector I (whose components are approximately 0 and 1), t1 describes the symbolic expression 1 of (¬I1, I2, ¬I4), while t2 describes the symbolic expression 2 of (I1, ¬I2, I3, I4). A general description for n-of-m templates of this form is the following:

1. m of the weight values are set to ±p, p > 0; all others are set to 0.
   (+p is used for normal subexpressions, -p for negated subexpressions)

2. The bias value is set to (0.5 + m_neg - n)p, where m_neg represents the number of negated subexpressions.

When the inputs are Boolean with values -1 and +1, the form of the templates is the same, except the template bias takes the value (1 + m - 2n)p. This seemingly trivial difference turns out to have a significant effect on the efficiency of the extraction process.

2.2 Basic extraction algorithm

Generating candidate templates

Given a sigmoidal unit with k inputs plus a bias, the total number of n-of-m expressions that unit may compute is an exponential function of k:

T = Σ_{m=1}^{k} Σ_{n=1}^{m} 2^m C(k,m) = Σ_{m=1}^{k} 2^m k! / ((k-m)! (m-1)!) = 2k·3^(k-1)

For example, T for k=10 is 393,660, while T for k=20 is over 46 billion.
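The exponential count above can be reproduced by brute-force summation (a sketch of our own; `total_expressions` is a hypothetical helper name, and `math.comb` supplies the binomial coefficient):

```python
from math import comb

def total_expressions(k):
    # For each choice of m inputs out of k: a sign (negation) pattern for
    # each of the 2^m literals, and one of m possible thresholds n.
    return sum(m * 2**m * comb(k, m) for m in range(1, k + 1))

assert total_expressions(10) == 2 * 10 * 3**9   # 393,660
assert total_expressions(20) == 2 * 20 * 3**19  # over 46 billion
```

The closed form follows from the binomial identity Σ_m m·C(k,m)·x^m = kx(1+x)^(k-1) evaluated at x = 2.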
Fortunately we can apply knowledge of the unit's actual weights to explore this search space without generating a template for each possible n-of-m expression. Alexander [1] proves that when the -1/+1 input representation is used, we need consider at most one template for each possible choice of n and m. For a given choice of n and m, a template is indicated when sign(1 + m - 2n) = sign(b). A required template is formed by setting the template weights corresponding to the m highest absolute value actual weights to sp, where s represents the sign of the corresponding actual weight. The template bias is set to (1 + m - 2n)p. This reduces the number of templates required to a polynomial function of k:

T = Σ_{m=1}^{k} ⌈m/2⌉ = ⌊(k+1)^2 / 4⌋

Values of T for k=10 and k=20 are now 30 and 110, respectively, making for a very efficient pruning of the search space. When 0/1 inputs are used, this simple procedure does not suffice and many more templates must be generated. For this reason, in the remainder of this paper we focus on the -1/+1 case and assume the use of symmetric sigmoid functions.

Instantiating template parameters

Instantiating a weight template t requires finding a value for p such that the Euclidean distance d = ||t - w||^2 is minimized. Letting u_j = 1 if template weight t_j is nonzero, u_j = 0 otherwise, the value of p that minimizes this distance for any -1/+1 template is given by:

       Σ_{j=1}^{k} |w_j| u_j + (1 + m - 2n) b
p* = -------------------------------------------
              m + (1 + m - 2n)^2

Finding the nearest template and checking extraction validity

Once each template is instantiated with its value of p*, the distance between the template and the actual weight vector is calculated, and the minimal distance template is selected as the basis for rule extraction. Having found the nearest template t*, we can use its values as part of a rudimentary check on extraction validity.
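The three steps for the -1/+1 representation can be sketched as follows (a minimal implementation of our own, not the authors' code; names are ours, and zero-bias templates and ties are handled naively):

```python
import numpy as np

def extract_simple_rule(w, b):
    """Find the nearest simple n-of-m template to weights w and bias b,
    assuming the -1/+1 input coding.  Returns (n, m, p_star, distance)."""
    w = np.asarray(w, dtype=float)
    k = len(w)
    order = np.argsort(-np.abs(w))           # inputs by descending |w_i|
    best = None
    for m in range(1, k + 1):
        for n in range(1, m + 1):
            c = 1 + m - 2 * n                # template bias is c * p
            if np.sign(c) != np.sign(b):     # template indicated only on sign match
                continue
            sel = order[:m]                  # m largest-magnitude weights get +/- p
            p = (np.abs(w[sel]).sum() + c * b) / (m + c * c)   # least-squares p*
            if p <= 0:
                continue
            t = np.zeros(k)
            t[sel] = np.sign(w[sel]) * p
            d = np.sum((t - w) ** 2) + (c * p - b) ** 2
            if best is None or d < best[3]:
                best = (n, m, p, d)
    return best

# A unit whose weights roughly encode AND(I1, I2): template bias (1+2-4)p = -p.
n, m, p_star, dist = extract_simple_rule([5.0, 5.0, 0.1, -0.2], -4.8)
assert (n, m) == (2, 2)
```

In practice the distance d of the winning template also feeds the validity checks described next.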
For example, we can define the extraction error as 100% × ||t* - w||^2 / ||w||^2 to measure how well the nearest symbolic rule fits the actual weights. We can also examine the value of p* used in t*. Small values of p* translate into activation levels in the linear regime of the sigmoid functions, compromising the assumption of Boolean outputs propagating to subsequent inputs.

2.3 Extending expressiveness

While the n-of-m expressions treated thus far are fairly powerful, there is an interesting class of symbolic behaviors that cannot be captured by simple n-of-m expressions. The simplest example of this type of behavior may be seen in the single hidden unit version of xor described in [6]. In this 2-1-1 network the hidden unit H learns the expression AND(I1, I2), while the output unit (which connects to the two inputs as well as to the hidden unit) learns the expression AND[OR(I1, I2), ¬H]. This latter expression may be viewed as a nested or recursive form of n-of-m expression, one where some of the m subexpressions may themselves be n-of-m expressions. The following two forms of recursive n-of-m expressions are linearly separable and are thus computable by a single sigmoidal unit:

OR [C_n-of-m, C_OR]
AND [C_n-of-m, C_AND]

where C_n-of-m is a nested n-of-m expression (1 ≤ n ≤ m)
      C_OR    is a nested OR expression (n = 1)
      C_AND   is a nested AND expression (n = m)

These expressions may be seen to generalize simple n-of-m expressions in the same way that simple n-of-m expressions generalize basic disjunctions and conjunctions.¹ We term the above forms augmented n-of-m expressions because they extend simple n-of-m expressions with additional disjuncts or conjuncts.
Templates for these expressions (under the -1/+1 input representation) may be efficiently generated and instantiated using a procedure similar to that described for simple n-of-m expressions. When augmented expressions are included in the search, the total number of templates required grows by a factor of O(k) over the count for simple n-of-m expressions, but it is still polynomial in k and is quite manageable for many problems. (Values of T for k=10 and k=20 are 150 and 1250, respectively.) A more detailed treatment of augmented n-of-m expressions is given in [1].

3 RELATED WORK

Here we briefly consider two alternative systems for connectionist rule extraction. Many other methods have been developed; a recent summary and categorization appears in [2].

3.1 McMillan

McMillan described the projection of actual weights to simple weight templates in [4]. McMillan's parameter selection and instantiation procedures are inefficient compared to those described here, though they yield equivalent results for the classes of templates he used. McMillan treated only expressions with m ≤ 2 and no negated subexpressions.

¹ In fact the nesting may continue beyond one level. Thus sigmoidal units can compute expressions like OR[AND(C_n-of-m, C_AND), C_OR]. We have not yet experimented with extensions of this sort.

3.2 Towell and Shavlik

Towell and Shavlik [8] use a domain theory to initialize a connectionist network, train the network on a set of labeled examples, and then extract rules that describe the network's behavior. To perform rule extraction, Towell and Shavlik first group weights using an iterative clustering algorithm. After applying additional training, they typically check each training pattern against each weight group and eliminate groups that do not affect the classification of any pattern.
Finally, they scan remaining groups and attempt to express a rule in purely symbolic n-of-m form. However, in many cases the extracted rules take the form of a linear inequality involving multiple numeric quantities. For example, the following rule was extracted from part of a network trained on the promoter recognition task [5] from molecular biology:

Minus35 ← -10 < + 5.0 * nt(@-37 '--T-G--A')
                + 3.1 * nt(@-37 '---GT---')
                + 1.9 * nt(@-37 '----C-CT')
                + 1.5 * nt(@-37 '---C--A-')
                - 1.5 * nt(@-37 '------GC')
                - 1.9 * nt(@-37 '--CAW---')
                - 3.1 * nt(@-37 '--A----C')

where nt() returns the number of true subexpressions, @-37 locates the subexpressions on the DNA strand, and '-' indicates a don't-care subexpression.

Towell and Shavlik's method can be expected to give more accurate results than our approach, but at a cost. Their method is very compute intensive and relies substantially on access to a fixed set of training patterns. Additionally, it is not clear that their rules are completely symbolic. While numeric expressions were convenient for the domains they studied, in applications where one is interested in more abstract descriptions, such expressions may be viewed as providing too much detail, and may be difficult for people to interpret and reason about. Sometimes one wants to determine the nearest symbolic interpretation of unit behavior rather than a precise mathematical description. Our method offers a simpler paradigm for doing this. Given these differences, we conclude that both methods have their place in rule extraction tool kits.

4 SIMULATIONS

4.1 Simple logic problems

We used a group of simple logic problems to verify that our extraction algorithms could produce a correct set of rules for networks trained on the complete pattern space of each function.
Table 1 summarizes the results.² The rule-plus-exception problem is defined as f = AB + ¬A¬BCD; xor-1 is the 2-1-1 version of xor described in Section 2.3; and xor-2 is a strictly layered (2-2-1) version of xor [6]. The negation problem is also described in [6]; in this problem one of the four inputs controls whether the other inputs appear normally or negated at the outputs. (As with xor-1, the network for negation makes use of direct input/output connections.) In addition to the perfect classification performance of the rules, the large values of p* and small values of extraction error (as defined in Section 2.2) provide evidence that the extraction process is very accurate.

Problem              Network   Hidden Unit   Average p*        Extraction Error   Patterns Correctly
                     Topology  Penalty Term  Hidden   Output   Hidden   Output    Classified by Rules
rule-plus-exception  4-2-1     -             2.72     5.68     0.8%     0.1%      100.0%
xor-1                2-1-1     -             4.34     6.15     0.4%     1.3%      100.0%
xor-2                2-2-1     -             4.40     5.68     0.1%     1.0%      100.0%
negation             4-3-4     activation    5.40     5.17     0.2%     2.2%      100.0%

Table 1: Simulation summary for simple logic problems

Symbolic solutions for these problems often come in forms different from the canonical form of the function. For example, the following rules for the rule-plus-exception problem show a level of negation within the network:

H1 = OR (A, B, ¬C, ¬D)
H2 = AND (A, B)
O  = OR (¬H1, H2)

Example results on xor-1 show the expected use of an augmented n-of-m expression:

H  = OR (I1, I2)
¬O = OR [AND (I1, I2), ¬H]
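The nested xor-1 solution of Section 2.3, with hidden unit AND(I1, I2) and output AND[OR(I1, I2), ¬H], can be checked exhaustively (a sketch of our own; Boolean values stand in for the unit activations):

```python
from itertools import product

def xor_net(i1, i2):
    h = i1 and i2                    # hidden unit: AND(I1, I2)
    return (i1 or i2) and not h     # output unit: AND[OR(I1, I2), not H]

for i1, i2 in product([False, True], repeat=2):
    assert xor_net(i1, i2) == (i1 != i2)
```

The nested term is what makes this computable by a single sigmoidal output unit despite xor itself not being linearly separable in the raw inputs.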

4.2 The MONK's problems

We tested generalization performance using the MONK's problems [5, 7], a set of three problems used to compare a variety of symbolic and connectionist learning algorithms. A summary of these tests appears in Table 2. Our performance was equal to or better than all of the systems tested in [7] for the monks-1 and monks-2 problems. Moreover, the rules extracted by our algorithm were very concise and easy to understand, in contrast to those produced by several of the symbolic systems. (The two connectionist systems reported in [7] were opaque, i.e., no rules were extracted.) As an example, consider the following output for the monks-2 problem:

H1 = 2 of (head_shape round, body_shape round, is_smiling yes,
           holding sword, jacket_color red, has_tie yes)
H2 = 3 of (head_shape round, body_shape round, is_smiling yes,
           holding sword, jacket_color red, has_tie not no)
O  = AND (H1, ¬H2)

The target concept for this problem is exactly 2 of the attributes have their first value. These rules demonstrate an elegant use of n-of-m expressions to describe the idea of "exactly 2" as "at least 2 but not 3". The monks-3 problem is difficult due to (intentional) training set noise, but our results are comparable to the other systems tested in [7].

² All results in this paper are for networks trained using batch-mode back propagation on the cross-entropy error function. Training was stopped when outputs were within 0.05 of their target values for each pattern or a fixed number of epochs (typically 1000) was reached. Where indicated, a penalty term for non-Boolean hidden activations or hidden weight decay was added to the main error function. For the breast cancer problem shown in Table 3, hidden rules were extracted first and the output units were retrained briefly before extracting their rules.
Results for the problems in Table 3 used leave-one-out testing or 10-fold cross-validation (with 10 different initial orderings) as indicated. All results are averages over 10 replications with different initial weights.

Problem  Network   Hidden Unit   Training Set                 Test Set
         Topology  Penalty Term  # of      Perf. of  Perf. of  # of      Perf. of  Perf. of
                                 Patterns  Network   Rules     Patterns  Network   Rules
monks-1  17-3-1    decay         124       100.0%    100.0%    432       100.0%    100.0%
monks-2  17-2-1    decay         169       100.0%    100.0%    432       100.0%    100.0%
monks-3  17-0-1    -             122       93.4%     93.4%     432       97.2%     97.2%

Table 2: Simulation summary for the MONK's problems

4.3 UCI repository problems

The final set of simulations addresses extraction performance on three real-world databases from the UCI repository [5]. Table 3 shows that good results were achieved. For the promoters task, we achieved generalization performance of nearly 88%, compared to 93-96% reported by Towell and Shavlik [8]. However, our results are impressive when viewed in light of the simplicity and comprehensibility of the extracted output. While Towell and Shavlik's results for this task included 5 rules like the one shown in Section 3.2, our single rule is quite simple:

promoter = 5 of (@-45 'AA-------TTGA-A-----T------T-----AAA----C')

Results for the house-votes-84 and breast-cancer-wisc problems are especially noteworthy since the generalization performance of the rules is virtually identical to that of the raw networks. This indicates that the rules are capturing a significant portion of the computation being performed by the networks.
The following rule was the one most frequently extracted for the house-votes-84 problem, where the task is to predict party affiliation:

Democrat = OR [5 of (V3, V7, V9, V10, V11, V12), ¬V4]

where V3  = voted for adoption-of-the-budget-resolution bill
      V4  = voted for physician-fee-freeze bill
      V7  = voted for anti-satellite-test-ban bill
      V9  = voted for mx-missile bill
      V10 = voted for immigration bill
      V11 = voted for synfuels-corporation-cutback bill
      V12 = voted for education-spending bill

Shown below is a typical rule set extracted for the breast-cancer-wisc problem. Here the goal is to diagnose a tumor as benign or malignant based on nine clinical attributes.

Malignant = AND (H1, H2)
H1 = 4 of (thickness > 3, size > 1, adhesion > 1, epithelial > 5,
           nuclei > 3, chromatin > 1, normal > 2, mitoses > 1)
H2 = 3 of (thickness > 6, size > 1, shape > 1, epithelial > 1,
           nuclei > 8, normal > 9)
H3 = not used

As suggested by the rules, we used a thermometer (cumulative) coding of the nominally valued attributes so that less-than or greater-than subexpressions could be efficiently represented in the hidden weights. Such a representation is often useful in diagnosis tasks. We also limited the hidden weights to positive values due to the nature of the attributes.

Problem             Network   Training Set                 Test Set
                    Topology  # of      Perf. of  Perf. of  # of      Perf. of  Perf. of
                              Patterns  Network   Rules     Patterns  Network   Rules
promoters           228-4-1   105       100.0%    95.9%     1         94.2%     87.6%
house-votes-84      16-4-1    387       97.3%     96.2%     43        95.7%     95.9%
breast-cancer-wisc  81-3-1    630       98.5%     96.3%     70        95.8%     95.2%

Table 3: Simulation summary for UCI repository problems

Taken as a whole our simulation results are encouraging, and we are conducting further research on rule extraction for more complex tasks.

5 CONCLUSION

We have described a general approach for extracting various types of n-of-m symbolic rules from trained networks of sigmoidal units, assuming approximately Boolean activation behavior. While other methods for interpretation of this sort exist, ours represents a valuable price/performance point, offering easily-understood rules and good extraction performance with computational complexity that scales well with the expressiveness desired. The basic principles behind our approach may be flexibly applied to a wide variety of problems.

References

[1] Alexander, J. A. (1994). Template-based procedures for neural network interpretation. MS Thesis. Department of Computer Science, University of Colorado, Boulder, CO.

[2] Andrews, R., Diederich, J., and Tickle, A. B. (1995). A survey and critique of techniques for extracting rules from trained artificial neural networks. To appear in Fu, L. M. (Ed.), Knowledge-Based Systems, Special Issue on Knowledge-Based Neural Networks.

[3] Mangasarian, O. L. and Wolberg, W. H. (1990). Cancer diagnosis via linear programming. SIAM News 23:5, pages 1 & 18.

[4] McMillan, C. (1992). Rule induction in a neural network through integrated symbolic and subsymbolic processing. PhD Thesis. Department of Computer Science, University of Colorado, Boulder, CO.

[5] Murphy, P. M. and Aha, D. W. (1994). UCI repository of machine learning databases. [Machine-readable data repository]. Irvine, CA: University of California, Department of Information and Computer Science. Monks data courtesy of Sebastian Thrun, promoters data courtesy of M. Noordewier and J.
Shavlik, congressional voting data courtesy of Jeff Schlimmer, breast cancer data courtesy of Dr. William H. Wolberg (see also [3] above).

[6] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E., McClelland, J. L., and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, pages 318-362. Cambridge, MA: MIT Press.

[7] Thrun, S. B., and 23 other authors (1991). The MONK's problems - A performance comparison of different learning algorithms. Technical Report CS-CMU-91-197. Carnegie Mellon University, Pittsburgh, PA.

[8] Towell, G. and Shavlik, J. W. (1992). Interpretation of artificial neural networks: Mapping knowledge-based neural networks into rules. In Moody, J. E., Hanson, S. J., and Lippmann, R. P. (Eds.), Advances in Neural Information Processing Systems, 4:977-984. San Mateo, CA: Morgan Kaufmann.
", "award": [], "sourceid": 915, "authors": [{"given_name": "Jay", "family_name": "Alexander", "institution": null}, {"given_name": "Michael", "family_name": "Mozer", "institution": null}]}