{"title": "GDS: Gradient Descent Generation of Symbolic Classification Rules", "book": "Advances in Neural Information Processing Systems", "page_first": 1093, "page_last": 1100, "abstract": null, "full_text": "GDS: Gradient Descent Generation of Symbolic Classification Rules\n\nReinhard Blasig\n\nKaiserslautern University, Germany\n\nPresent address: Siemens AG, ZFE ST SN 41, 81730 M\u00fcnchen, Germany\n\nAbstract\n\nImagine you have designed a neural network that successfully learns a complex classification task. What are the relevant input features the classifier relies on, and how are these features combined to produce the classification decisions? There are applications where a deeper insight into the structure of an adaptive system, and thus into the underlying classification problem, may well be as important as the system's performance characteristics, e.g. in economics or medicine. GDS1 is a backpropagation-based training scheme that produces networks transformable into an equivalent and concise set of IF-THEN rules. This is achieved by imposing penalty terms on the network parameters that adapt the network to the expressive power of this class of rules. Thus during training we simultaneously minimize classification and transformation error. Some real-world tasks demonstrate the viability of our approach.\n\n1 Introduction\n\nThis paper deals with backpropagation networks trained to perform a classification task on Boolean or real-valued data. Given such a classification task, in most cases it is not too difficult to devise a network architecture that is capable of learning the input-output relation as represented by a number of training examples. Once training is finished, one has a black box which often does a quite good job not only on the training patterns but also on some previously unseen test patterns.\n\n1Gradient Descent Symbolic Rule Generation\n\n
A good generalization performance indicates that the network has grasped part of the structure inherent in the classification task. The net has figured out which input features are relevant to a classification decision and which are not. It has also modelled the way the relevant features have to be combined in order to produce the classifying output. In many applications it is important to get an understanding of this information hidden inside the neural network. Not only does this help to create or verify a domain theory; the analysis of this information may also serve human experts to determine when and in what way the classifier will fail.\n\nIn order to explicate the network's implicit information, we transform it into a set of rules. This idea is not new, cf. (Saito and Nakano, 1988), (Bochereau and Bourgine, 1990), (Hayashi, 1991) and (Towell and Shavlik, 1991). In contrast to these approaches, which extract rules after BP training is finished, we apply penalty terms during training to adapt the network's expressive power to that of the rules we want to generate. Consequently the net will be transformable into an equivalent set of rules.\n\nDue to their good comprehensibility, we restrict the rules to be of the form IF <premise> THEN <conclusion>, where the premise as well as the conclusion are Boolean expressions. To actually make the transformation, two problems have to be solved:\n\n\u2022 Neural nets are well known for their distributed representation of information; so in order to transform a net into a concise and comprehensible rule set, one has to find a way of condensing this information without substantially changing it.\n\n\u2022 In the case of backpropagation networks, a continuous activation function determines a node's output depending on its activation. However, the dynamics of this function has no counterpart in the context of rule-based descriptions.
\n\nWe address these problems by introducing a penalty function E_P, which we add to the classification error E_C, yielding the total backpropagation error\n\nE_T = E_C + \u03bb * E_P.   (1)\n\n2 The Penalty Term\n\nThe term E_P is intended to have two effects on the network weights. First, by a weight decay component it aims at reducing network complexity by pushing a (hopefully large) fraction of the weights to 0. The smaller the net, the more concise the rules describing its behavior will be. As a positive side effect, this component will tend to act as a form of \"Occam's razor\": simple networks are more likely to exhibit good generalization than complex ones. Secondly, the penalty term should minimize the error caused by transforming the network into a set of rules. Adopting the common approach that each non-input neuron represents one rule, there would be no transformation error if the neurons' activation functions were threshold functions; the Boolean node output would then indicate whether the conclusion is drawn or not. But since backpropagation neurons use continuous activation functions like y = tanh(x) to transform their activation value x into the output value y, we are left with the difficulty of interpreting the continuous output of a neuron. Thus our penalty term will be designed to produce a high penalty for those neurons of the backpropagation net whose behavior cannot be well approximated by threshold neurons, because their activation values are likely to fall into the nonsaturated region of the tanh function2.\n\nFigure 1: We regard |x| > 3 with |y| = |tanh(x)| > 0.9 as the regions where a sigmoidal neuron can be approximated by a threshold neuron. The nonsaturated region is marked by the dashed box.
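The saturation criterion of Figure 1 is easy to state in code. The following Python sketch (my illustration, not code from the paper; the function names are hypothetical) checks whether a tanh unit is well approximated by a threshold unit:

```python
import math

def threshold_like(x, limit=3.0):
    # a tanh unit with activation x behaves like a threshold unit
    # once it is saturated, i.e. |x| > 3 (cf. Figure 1)
    return abs(x) > limit

def threshold_output(x):
    # the threshold unit the rule transformation assumes: sign of x
    return 1.0 if x > 0 else -1.0

# saturated activations are approximated well, unsaturated ones are not
for x in (-4.2, 3.5):
    assert threshold_like(x)
    assert abs(math.tanh(x) - threshold_output(x)) < 0.1
assert not threshold_like(0.5)
```

Since tanh(3) is already about 0.995, activations beyond the cutoff differ from the threshold output by less than 0.01, which is what makes the transformation error small for saturated units.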
\n\nFor a better understanding of our penalty term, one has to be aware of the fact that IF-THEN rules with a Boolean premise and conclusion are essentially Boolean functions. It can easily be shown that any such function can be calculated by a network of threshold neurons, provided there is one (sufficiently large) hidden layer. This is still true if we restrict connection weights to the values {-1, 0, 1} and node thresholds to be integers (Hertz, Krogh and Palmer, 1991). In order to transfer this scenario to nets with sigmoidal activation functions, and having in mind that the absolute activation values of the sigmoidal neurons should always exceed 3 (see figure 1), we require the nodes' biases to be odd multiples of \u00b13 and the weights w_ji to obey\n\nw_ji \u2208 {-6, 0, 6}.   (2)\n\nWe shortly comment on the practical problem that sometimes bias values as large as \u00b16m_i (m_i being the fan-in of node i) may be necessary to implement certain Boolean functions. This may slow down or even block the learning process. A simple solution to this problem is to use some additional input units with a constant output of +1. If the connections to these units are also subject to the penalty function E_P, it is sufficient to restrict the bias values to\n\nb_i \u2208 {-3, 3}.   (3)\n\n2We have to point out that the conversion of sigmoidal neurons to threshold neurons will reduce the net's computational power: there are Boolean functions which can be computed by a net of sigmoidal neurons, but which exceed the capacity of a threshold net of the same topology (Maass, Schnitger and Sontag, 1991). Note that the objective to use threshold units is a consequence of the decision to search for rules of the type IF <premise> THEN <conclusion>. A failure of the net to simultaneously minimize both parts of the error measure may indicate that other rule types are more adequate to handle the given classification task.
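To make the restricted parameter set of (2) and (3) concrete, here is a minimal sketch (mine, under the stated restrictions, not the paper's code) of a sigmoidal unit whose weights come from {-6, 0, 6} and whose bias is an odd multiple of \u00b13, computing a Boolean AND on \u00b11-coded inputs:

```python
import math

def unit(inputs, weights, bias):
    # GDS-restricted sigmoidal unit: weights from {-6, 0, 6},
    # bias an odd multiple of +-3; Boolean inputs coded as +-1
    x = bias + sum(w * i for w, i in zip(weights, inputs))
    return math.tanh(x)

# AND of two inputs: the activation is +3 only when both inputs
# are +1, otherwise at most -9, so the unit is always saturated
assert unit([1, 1], [6, 6], -9) > 0.9
assert unit([1, -1], [6, 6], -9) < -0.9
assert unit([-1, -1], [6, 6], -9) < -0.9
```

Because every reachable activation lies outside the nonsaturated box of Figure 1, replacing tanh by a hard threshold does not change the Boolean function this unit computes.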
\n\nNow we can define penalty functions that push the biases and weights to the desired values. Obviously E_b (the bias penalty) and E_w (the weight penalty) have to be different:\n\nE_b(b_i) = |3 - |b_i||   (4)\n\nE_w(w_ji) = |6 - |w_ji||  for |w_ji| \u2265 \u0398,\nE_w(w_ji) = |w_ji|        for |w_ji| < \u0398.   (5)\n\nThe parameter \u0398 determines whether a weight should be subject to decay or pushed to attain the value 6 (or -6, respectively). Figure 2 displays the graphs of the penalty functions.\n\nFigure 2: The penalty functions E_b and E_w.\n\nThe value of \u0398 is chosen with the objective that only those weights should exceed this value which almost certainly have to be nonzero to solve the given classification task. Since we initialize the network with weights uniformly distributed in the interval [-0.5, 0.5], \u0398 = 1.5 works well at the beginning of the training process. The penalty term then has the effect of a pure weight decay. When learning proceeds and the weights converge, we can slowly reduce the value of \u0398, because superfluous weights will already have decayed. So after each sequence of 100 training patterns, say, we decrease \u0398 by a factor of 0.995.\n\nObservation shows that weights which once exceeded the value of \u0398 quickly reach 6 or -6, and that there are relatively few cases where a large weight is reduced again to a value smaller than \u0398. Accordingly, the number of weights in {-6, 6} successively grows in the course of learning, and the criterion to stop training thus influences the number of nonzero weights.\n\nThe end of training is determined by means of cross-validation. However, we do not examine the cross-validation performance of the trained net, but that of the corresponding rule set. This is accomplished by calculating the performance of the original net with all weights and biases replaced by their optimal values according to (2) and (3).
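The penalty terms (4) and (5) and the \u0398 schedule translate directly into code; the following is a sketch under my own naming (bias_penalty, weight_penalty and anneal are hypothetical names, not from the paper):

```python
def bias_penalty(b):
    # E_b(b_i) = |3 - |b_i||: zero exactly at b = +-3  (eq. 4)
    return abs(3.0 - abs(b))

def weight_penalty(w, theta):
    # E_w(w_ji), eq. 5: weights below theta decay toward 0,
    # weights at or above theta are pushed toward +-6
    return abs(w) if abs(w) < theta else abs(6.0 - abs(w))

def anneal(theta, factor=0.995):
    # after every 100 training patterns theta shrinks, so only
    # weights that survived the decay phase get pushed to +-6
    return theta * factor

theta = 1.5                                # initial value used in the paper
assert bias_penalty(-3.0) == 0.0
assert weight_penalty(0.2, theta) == 0.2   # decay regime
assert weight_penalty(5.0, theta) == 1.0   # pushed toward 6
```

Both penalties are piecewise linear, so their gradients are constant almost everywhere, which keeps the extra backpropagation term cheap.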
\n\nThe weighting factor \u03bb of the penalty term (see equation 1) is critical for good learning performance. We pursued the strategy of starting learning with \u03bb = 0, so that the network parameters first move into a region where the classification error is small. If this error falls below a prespecified tolerance level L, \u03bb is incremented by 0.001. The factor \u03bb goes down by the same amount when the error grows larger than L3. By adjusting the weighting factor every 100 training patterns, we keep the classification error close to the tolerance level. The choice of L of course depends on the learning task. As a heuristic, L should be slightly larger than the classification error attainable by a non-penalized network.\n\n3 Splice-Junction Recognition\n\nThe DNA, carrying the genetic information of biological cells, can be thought of as composed of two types of subsequences: exons and introns. The task is to classify each DNA position as either an exon-to-intron transition (EI), an intron-to-exon transition (IE) or neither (N). The only information available is a sequence of 30 nucleotides (A, C, G or T) before and 30 nucleotides after the position to be classified. Splice-junction recognition is a classification task that has already been investigated by a number of machine learning researchers using various adaptive models.\n\nThe pattern reservoir contains about 3200 DNA samples, 30% of which were used for training, 10% for cross-validation and 60% for testing. Since we used a grandmother-cell coding for the input DNA sequence, the network has an input layer of 4*60 neurons. With a hidden layer of 20 neurons4 and two output units for the classes EI and IE, this amounts to about 5000 free parameters. The following table compares the classification performance of our penalty term approach and other machine learning algorithms, cf. 
(Murphy and Aha, 1992).\n\nTable 1: Splice-junction recognition: error (in percent) of various machine learning algorithms\n\nalgorithm         EI     IE     N      total\nKBANN             7.56   8.47   4.62   6.32\nGDS               9.24   4.43   6.71   6.75\nBackprop          5.74   10.75  5.29   6.77\nPerceptron        16.32  17.41  3.99   10.43\nID3               10.58  13.99  8.84   10.56\nNearest Neighbor  11.65  9.09   31.11  20.74\n\nSurprisingly, the GDS network turned out to be very small. The weight decay component of our penalty term managed to push all but 61 weights to zero, making use of only three hidden neurons. Thus in addition to performing very well, the network is transformable into a concise rule set, as follows5:\n\nhidden(2):  at least 4 nucleotides match sequence @1:  'GTAXG'\nhidden(11): at least 3 nucleotides match sequence @-3: 'YAG'\nhidden(17): at least 1 nucleotide matches sequence @-1: 'GG'\n\nclass EI: hidden(2) AND hidden(11)\nclass IE: NOT(hidden(2)) AND hidden(17)\n\n3Negative \u03bb-values are not allowed.\n4A reasonable size, considering the experiments described in (Shavlik et al., 1991).\n5We adopt a notation commonly used in this domain: @n denotes the position of the first nucleotide in the given sequence being left (negative n) or right (positive n) of the point to be classified. Nucleotide 'Y' stands for ('C' or 'T'), 'X' is any of {A, C, G, T}. Consequently, e.g. neuron hidden(2) is active iff at least four of the five nucleotides of the sequence 'GTAXG' are identical to the input pattern at positions 1 to 5 right of the possible splice junction.\n\n4 Prediction of Interest Rates\n\nThis is an application where the network input is a vector of real numbers. Since our approach can only handle binary input, we supplement the net with a discretization layer that provides a thermometer code representation (Hancock, 1988) of the continuous-valued input. 
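A thermometer code of the kind used in the discretization layer can be sketched in a few lines (my illustration; the cut points below are hypothetical, whereas in GDS the corresponding discretization intervals are adapted by backpropagation):

```python
def thermometer(x, cuts):
    # thermometer code: unit i fires iff x exceeds the i-th cut
    # point; with two cuts per input dimension this yields the
    # 2 * 14 = 28 discretization neurons of the interest-rate net
    return [1 if x > c else 0 for c in sorted(cuts)]

# a normalized input value in [-1, 1] and two hypothetical cuts
assert thermometer(0.3, [0.5, -0.2]) == [1, 0]
assert thermometer(0.7, [0.5, -0.2]) == [1, 1]
assert thermometer(-0.8, [0.5, -0.2]) == [0, 0]
```

Unlike a one-hot code, consecutive intervals differ in a single bit, so a rule premise such as "x > cut" maps onto exactly one Boolean input of the network.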
In contrast to pure Boolean learning algorithms (Goodman, Miller and Smyth, 1989), (Mezard and Nadal, 1989), which can also be endowed with discretization facilities, here the discretization process is fully integrated into the learning scheme, as the discretization intervals will be adapted by the backpropagation algorithm.\n\nThe data comprise a total of 226 patterns, which we distribute randomly over three sets: training set (60%), cross-validation set (20%) and test set (20%). The input represents the monthly development of 14 economic time series during the last 19 years. The Boolean target indicates whether the interest rates will go up or down during the six months succeeding the reference month6. The time series include, among others, the month of the year, the income of private households and the amount of German foreign investments. For some time series it is useful not to take the raw feature measurements as input, but the difference between two succeeding measurements; this is advantageous if the underlying time series shows only small changes relative to its absolute values. All series were normalized to have values in the range from -1 to +1.\n\nWe used a network containing a discretization layer of two neurons per input dimension. So there are 28 discretization neurons, which are fully connected to the 10 hidden nodes. The output layer consists of a single neuron. Since our data set is relatively small, the intention to obtain simple rules is motivated not only by the objective of comprehensibility, but also by the notion that we cannot expect a large rule set to be justified by a small amount of training data. In fact, during training 90% of the weights were set to zero and three hidden units proved to be sufficient for this task. Nevertheless the prediction error on the test set could be reduced to 25%. 
This compares to an error rate of about 20% attainable by a standard backpropagation network with one hidden layer of ten neurons and no input discretization. We thus sacrificed 5% of prediction performance to obtain a very compact net that can be easily transformed into a set of rules. Some of the generated rules are shown below. The first rule, e.g., states that interest rates will rise if private income increases AND foreign investments decrease by a certain amount during the reference month.\n\nIf the rules produce contradicting predictions for a given input, the final decision will be made according to a majority vote. A tie is broken by the bias value of the output unit, which states that by default interest rates will rise.\n\nIF (at least 2 of { increase of private income < 0.73%,\n                    decrease of foreign investments < 64 Mrd DM })\nTHEN (interest rates will rise)\nELSE (interest rates will fall).\n\nIF (at least 3 of { increase of business climate estimate < 1.76%,\n                    treasury bond yields (11 months ago) > 7.36%,\n                    treasury bond yields (12 months ago) > 8.2%,\n                    increase of foreign investments < 60 Mrd DM })\nTHEN (interest rates will fall)\nELSE (interest rates will rise).\n\n6I.e. the month where the input data has been measured.\n\n5 Conclusion and Future Work\n\nGDS is a learning algorithm that utilizes a penalty term in order to prepare a backpropagation network for rule extraction. The term is designed to have two effects on the network's weights:\n\n\u2022 By a weight decay component, the number of nonzero weights is reduced: thus we get a net that can hopefully be transformed into a concise and comprehensible rule set.\n\n\u2022 The penalty term encourages weight constellations that keep the node activations out of the nonsaturated part of the activation function. 
This is motivated by the fact that rules of the type IF <premise> THEN <conclusion> can only mimic the behavior of threshold units.\n\nThe important point is that our penalty function adapts the net to the expressive power of the type of rules we wish to obtain. Consequently, we are able to transform the network into an equivalent rule set. The applicability of GDS was demonstrated on two tasks: splice-junction recognition and the prediction of German interest rates. In both cases the generated rules not only showed a generalization performance close to or even superior to what can be attained by other machine learning approaches such as MLPs or ID3; they also proved to be very concise and comprehensible. This is even more remarkable since both applications represent real-world tasks with a large number of inputs.\n\nClearly the applied penalty terms impose severe restrictions on the network parameters: besides minimizing the number of nonzero weights, the weights are restricted to a small set of distinct values. Last but not least, the simplification of sigmoidal to threshold units also affects the net's computational power. There are applications where such a strong bias may negatively influence the net's learning capabilities. Furthermore, our current approach is only applicable to tasks with binary target patterns. These limitations can be overcome by dealing with more general rules than those of the Boolean IF-THEN type. Future work will go in this direction.\n\nAcknowledgements\n\nI wish to thank Hans-Georg Zimmermann and Ferdinand Hergert for many useful discussions and for providing the data on interest rates, and Patrick Murphy and David Aha for providing the UCI Repository of ML databases. This work was supported by a grant of the Siemens AG, Munich.\n\nReferences\n\nL. Bochereau, P. Bourgine. 
(1990) Extraction of Semantic Features and Logical Rules from a Multilayer Neural Network. Proceedings of the 1990 IJCNN, Washington DC, Vol. II, 579-582.\nR.M. Goodman, J.W. Miller, P. Smyth. (1989) An Information Theoretic Approach to Rule-Based Connectionist Expert Systems. Advances in Neural Information Processing Systems 1, 256-263. San Mateo, CA: Morgan Kaufmann.\nP.J.B. Hancock. (1988) Data Representation in Neural Nets: an Empirical Study. Proc. Connectionist Summer School.\nY. Hayashi. (1991) A Neural Expert System with Automated Extraction of Fuzzy If-Then Rules and its Application to Medical Diagnosis. Advances in Neural Information Processing Systems 3, 578-584. San Mateo, CA: Morgan Kaufmann.\nJ. Hertz, A. Krogh, R.G. Palmer. (1991) Introduction to the Theory of Neural Computation. Addison-Wesley.\nC.M. Higgins, R.M. Goodman. (1991) Incremental Learning with Rule-Based Neural Networks. Proceedings of the 1991 IEEE INNS International Joint Conference on Neural Networks, Seattle, Vol. I, 875-880.\nM. Mezard, J.-P. Nadal. (1989) Learning in Feedforward Layered Networks: The Tiling Algorithm. J. Phys. A: Math. Gen. 22, 2191-2203.\nW. Maass, G. Schnitger, E.D. Sontag. (1991) On the Computational Power of Sigmoids versus Boolean Threshold Circuits. Proceedings of the 32nd Annual IEEE Symposium on Foundations of Computer Science, 767-776.\nP.M. Murphy, D.W. Aha. (1992) UCI Repository of machine learning databases [ftp site: ics.uci.edu: pub/machine-learning-databases]. Irvine, CA: University of California, Department of Information and Computer Science.\nJ.R. Quinlan. (1986) Induction of Decision Trees. Machine Learning, 1:81-106.\nK. Saito, R. Nakano. (1988) Medical diagnostic expert systems based on PDP model. Proc. IEEE International Conference on Neural Networks, Vol. I, 255-262.\nV. Tresp, J. Hollatz, S. Ahmad. (1993) Network Structuring and Training Using Rule-Based Knowledge. 
Advances in Neural Information Processing Systems 5, 871-878. San Mateo, CA: Morgan Kaufmann.\nG.G. Towell, J.W. Shavlik. (1991) Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences. In: Lippmann, Moody, Touretzky (eds.), Advances in Neural Information Processing Systems 3, 530-536. San Mateo, CA: Morgan Kaufmann.\n", "award": [], "sourceid": 774, "authors": [{"given_name": "Reinhard", "family_name": "Blasig", "institution": null}]}