{"title": "Rule Induction through Integrated Symbolic and Subsymbolic Processing", "book": "Advances in Neural Information Processing Systems", "page_first": 969, "page_last": 976, "abstract": null, "full_text": "Rule Induction through Integrated Symbolic and Subsymbolic Processing \n\nClayton McMillan, Michael C. Mozer, Paul Smolensky \n\nDepartment of Computer Science and Institute of Cognitive Science \nUniversity of Colorado \nBoulder, CO 80309-0430 \n\nAbstract \n\nWe describe a neural network, called RuleNet, that learns explicit, symbolic condition-action rules in a formal string manipulation domain. RuleNet discovers functional categories over elements of the domain, and, at various points during learning, extracts rules that operate on these categories. The rules are then injected back into RuleNet and training continues, in a process called iterative projection. By incorporating rules in this way, RuleNet exhibits enhanced learning and generalization performance over alternative neural net approaches. By integrating symbolic rule learning and subsymbolic category learning, RuleNet has capabilities that go beyond a purely symbolic system. We show how this architecture can be applied to the problem of case-role assignment in natural language processing, yielding a novel rule-based solution. \n\n1 INTRODUCTION \n\nWe believe that neural networks are capable of more than pattern recognition; they can also perform higher cognitive tasks which are fundamentally rule-governed. Further, we believe that they can perform higher cognitive tasks better if they incorporate rules rather than eliminate them. A number of well-known cognitive models, particularly of language, have been criticized for going too far in eliminating rules in fundamentally rule-governed domains. 
We argue that with a suitable choice of high-level, rule-governed task, representation, processing architecture, and learning algorithm, neural networks can represent and learn rules involving higher-level categories while simultaneously learning those categories. The resulting networks can exhibit better learning and task performance than neural networks that do not incorporate rules, and have capabilities that go beyond those of a purely symbolic rule-learning algorithm. \n\nWe describe an architecture, called RuleNet, which induces symbolic condition-action rules in a string mapping domain. In the following sections we describe this domain, the task and network architecture, simulations that demonstrate the potential for this approach, and finally, future directions of the research leading toward more general and complex domains. \n\n2 DOMAIN \n\nWe are interested in domains that map input strings to output strings. A string consists of n slots, each containing a symbol. For example, the string abcd contains the symbol c in slot 3. The domains we have studied are intrinsically rule-based, meaning that the mapping function from input to output strings can be completely characterized by explicit, mutually exclusive condition-action rules. These rules are of the general form \"if certain symbols are present in the input then perform a certain mapping from the input slots to the output slots.\" The conditions do not operate directly on the input symbols, but rather on categories defined over the input symbols. Input symbols can belong to multiple categories. For example, the words boy and girl are instances of the higher-level category HUMAN. We denote instances with lowercase bold font, and categories with uppercase bold font. 
It should be apparent from context whether a letter string refers to a single instance, such as boy, or a string of instances, such as abcd. \n\nThree types of conditions are allowed: 1) a simple condition, which states that an instance of some category must be present in a particular slot of the input string, 2) a conjunction of two simple conditions, and 3) a disjunction of two simple conditions. A typical condition might be that an instance of the category W must be present in slot 1 of the input string and an instance of category Y must be present in slot 3. \n\nThe action performed by a rule produces an output string in which the content of each slot is either a fixed symbol or a function of a particular input slot, with the additional constraint that each input slot maps to at most one output slot. In the present work, this function of the input slots is the identity function. A typical action might be to switch the symbols in slots 1 and 2 of the input, replace slot 3 with the symbol a, and copy slot 4 of the input to the output string unchanged, e.g., abcd → baad. \n\nWe call rules of this general form second-order categorical permutation (SCP) rules. The number of rules grows exponentially with the length of the strings and the number of input symbols. An example of an SCP rule for strings of length four is: \n\nif (input1 is an instance of W and input3 is an instance of Y) then (output1 = input2, output2 = input1, output3 = a, output4 = input4) \n\nwhere inputα and outputβ denote input slot α and output slot β, respectively. As a shorthand for this rule, we write [∧ W_Y_ → 21a4], where the square brackets indicate this is a rule, the \"∧\" denotes a conjunctive condition, and the \"_\" denotes a wildcard symbol. A disjunction is denoted by \"∨\". 
\n\nThis formal string manipulation task can be viewed as an abstraction of several interesting cognitive models in the connectionist literature, including case-role assignment (McClelland & Kawamoto, 1986), grapheme-phoneme mapping (Sejnowski & Rosenberg, 1987), and mapping verb stems to the past tense (Rumelhart & McClelland, 1986). \n\n[Figure 1: The RuleNet Architecture. The network consists of an input layer, n pools of u hidden units, n pools of v category units, m condition units, and an output layer; gating connections from the condition units enable the mapping from input to output.] \n\n3 TASK \n\nRuleNet's task is to induce a compact set of rules that accurately characterizes a set of training examples. We generate training examples using a predefined rule base. The rules are over strings of length four and alphabets which are subsets of {a, b, c, d, e, f, g, h, i, j, k, l}. For example, the rule [∨ Y_W_ → 4h21] may be used to generate the exemplars: \n\nhedk → kheh, cldk → khlc, gbdj → jhbg, gdbk → khdg \n\nwhere category W consists of a, b, c, d, i, and category Y consists of f, g, h. Such exemplars form the corpus used to train RuleNet. Exemplars whose input strings meet the conditions of several rules are excluded. RuleNet's task is twofold: it must discover the categories solely based upon the usage of their instances, and it must induce rules based upon those categories. \n\nThe rule bases used to generate examples are minimal in the sense that no smaller set of rules could have produced the examples. Therefore, in our simulations the target number of rules to be induced is the same as the number used to generate the training corpus. \n\nThere are several traditional, symbolic systems, e.g., COBWEB (Fisher, 1987), that induce rules for classifying inputs based upon training examples. 
It seems likely that, given the correct representation, a system such as COBWEB could learn rules that would classify patterns in our domain. However, it is not clear whether such a system could also learn the action associated with each class. Classifier systems (Booker et al., 1989) learn both conditions and actions, but there is no obvious way to map a symbol in slot α of the input to slot β of the output. We have also devised a greedy combinatoric algorithm for inducing this type of rule, which has a number of shortcomings in comparison to RuleNet. See McMillan (1992) for comparisons of RuleNet and alternative symbolic approaches. \n\n4 ARCHITECTURE \n\nRuleNet can implement SCP rules of the type outlined above. As shown in Figure 1, RuleNet has five layers of units: an input layer, an output layer, a layer of category units, a layer of condition units, and a layer of hidden units. The operation of RuleNet can be divided into three functional components: categorization is performed in the mapping from the input layer to the category layer via the hidden units, the conditions are evaluated in the mapping from the category layer to the condition layer, and actions are performed in the mapping from the input layer to the output layer, gated by the condition units. \n\nThe input layer is divided into n pools of units, one for each slot, and activates the category layer, which is also divided into n pools. Input pool α maps to category pool α. Units in category pool α represent possible categorizations of the symbol in input slot α. One or more category units will respond to each input symbol. The activation of the hidden and category units is computed with a logistic squashing function. There are m units in the condition layer, one per rule. The activation of condition unit i, pi, is computed as follows: \n\npi = logistic(neti) / Σj logistic(netj) \n\nThe activation pi represents the probability that rule i applies to the current input. The normalization enforces a soft winner-take-all competition among condition units. To the degree that a condition unit wins, it enables a set of weights from the input layer to the output layer. These weights correspond to the action for a particular rule. There is one set of weights, Ai, for each of the m rules. The activation of the output layer, y, is calculated from the input layer, x, as follows: \n\ny = Σi pi Ai x \n\nEssentially, the transformation Ai for each rule i is applied to the input, and it contributes to the output to the degree that condition i is satisfied. Ideally, just one condition unit will be fully activated by a given input, and the rest will remain inactive. \n\nThis architecture is based on the local expert architecture of Jacobs, Jordan, Nowlan, and Hinton (1991), but is independently motivated in our work by the demands of the task domain. RuleNet has essentially the same structure as the Jacobs network, where the action substructure of RuleNet corresponds to their local experts and the condition substructure corresponds to their gating network. However, their goal, to minimize crosstalk between logically independent subtasks, is quite different from ours. \n\n4.1 Weight Templates \n\nIn order to interpret the weights in RuleNet as symbolic SCP rules, it is necessary to establish a correspondence between regions of weight space and SCP rules. \n\nA weight template is a parameterized set of constraints on some weights, a manifold in weight space, that has a direct correspondence to an SCP rule. 
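The gated output computation of Section 4, in which softmax-normalized condition activities select among per-rule action matrices, can be sketched numerically. The sizes, net inputs, and action matrices below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def rulenet_output(x, condition_nets, actions):
    """Sketch of RuleNet's output stage:
    p_i = logistic(net_i) / sum_j logistic(net_j),  y = sum_i p_i * (A_i @ x).
    Sizes and weights are illustrative, not taken from the paper."""
    p = logistic(np.asarray(condition_nets, dtype=float))
    p = p / p.sum()                      # soft winner-take-all over condition units
    y = sum(pi * (A @ x) for pi, A in zip(p, actions))
    return p, y

# Two "rules" over a two-element input: rule 0 swaps the slots, rule 1 is the identity.
x = np.array([1.0, 0.0])
A_swap = np.array([[0.0, 1.0],
                   [1.0, 0.0]])
A_id = np.eye(2)
p, y = rulenet_output(x, condition_nets=[6.0, -6.0], actions=[A_swap, A_id])
# With net inputs strongly favoring rule 0, y is close to the swapped input [0, 1].
```

This mirrors the ideal regime described in the text: when one condition unit wins the competition, essentially only its action matrix is applied to the input.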
The strategy behind iterative projection is twofold: constrain gradient descent so that weights stay close to templates in weight space, and periodically project the learned weights to the nearest template, which can then readily be interpreted as a set of SCP rules. \n\nFor SCP rules, there are three types of weight templates: one dealing with categorization, one with rule conditions, and one with rule actions. Each type of template is defined over a subset of the weights in RuleNet. The categorization templates are defined over the weights from input to category units, the condition templates are defined over the weights from category to condition units for each rule i, ci, and the action templates are defined over the weights from input to output units for each rule i, Ai. \n\nCategory templates. The category templates specify that the mapping from each input slot α to category pool α, for 1 ≤ α ≤ n, is uniform. This imposes category invariance across the input string. \n\nCondition templates. The weight vector ci, which maps category activities to the activity of condition unit i, has vn elements, v being the number of category units per slot and n being the number of slots. The fact that the condition unit should respond to at most one category in each slot implies that at most one weight in each v-element subvector of ci should be nonzero. For example, assuming there are three categories, N, X, and Y, the vector ci that detects the simple condition \"input2 is an instance of X\" is: (000 0φ0 000 000), where φ is an arbitrary parameter. Additionally, a bias is required to ensure that the net input will be negative unless the condition is satisfied. Here, a bias value, b, of -0.5φ will suffice. 
For disjunctive and conjunctive conditions, weights in two slots should be equal to φ, the rest zero, and the appropriate bias is -0.5φ or -1.5φ, respectively. There is a weight template for each condition type and each combination of slots that takes part in a condition. We generalize these templates further in a variety of ways. For instance, in the case where each input symbol falls into exactly one category, if a constant εα is added to all weights of ci corresponding to slot α and εα is also subtracted from b, the net input to condition unit i will be unaffected. Thus, the weight template must include the {εα}. \n\nAction templates. If we wish the actions carried out by the network to correspond to the string manipulations allowed by our rule domain, it is necessary to impose some restrictions on the values assigned to the action weights for rule i, Ai. Ai has an n × n block form, where n is the length of input/output strings. Each block is a k × k submatrix, where k is the number of elements in the representation of each input symbol. The block at block-row β, block-column α of Ai copies inputα to outputβ if it is the identity matrix. Thus, the weight templates restrict each block to being either the identity matrix or the zero matrix. If outputβ is to be a fixed symbol, then block-row β must be all zero except for the output bias weights in that row. \n\nThe weight templates are defined over a submatrix Aiβ, the set of weights mapping the input to output slot β. There are n+1 templates, one for the mapping of each input slot to the output, and one for the writing of a fixed symbol to the output. An additional constraint, that only one block may be nonzero in block-column α of Ai, ensures that inputα maps to at most one output slot. 
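The condition-template arithmetic can be checked with a small sketch: for a conjunction, weight φ sits in the two relevant (slot, category) positions and the bias is -1.5φ, so with roughly binary category activities the net input is +0.5φ when both conjuncts hold and at most -0.5φ otherwise. The category layout and the value of φ below are illustrative assumptions:

```python
# Net input of a condition unit under a conjunctive weight template.
# Three categories per slot, four slots; category activities are ~binary.
# Layout and parameter values are illustrative assumptions.
PHI = 2.0
N_SLOTS, N_CATS = 4, 3

def conjunction_weights(slot_a, cat_a, slot_b, cat_b):
    """Template: PHI at the two (slot, category) positions, bias -1.5 * PHI."""
    w = [0.0] * (N_SLOTS * N_CATS)
    w[slot_a * N_CATS + cat_a] = PHI
    w[slot_b * N_CATS + cat_b] = PHI
    return w, -1.5 * PHI

def net_input(category_activity, w, bias):
    return sum(a * wi for a, wi in zip(category_activity, w)) + bias

# Condition: "input1 is an instance of category 1 AND input3 is an instance of category 2".
w, b = conjunction_weights(0, 1, 2, 2)

both = [0.0] * (N_SLOTS * N_CATS)
both[0 * N_CATS + 1] = 1.0   # slot 1 categorized as category 1
both[2 * N_CATS + 2] = 1.0   # slot 3 categorized as category 2
one = [0.0] * (N_SLOTS * N_CATS)
one[0 * N_CATS + 1] = 1.0    # only the first conjunct holds

# Net input is +0.5*PHI when both conjuncts hold and -0.5*PHI when only one does,
# so the logistic condition unit turns on only for the full conjunction.
```

The same arithmetic with bias -0.5φ yields a positive net input whenever at least one weighted position is active, which is the disjunctive case.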
\n\n4.2 Constraints on Weight Changes \n\nRecall that the strategy in iterative projection is to constrain weights to be close to the templates described above, in order that they may be readily interpreted as symbolic rules. We use a combination of hard and soft constraints, some of which we briefly describe here. \n\nTo ensure that during learning every block in Ai approaches the identity or zero matrix, we constrain the off-diagonal terms to be zero and constrain weights along the diagonal of each block to be the same, thus limiting the degrees of freedom to one parameter within each block. All weights in ci except the bias are constrained to positive or zero values. Two soft constraints are imposed upon the network to encourage all-or-none categorization of input instances: a decay term is used on all weights in ci except the maximum in each slot, and a second cost term encourages binary activation of the category units. \n\n4.3 Projection \n\nThe constraints described above do not guarantee that learning will produce weights that correspond exactly to SCP rules. However, using projection, it is possible to transform the condition and action weights such that the resulting network can be interpreted as rules. The essential idea of projection is to take a set of learned weights, such as ci, and compute values for the parameters in each of the corresponding weight templates such that the resulting weights match the learned weights. The weight template parameters are estimated using a least squares procedure, and the closest template, based upon a Euclidean distance metric, is taken to be the projected weights. \n\n5 SIMULATIONS \n\nWe ran simulations on 14 different training sets, averaging the performance of the network over at least five runs with different initial weights for each set. 
The training data were generated from SCP rule bases containing 2-8 rules and strings of length four. Between four and eight categories were used. Alphabets ranged from eight to 12 symbols. Symbols were represented by either local or distributed activity vectors. Training set sizes ranged from 3-15% of possible examples. \n\nIterative projection involved the following steps: (1) start with one rule (one set of ci-Ai weights), (2) perform gradient descent for 500-5,000 epochs, (3) project to the nearest set of SCP rules and add a new rule. Steps (2) and (3) were repeated until the training set was fully covered. \n\nIn virtually every run on each data set in which RuleNet converged to a set of rules that completely covered the training set, the rules extracted were exactly the original rules used to generate the training set. In the few remaining runs, RuleNet discovered an equivalent set of rules. \n\nIt is instructive to examine the evolution of a rule set. The rightmost column of Figure 2 shows a set of five rules over four categories, used to generate 200 exemplars, and the left portion of the Figure shows the evolution of the hypothesis set of rules learned by RuleNet over 20,000 training epochs, projecting every 4000 epochs. At epoch 8000, RuleNet has discovered two rules over two categories, covering 24.5% of the training set. At epoch 12,000, RuleNet has discovered three rules over three categories, covering 52% of the training set. At epoch 20,000, RuleNet has induced five rules over four categories that cover 100% of the training examples. A close comparison of these rules with the original rules shows that they differ only in the arbitrary labels RuleNet has attached to the categories. \n\n[Figure 2: Evolution of a Rule Set. The hypothesis rule sets extracted at epochs 8000, 12,000, and 20,000 are shown alongside the original five rules and four categories, together with the instances assigned to each category at each stage.] \n\nLearning rules can greatly enhance generalization. In cases where RuleNet learns the original rules, it can be expected to generalize perfectly to any pattern created by those rules. We compared the performance of RuleNet to that of a standard three-layer backprop network (with 15 hidden units per rule) and a version of the Jacobs architecture, which in principle has the capacity to perform the task. Four rule bases were tested, and roughly 5% of the possible examples were used for training and the remainder were used for generalization testing. Outputs were thresholded to 0 or 1. The cleaned-up outputs were compared to the targets to determine which were mapped correctly. All three architectures learn the training set perfectly. However, on the test set, RuleNet's ability to generalize is 300% to 2000% better than the other systems (Table 1). \n\nTable 1: Generalization performance of RuleNet (average of five runs); % of patterns correctly mapped, train / test \n\nArchitecture          Data Set 1 (8 rules)   Data Set 2 (3 rules)   Data Set 3 (3 rules)   Data Set 4 (5 rules) \nRuleNet               100 / 100              100 / 100              100 / 100              100 / 100 \nJacobs architecture   100 / 22               100 / 7                100 / 14               100 / 27 \n3-layer backprop      100 / 27               100 / 7                100 / 14               100 / 35 \n# patterns in set     120 / 1635             45 / 1380              45 / 1380              75 / 1995 
\n\nFinally, we applied RuleNet to case-role assignment, as considered by McClelland and Kawamoto (1986). Case-role assignment is the problem of mapping syntactic constituents of a sentence to underlying semantic, or thematic, roles. For example, in the sentence \"The boy broke the window\", boy is the subject at the syntactic level and the agent, or acting entity, at the semantic level. Window is the object at the syntactic level and the patient, or entity being acted upon, at the semantic level. The words of a sentence can be represented as a string of n slots, where each slot is labeled with a constituent, such as subject, and that slot is filled with the corresponding word, such as boy. The output is handled analogously. We used McClelland and Kawamoto's 152 sentences over 34 nouns and verbs as RuleNet's training set. The five categories and six rules induced by RuleNet are shown in Table 2, where S = subject, O = object, and wNP = noun in the with noun-phrase. We conjecture that RuleNet has induced such a small set of rules in part because it employs implicit conflict resolution, automatically assigning strengths to categories and conditions. These rules cover 97% of the training set and perform the correct case-role assignments on 84% of the 1307 sentences in the test set. \n\nTable 2: SCP Rules Induced by RuleNet in Case-Role Assignment \n\nRule                                                Sample of Sentences Handled Correctly \nif O = VICTIM then wNP→modifier                     The boy ate the pasta with cheese. \nif O = THING ∧ wNP = UTENSIL then wNP→instrument    The boy ate the pasta with the fork. \nif S = BREAKER then S→instrument                    The rock broke the window. \nif S = THING then S→patient                         The window broke. The fork moved. \nif V = moved then self→patient                      The man moved. \nif S = ANIMATE then food→patient                    The lion ate. 
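The slot-filler reading of these rules can be sketched as a toy function. The category memberships and the fallback behavior below are hypothetical simplifications of what RuleNet induced (only two of the six rules are modeled), intended just to show how a constituent string maps to role assignments:

```python
# Toy sketch of two induced case-role rules over constituent slot-fillers.
# Category memberships are hypothetical stand-ins, not RuleNet's learned sets.
THING = {'pasta', 'window'}
UTENSIL = {'fork', 'spoon'}

def assign_with_np(frame):
    """If O is a THING and the with-NP noun is a UTENSIL, the with-NP fills the
    instrument role; otherwise (simplified here) it modifies the object."""
    roles = {'agent': frame['S'], 'patient': frame['O']}
    if frame['O'] in THING and frame['wNP'] in UTENSIL:
        roles['instrument'] = frame['wNP']
    else:
        roles['modifier'] = frame['wNP']
    return roles

# 'The boy ate the pasta with the fork.' -> fork fills the instrument role.
# 'The boy ate the pasta with cheese.'   -> cheese modifies the object.
fork_frame = {'S': 'boy', 'V': 'ate', 'O': 'pasta', 'wNP': 'fork'}
cheese_frame = {'S': 'boy', 'V': 'ate', 'O': 'pasta', 'wNP': 'cheese'}
```

In RuleNet itself these conditions are evaluated by the competing condition units, so conflicts between overlapping rules are resolved by the learned strengths rather than by an explicit ordering.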
\n\n6 DISCUSSION \n\nRuleNet is but one example of a general methodology for rule induction in neural networks. This methodology involves five steps: 1) identify a fundamentally rule-governed domain, 2) identify a class of rules that characterizes that domain, 3) design a general architecture, 4) establish a correspondence between components of symbolic rules and manifolds of weight space (weight templates), and 5) devise a weight-template-based learning procedure. \n\nUsing this methodology, we have shown that RuleNet is able to perform both category and rule learning. Category learning strikes us as an intrinsically subsymbolic process. Functional categories are often fairly arbitrary (consider the classification of words as nouns or verbs) or have complex statistical structure (consider the classes \"liberals\" and \"conservatives\"). Consequently, real-world categories can seldom be described in terms of boolean (symbolic) expressions; subsymbolic representations are more appropriate. \n\nWhile category learning is intrinsically subsymbolic, rule learning is intrinsically a symbolic process. The integration of the two is what makes RuleNet a unique and powerful system. Traditional symbolic machine learning approaches aren't well equipped to deal with subsymbolic learning, and connectionist approaches aren't well equipped to deal with the symbolic. RuleNet combines the strengths of each approach. \n\nAcknowledgments \n\nThis research was supported by NSF Presidential Young Investigator award IRI-9058450, grant 90-21 from the James S. McDonnell Foundation, and DEC external research grant 1250 to MM; NSF grants IRI-8609599 and ECE-8617947 to PS; by a grant from the Sloan Foundation's computational neuroscience program to PS; and by the Optical Connectionist Machine Program of the NSF Engineering Research Center for Optoelectronic Computing Systems at the University of Colorado at Boulder. 
\nReferences \n\nBooker, L.B., Goldberg, D.E., & Holland, J.H. (1989). Classifier systems and genetic algorithms. Artificial Intelligence, 40:235-282. \nFisher, D.H. (1987). Knowledge acquisition via incremental concept clustering. Machine Learning, 2:139-172. \nJacobs, R., Jordan, M., Nowlan, S., & Hinton, G. (1991). Adaptive mixtures of local experts. Neural Computation, 3:79-87. \nMcClelland, J. & Kawamoto, A. (1986). Mechanisms of sentence processing: assigning roles to constituents. In J.L. McClelland, D.E. Rumelhart, & the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2. Cambridge, MA: MIT Press/Bradford Books. \nMcMillan, C. (1992). Rule induction in a neural network through integrated symbolic and subsymbolic processing. Unpublished Ph.D. thesis. Boulder, CO: Department of Computer Science, University of Colorado. \nRumelhart, D., & McClelland, J. (1986). On learning the past tense of English verbs. In J.L. McClelland, D.E. Rumelhart, & the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2. Cambridge, MA: MIT Press/Bradford Books. \nSejnowski, T.J. & Rosenberg, C.R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168. \n", "award": [], "sourceid": 520, "authors": [{"given_name": "Clayton", "family_name": "McMillan", "institution": null}, {"given_name": "Michael", "family_name": "Mozer", "institution": null}, {"given_name": "Paul", "family_name": "Smolensky", "institution": null}]}