{"title": "A Novel Net that Learns Sequential Decision Process", "book": "Neural Information Processing Systems", "page_first": 760, "page_last": 766, "abstract": null, "full_text": "760 \n\nA NOVEL NET THAT LEARNS \n\nSEQUENTIAL DECISION PROCESS \n\nG.Z. SUN, Y.C. LEE and H.H. CHEN \n\nDepartment of PhYJicJ and AJtronomy \n\nand \n\nUNIVERSITY OF MARYLAND,COLLEGE PARK,MD 20742 \n\nInJtitute for Advanced Computer StudieJ \n\nABSTRACT \n\nWe propose a new scheme to construct neural networks to classify pat(cid:173)\n\nterns. The new scheme has several novel features : \n\n1. We focus attention on the important attributes of patterns in ranking \norder. Extract the most important ones first and the less important \nones later. \n\n2. In training we use the information as a measure instead of the error \n\nfunction. \n\n3. A multi-percept ron-like architecture is formed auomatically. Decision \n\nis made according to the tree structure of learned attributes. \n\nThis new scheme is expected to self-organize and perform well in large scale \nproblems. \n\n\u00a9 American Institute of Physics 1988 \n\n\f761 \n\n1 \n\nINTRODUCTION \n\nIt is well known that two-layered percept ron with binary connections but no \nhidden units is unsuitable as a classifier due to its limited power [1]. It cannot \nsolve even the simple exclusive-or problem. Two extensions have been pro(cid:173)\np'osed to remedy this problem. The first is to use higher order connections \nl2]. It has been demonstrated that high order connections could in many \ncases solve the problem with speed and high accuracy [3], [4]. The repre(cid:173)\nsentations in general are more local than distributive. The main drawback \nis however the combinatorial explosion of the number of high-order terms. \nSome kind of heuristic judgement has to be made in the choice of these terms \nto be represented in the network. \n\nA second proposal is the multi-layered binary network with hidden units \nr5]. 
These hidden units function as features extracted from the bottom input layer to facilitate the classification of patterns by the output units. In order to train the weights, learning algorithms have been proposed that back-propagate the errors from the visible output layer to the hidden layers for eventual adaptation to the desired values. The multi-layered networks enjoy great popularity for their flexibility. \n\nHowever, there are also problems in implementing multi-layered nets. Firstly, there is the problem of allocating the resources: namely, how many hidden units would be optimal for a particular problem? If we allocate too many, it is not only wasteful but could also degrade the performance of the network, since too many hidden units imply too many free parameters, which fit the training patterns too specifically; the ability to generalize to novel test patterns would be adversely affected. On the other hand, if too few hidden units were allocated, the network would not have the power even to represent the training set. How could one judge beforehand how many are needed to solve a problem? This is similar to the problem encountered in the high-order net in choosing the high-order terms to be represented. \n\nSecondly, there is also the problem of scaling up the network. Since the network represents a parallel, cooperative process of the whole system, each added unit interacts with every other unit. This becomes a serious problem when the size of our patterns grows large. \n\nThirdly, there is no sequential communication among the patterns in a conventional network. To accomplish a cognitive function we would need the patterns to interact and communicate with each other, as in human reasoning. It is difficult to envision such an interaction in current systems, which are basically input-output mappings. 
\n\n2 THE NEW SCHEME \n\nIn this paper we propose a scheme that constructs a network taking advantage of both parallel and sequential processes. \n\nWe note that in order to classify patterns, one has to extract the intrinsic features, which we call attributes. For a complex pattern set there may be a large number of attributes, but different attributes may have different rankings of importance. Instead of extracting them all simultaneously, it may be wiser to extract them sequentially in order of their importance [6], [7]. Here the importance of an attribute is determined by its ability to partition the pattern set into sub-categories. A measure of this ability of a processing unit should be based on the extracted information. For simplicity, let us assume that there are only two categories, so that the units have only the binary output values 1 and 0 (but the input patterns may have analog representations). We call these units, including their connection weights to the input layer, nodes. For given connection weights, the patterns that are classified by a node as in category 1 may have their true classifications either 1 or 0. Similarly, the patterns that are classified by a node as in category 0 may also have their true classifications either 1 or 0. As a result, four groups of patterns are formed: (1,1), (0,0), (1,0), (0,1). We then need to judge the efficiency of the node by its ability to split these patterns optimally. To do this we shall construct the impurity functions for the node. Before splitting, the impurity of the input patterns reaching the node is given by \n\nI_0 = - P_1^0 log P_1^0 - P_0^0 log P_0^0 \n\n(1) \n\nwhere P_1^0 = N_1^0/N is the probability of being truly classified as in category 1, and P_0^0 = N_0^0/N is the probability of being truly classified as in category 0. 
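The impurity of the patterns reaching a node can be computed directly from the true class counts; a minimal Python sketch (the function and argument names are ours, not the paper's):

```python
import math

def impurity(n_class1, n_class0):
    """Impurity of a pattern set before splitting:
    I0 = -P1*log(P1) - P0*log(P0), where P_i is the fraction of
    patterns whose true classification is category i."""
    n = n_class1 + n_class0
    value = 0.0
    for count in (n_class1, n_class0):
        p = count / n
        if p > 0.0:  # the limit of p*log(p) as p -> 0 is 0
            value -= p * math.log(p)
    return value
```

An evenly mixed set gives the maximum impurity log 2 ≈ 0.693, while a pure set gives 0, so a good split drives the branch impurities down.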
After splitting, the patterns are channelled into two branches, and the impurity becomes \n\nI_1 = - P_1^c \sum_{j=0,1} P(j,1) log P(j,1) - P_0^c \sum_{j=0,1} P(j,0) log P(j,0) \n\n(2) \n\nwhere P_1^c = N_1/N is the probability of being classified by the node as in category 1, P_0^c = N_0/N is the probability of being classified by the node as in category 0, and P(j,i) is the probability of a pattern which should be in category j being classified by the node as in category i. The difference \n\n\Delta I = I_0 - I_1 \n\n(3) \n\nrepresents the decrease of the impurity at the node after splitting. It is the quantity that we seek to optimize at each node. The logarithms in the impurity functions come from the information entropy of Shannon and Weaver. For all practical purposes, we found the optimization of (3) the same as maximizing the entropy [6] \n\nS = (N_1/N)[(N_{11}/N_1)^2 + (N_{01}/N_1)^2] + (N_0/N)[(N_{10}/N_0)^2 + (N_{00}/N_0)^2] \n\n(4) \n\nwhere N_i is the number of training patterns classified by the node as in category i, and N_{ij} is the number of training patterns with true classification in category i but classified by the node as in category j. Later we shall call the terms in the first bracket S_1 and those in the second S_2. Obviously, we have \n\nN_{1i} + N_{0i} = N_i, i = 0,1. \n\nAfter we train the first unit, the training patterns are split into two branches by the unit. If the classification in either one of these two branches is pure enough, or equivalently either one of S_1 and S_2 is fairly close to 1, then we terminate that branch (or branches) as a leaf of the decision tree, and classify the patterns as such. On the other hand, if either branch is not pure enough, we add an additional node to split the pattern set further. The subsequent unit is trained with only those patterns channelled through this branch. These operations are repeated until all the branches are terminated as leaves. \n\n3 LEARNING ALGORITHM \n\nWe used a stochastic gradient method to learn the weights of each node. 
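A minimal sketch of such a node and its training loop, under explicit assumptions: the node is a sigmoid unit whose output softly channels each pattern into branch 1 (weight O) or branch 0 (weight 1-O); the purity measure S is taken as the per-branch sum of squared class fractions (1.0 when both branches are pure, 0.5 for even mixtures); and plain finite-difference gradient ascent stands in for the paper's analytic update, so this illustrates the objective rather than the authors' exact algorithm:

```python
import math
import random

def entropy_S(ws, patterns, labels):
    """Branch-purity measure of a sigmoid node over its training set.
    Soft counts n[i][j] accumulate patterns of true class i landing in
    branch j; S = sum_j (N_j/N) * sum_i (N_ij/N_j)**2."""
    n = [[1e-9, 1e-9], [1e-9, 1e-9]]  # n[i][j]: true class i, branch j
    for x, a in zip(patterns, labels):
        o = 1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(ws, x))))
        n[a][1] += o
        n[a][0] += 1.0 - o
    total = sum(map(sum, n))
    s = 0.0
    for j in (0, 1):
        nj = n[0][j] + n[1][j]
        s += (nj / total) * ((n[0][j] / nj) ** 2 + (n[1][j] / nj) ** 2)
    return s

def train_node(patterns, labels, steps=300, lr=1.0, eps=1e-4):
    # Maximize S by finite-difference gradient ascent (a stand-in for
    # the analytic stochastic gradient of the learning-algorithm section).
    ws = [random.uniform(-0.1, 0.1) for _ in patterns[0]]
    for _ in range(steps):
        grad = []
        for k in range(len(ws)):
            ws[k] += eps
            up = entropy_S(ws, patterns, labels)
            ws[k] -= 2.0 * eps
            down = entropy_S(ws, patterns, labels)
            ws[k] += eps
            grad.append((up - down) / (2.0 * eps))
        ws = [w + lr * g for w, g in zip(ws, grad)]
    return ws
```

On a linearly separable toy set the node drives S from about 0.5 toward 1; the resulting branch purities then feed the leaf-or-split decision described above.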
The training set for each node consists of those patterns being channelled to this node. As stated in the previous section, we seek to maximize the entropy function S. The learning of the weights is therefore conducted through \n\n\Delta W_j = \eta \partial S/\partial W_j \n\n(5) \n\nwhere \eta is the learning rate. The gradient of S can be calculated from the following equation \n\n\partial S/\partial W_j = (1/N)[(2N_{11}/N_1 - 1) \partial N_{11}/\partial W_j + (2N_{01}/N_1 - 1) \partial N_{01}/\partial W_j + (2N_{10}/N_0 - 1) \partial N_{10}/\partial W_j + (2N_{00}/N_0 - 1) \partial N_{00}/\partial W_j]. \n\n(6) \n\nUsing analog units \n\nO^r = 1/(1 + exp(- \sum_j W_j I_j^r)) \n\n(7) \n\nwe have \n\n\partial O^r/\partial W_j = O^r(1 - O^r) I_j^r. \n\n(8) \n\nFurthermore, let A^r = 1 or 0 be the true answer for the input pattern r; then \n\nN_{ij} = \sum_r [iA^r + (1 - i)(1 - A^r)][jO^r + (1 - j)(1 - O^r)]. \n\n(9) \n\nSubstituting these into equation (5), we get \n\n\Delta W_j = (2\eta/N) \sum_r [2A^r(N_{11}/N_1 - N_{10}/N_0) + N_{01}/N_1 - N_{00}/N_0] O^r(1 - O^r) I_j^r. \n\n(10) \n\nIn applying formula (10), instead of calculating the whole summation at once, we update the weights for each pattern individually. Meanwhile we update N_{ij} in accord with equation (9). \n\nFigure 1: The given classification tree, where the thresholds \theta_1, \theta_2 and \theta_3 are chosen to be all zeros in the numerical example. \n\n4 AN EXAMPLE \n\nTo illustrate our method, we construct an example which is itself a decision tree. Assuming there are three hidden variables a_1, a_2, a_3, a pattern is given by a ten-dimensional vector I_1, I_2, ..., I_{10}, constructed from the three hidden variables as follows: \n\nI_1 = a_1 + a_3, I_2 = 2a_1 - a_2, I_3 = a_3 - 2a_2, I_4 = a_1 + 2a_2 + 3a_3, I_5 = 5a_1 - 4a_2, I_6 = 2a_3, I_7 = a_3 - a_1, I_8 = 2a_1 + 3a_3, I_9 = 4a_3 - 3a_1, I_{10} = 2a_1 + 2a_2 + 2a_3. \n\nA given pattern is classified as either 1 (yes) or 0 (no) according to the corresponding values of the hidden variables a_1, a_2, a_3. The actual decision is derived from the decision tree in Fig. 1. 
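The pattern construction of this example can be sketched directly. Two hedges: the coefficient of I_5 is partly illegible in the scanned source, so 5*a1 - 4*a2 is an assumed reconstruction, and the true yes/no label would come from the decision tree of Fig. 1, which is not reproduced here:

```python
import random

def make_pattern(rng=random):
    """Draw hidden variables a1, a2, a3 uniformly from [-1, 1] and
    build the ten-dimensional input vector I1..I10 of the example."""
    a1, a2, a3 = (rng.uniform(-1.0, 1.0) for _ in range(3))
    x = [
        a1 + a3,              # I1
        2*a1 - a2,            # I2
        a3 - 2*a2,            # I3
        a1 + 2*a2 + 3*a3,     # I4
        5*a1 - 4*a2,          # I5 (assumed reconstruction)
        2*a3,                 # I6
        a3 - a1,              # I7
        2*a1 + 3*a3,          # I8
        4*a3 - 3*a1,          # I9
        2*a1 + 2*a2 + 2*a3,   # I10
    ]
    return (a1, a2, a3), x

# A 5000-pattern training set, as in the paper's experiment:
training_set = [make_pattern() for _ in range(5000)]
```

Because the ten inputs are all linear in the three hidden variables, the classifier must in effect recover the hidden decision structure from redundant linear projections.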
\n\nIn order to learn this classification tree, we construct a training set of 5000 patterns generated from values of a_1, a_2, a_3 chosen randomly in the interval -1 to +1. We randomly choose the initial weights for each node, and terminate a branch as a leaf whenever the branch entropy is greater than 0.80. \n\nFigure 2: The learned classification tree structure, showing for each node the branch entropies S_1 and S_2 and, at each leaf, the numbers of correctly and incorrectly classified patterns. \n\nFor the first node the entropy started at S = 0.65 and terminated at its maximum value S = 0.79. The two branches of this node have the entropy function valued at S_1 = 0.61 and S_2 = 0.87 respectively. This corresponds to 2446 patterns channelled to the first branch and 2554 to the second. Since S_2 > 0.80 we terminate the second branch. Among the 2554 patterns channelled to the second branch there are 2519 patterns with true classification no and 35 yes, the latter counted as errors. After completing the whole training process, a total of four nodes had been introduced automatically. The final result is shown as a tree structure in Fig. 2. \n\nThe total errors made by the learned tree amount to 3.4% of the 5000 training patterns. After training we tested the result using 10000 novel patterns, among which the error rate is 3.2%. \n\n5 SUMMARY \n\nWe propose here a new scheme to construct a neural network that can automatically learn attributes sequentially, to facilitate the classification of patterns according to the ranking importance of each attribute. This scheme uses information as a measure of the performance of each unit. It is self-organized into a presumably optimal structure for a specific task. 
The sequential learning procedure focuses the attention of the network on the most important attribute first and then branches out to the less important attributes. This strategy of searching for attributes alleviates the scale-up problem faced by the fully parallel back-propagation scheme. It also avoids the problem of resource allocation encountered in the high-order net and the multi-layered net. In the example we showed that the performance of the new method is satisfactory. We expect even better relative performance on problems that demand a large number of units. \n\n6 ACKNOWLEDGEMENT \n\nThis work is partially supported by AFOSR under grant 87-0388. \n\nReferences \n\n[1] M. Minsky and S. Papert, Perceptrons, MIT Press, Cambridge, MA (1969). \n\n[2] Y.C. Lee, G. Doolen, H.H. Chen, G.Z. Sun, T. Maxwell, H.Y. Lee and C.L. Giles, Machine Learning Using a Higher Order Correlation Network, Physica D22, 276-306 (1986). \n\n[3] H.H. Chen, Y.C. Lee, G.Z. Sun, H.Y. Lee, T. Maxwell and C.L. Giles, High Order Correlation Model for Associative Memory, AIP Conference Proceedings Vol. 151, p. 86, Ed. John Denker (1986). \n\n[4] T. Maxwell, C.L. Giles, Y.C. Lee and H.H. Chen, Nonlinear Dynamics of Artificial Neural Systems, AIP Conference Proceedings Vol. 151, p. 299, Ed. John Denker (1986). \n\n[5] D. Rumelhart and J. McClelland, Parallel Distributed Processing, MIT Press (1986). \n\n[6] L. Breiman, J. Friedman, R. Olshen and C.J. Stone, Classification and Regression Trees, Wadsworth, Belmont, California (1984). \n\n[7] J.R. Quinlan, Machine Learning, Vol. 1, No. 1 (1986). \n", "award": [], "sourceid": 74, "authors": [{"given_name": "Guo-Zheng", "family_name": "Sun", "institution": null}, {"given_name": "Yee-Chun", "family_name": "Lee", "institution": null}, {"given_name": "Hsing-Hen", "family_name": "Chen", "institution": null}]}