{"title": "Boosting the Performance of RBF Networks with Dynamic Decay Adjustment", "book": "Advances in Neural Information Processing Systems", "page_first": 521, "page_last": 528, "abstract": null, "full_text": "Boosting the Performance of RBF Networks with Dynamic Decay Adjustment\n\nMichael R. Berthold\nForschungszentrum Informatik\nGruppe ACID (Prof. D. Schmid)\nHaid-und-Neu-Strasse 10-14\n76131 Karlsruhe, Germany\neMail: berthold@fzi.de\n\nJay Diamond\nIntel Corporation\n2200 Mission College Blvd.\nSanta Clara, CA, USA 95052 MS:SC9-15\neMail: jdiamond@mipos3.intel.com\n\nAbstract\n\nRadial Basis Function (RBF) Networks, also known as networks of locally-tuned processing units (see [6]), are well known for their ease of use. Most algorithms used to train these types of networks, however, require a fixed architecture, in which the number of units in the hidden layer must be determined before training starts. The RCE training algorithm, introduced by Reilly, Cooper and Elbaum (see [8]), and its probabilistic extension, the P-RCE algorithm, take advantage of a growing structure in which hidden units are only introduced when necessary. The nature of these algorithms allows training to reach stability much faster than is the case for gradient-descent based methods. Unfortunately P-RCE networks do not adjust the standard deviation of their prototypes individually, using only one global value for this parameter.\n\nThis paper introduces the Dynamic Decay Adjustment (DDA) algorithm, which combines the constructive nature of the P-RCE algorithm with independent adaptation of each prototype's decay factor. In addition, this radial adjustment is class dependent and distinguishes between different neighbours. It is shown that networks trained with the presented algorithm perform substantially better than common RBF networks.\n\n1 Introduction\n\nMoody and Darken proposed networks with locally-tuned processing units, also known as Radial Basis Functions (RBFs, see [6]). Networks of this type have a single layer of units with a selective response for some range of the input variables. Each unit has an overall response function, possibly a Gaussian:\n\nR_i(x) = exp(-||x - r_i||^2 / sigma_i^2)    (1)\n\nHere x is the input to the network, r_i denotes the center of the i-th RBF and sigma_i determines its standard deviation. The second layer computes the output function for each class as a weighted sum over the RBFs:\n\nf(x) = sum_{i=1}^{m} A_i R_i(x)    (2)\n\nwith m indicating the number of RBFs and A_i being the weight for each RBF. Moody and Darken propose a hybrid training scheme: unsupervised clustering for the centers and radii of the RBFs, and supervised training of the weights. Unfortunately their algorithm requires a fixed network topology, which means that the number of RBFs must be determined in advance. The same problem applies to the Generalized Radial Basis Functions (GRBF) proposed in [12], where a gradient descent technique is used to implement supervised training of the center locations, which has the disadvantage of long training times.\n\nIn contrast, RCE (Restricted Coulomb Energy) networks construct their architecture dynamically during training (see [7] for an overview). This algorithm was inspired by systems of charged particles in a three-dimensional space and is analogous to the Liapunov equation:\n\nPhi(x) = - sum_{i=1}^{m} Q_i / ||x - r_i||    (3)\n\nwhere Phi is the electrostatic potential induced by fixed particles with charges -Q_i and locations r_i. One variation of this type of network is the so-called P-RCE network, which classifies data using a probabilistic distribution derived from the training set. 
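The Gaussian unit response of equation (1) and the second-layer class output of equation (2) can be sketched in a few lines of Python (a minimal illustration; the function and variable names are ours, not from the paper):

```python
import math

def rbf_activation(x, center, sigma):
    # Gaussian response of one locally-tuned unit, eq. (1):
    # R_i(x) = exp(-||x - r_i||^2 / sigma_i^2)
    return math.exp(-math.dist(x, center) ** 2 / sigma ** 2)

def class_output(x, rbfs):
    # Second-layer output, eq. (2): weighted sum over the m RBF units.
    # `rbfs` is a list of (weight, center, sigma) triples.
    return sum(w * rbf_activation(x, c, s) for w, c, s in rbfs)
```

The activation is 1 at the unit's center and decays radially at a rate set by sigma, which is exactly the per-prototype parameter the DDA algorithm later adapts.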
The underlying training algorithm for P-RCE is identical to RCE training, with Gaussian activation functions used in the forward pass to resemble a Probabilistic Neural Network (PNN, [10]). PNNs are not suitable for large databases because they commit one new prototype for each training pattern they encounter, effectively becoming a referential memory scheme. In contrast, the P-RCE algorithm introduces a new prototype only when necessary, namely when the prototype of a conflicting class misclassifies the new pattern during the training phase. The probabilistic extension is modelled by incrementing the a-priori rate of occurrence for prototypes of the same class as the input vector; therefore weights only connect RBFs and an output node of the same class. The recall phase of the P-RCE network is similar to that of RBFs, except that it uses one global radius for all prototypes and scales each Gaussian by the a-priori rate of occurrence:\n\nf_c(x) = sum_{i=1}^{m_c} A_i^c exp(-||x - r_i^c||^2 / R^2)    (4)\n\nwhere c denotes the class for which the activation is computed, m_c is the number of prototypes for class c, and R is the constant radius of the Gaussian activation functions. The global radius of this method, together with the inability to recognize areas of conflict, leads to confusion in some areas of the feature space and therefore to non-optimal recognition performance.\n\nFigure 1: This picture shows how a new pattern results in a slightly higher activity for a prototype of the right class than for the conflicting prototype. Using only one threshold, no new prototype would be introduced in this case.\n\nThe Dynamic Decay Adjustment (DDA) algorithm presented in this paper was developed to solve the inherent problems associated with these methods. 
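The P-RCE recall of equation (4) amounts to a per-class weighted Gaussian sum with one global radius, followed by an argmax over classes. A minimal sketch (names and data layout are illustrative assumptions, not from the paper):

```python
import math

def prce_scores(x, prototypes, radius):
    # Per-class P-RCE recall, eq. (4): every Gaussian uses the single
    # global radius R and is scaled by its occurrence weight A_i.
    # `prototypes` maps class -> list of (center, weight) pairs.
    return {
        cls: sum(w * math.exp(-math.dist(x, c) ** 2 / radius ** 2)
                 for c, w in protos)
        for cls, protos in prototypes.items()
    }

def prce_classify(x, prototypes, radius):
    # Pick the class with the highest activation.
    scores = prce_scores(x, prototypes, radius)
    return max(scores, key=scores.get)
```

Because `radius` is shared by all prototypes, two classes whose prototypes lie close together can produce nearly equal scores near the decision boundary, which is exactly the confusion the DDA algorithm addresses.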
The constructive part of the P-RCE algorithm is used to build a network with an appropriate number of RBF units, for which the decay factor is computed based on information about neighbours. This technique increases the recognition accuracy in areas of conflict.\n\nThe following sections explain the algorithm, compare it with others, and examine some simulation results.\n\n2 The Algorithm\n\nSince the P-RCE training algorithm already uses an independent area of influence for each RBF, it is relatively straightforward to extract an individual radius. This results, however, in the problem illustrated in figure 1. The new pattern p of class B is properly covered by the right prototype of the same class. However, the left prototype of the conflicting class A results in almost the same activation, which leads to a very low confidence when the network must classify the pattern p.\n\nTo solve this dilemma, two different radii, or thresholds[1], are introduced: a so-called positive threshold (theta+), which must be exceeded by the activation of a prototype of the same class so that no new prototype is added, and a negative threshold (theta-), which is the upper limit for the activation of conflicting classes. Figure 2 shows an example in which the new pattern correctly results in activations above the positive threshold for the correct class B and below the negative threshold for the conflicting class A. This results in better classification confidence in areas where training patterns did not result in new prototypes.\n\n[1] The conversion from the threshold to the radius is straightforward as long as the activation function is invertible.\n\nFigure 2: The proposed algorithm distinguishes between prototypes of correct and conflicting classes and uses different thresholds. Here the level of confidence is higher for the correct classification of the new pattern.\n\nThe network is required to satisfy the following two conditions for every pattern x of class c from the training data:\n\nexists i: R_i^c(x) >= theta+    (5)\n\nforall k != c, 1 <= j <= m_k: R_j^k(x) < theta-    (6)\n\nThe algorithm to construct a classifier can be derived in part from the RCE algorithm. The following pseudo code shows the training for one epoch, where (x, c) denotes a training pattern x of class c:\n\n// reset weights:\nFORALL prototypes p_i^k DO\n    A_i^k = 0.0\nENDFOR\n// train one complete epoch\nFORALL training patterns (x, c) DO\n    IF exists p_i^c with R_i^c(x) >= theta+ THEN\n        A_i^c += 1.0\n    ELSE\n        // \"commit\": introduce new prototype\n        add new prototype p_{m_c+1}^c with:\n            r_{m_c+1}^c = x\n            sigma_{m_c+1}^c = max { sigma : R_{m_c+1}^c(r_j^k) < theta- for all k != c, 1 <= j <= m_k }\n            A_{m_c+1}^c = 1.0\n        m_c += 1\n    ENDIF\n    // \"shrink\": adjust conflicting prototypes\n    FORALL k != c, 1 <= j <= m_k DO\n        sigma_j^k = max { sigma : R_j^k(x) < theta- }\n    ENDFOR\nENDFOR\n\nFirst, all weights are set to zero because otherwise they would accumulate duplicate information about training patterns. Next, all training patterns are presented to the network. If a new pattern is classified correctly, the weight of the closest prototype is increased; otherwise a new prototype is introduced with the new pattern defining its center.\n\nFigure 3: An example of the DDA algorithm: (1) a pattern of class A is encountered and a new RBF is created; (2) a training pattern of class B leads to a new prototype for class B and shrinks the radius of the existing RBF of class A; (3) another pattern of class B is classified correctly and shrinks the prototype of class A again; (4) a new pattern of class A introduces another prototype of that class. 
The last step of the algorithm shrinks all prototypes of conflicting classes if their activations are too high for this specific pattern.\n\nRunning this algorithm over the training data until no further changes are required ensures that equations (5) and (6) hold.\n\nThe choice of the two new parameters, theta+ and theta-, is not as critical as it might initially appear[2]. For all of the experiments reported, the settings theta+ = 0.4 and theta- = 0.1 were used, and no major correlations of the results to these values were noted. Note that choosing theta+ = theta- results in an algorithm with the problem mentioned in figure 1.\n\nFigure 3 shows an example that illustrates the first few training steps of the DDA algorithm.\n\n[2] Theoretically one would expect the dimensionality of the input space to play a major role in the choice of those parameters.\n\n3 Results\n\nSeveral well-known databases were chosen to evaluate this algorithm (some can be found in the CMU Neural Network Benchmark Databases, see [13]). The DDA algorithm was compared against PNN, RCE and P-RCE, as well as a classic Multi-Layer Perceptron, which was trained using a modified Backpropagation algorithm (Rprop, see [9]). The number of hidden nodes of the MLP was optimized manually. In addition, an RBF network with a fixed number of hidden nodes was trained using unsupervised clustering for the center positions and gradient descent to determine the weights (see [6] for more details). The number of hidden nodes was again optimized manually.\n\n\u2022 Vowel Recognition: Speaker-independent recognition of the eleven steady-state vowels of British English using a specified training set of Linear Predictive Coding (LPC) derived log area ratios (see [3]), resulting in 10 inputs and 11 classes to distinguish. The training set consisted of 528 tokens, with 462 different tokens used to test the network. 
\n\nalgorithm | performance | #units | #epochs\nNearest Neighbour | 56% | - | 1\nMLP (Rprop) | 57% | 5 | ~200\nPNN | 61% | 528 | -\nRBF | 59% | 70 | ~100\nRCE | 27% | 125 | 3\nP-RCE | 59% | 125 | 3\nDDA-RBF | 65% | 204 | 4\n\n\u2022 Sonar Database: Discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock (see [4] for more details). The data has 60 continuous inputs and is separated into two classes. For training and testing, 104 samples each were used.\n\nalgorithm | performance | #units | #epochs\nMLP (Rprop) | 90.4% | 50 | ~250\nPNN | 91.3% | 104 | -\nRBF | 90.7% | 80 | ~150\nRCE | 77.9% | 68 | 3\nP-RCE | 90.4% | 68 | 3\nDDA-RBF | 93.3% | 68 | 3\n\n\u2022 Two Spirals: This well-known problem is often used to demonstrate the generalization capability of a network (see [5]). The task is to discriminate between two intertwined spirals. For this paper the spirals were changed slightly to make the problem more demanding: the original spirals' radius declines linearly, and they can be correctly classified by RBF networks with one global radius. To demonstrate the ability of the DDA algorithm to adjust the radii of each RBF individually, a quadratic decline was chosen for the radius of both spirals (see figure 4). The training set consisted of 194 points, and the spirals made three complete revolutions. Figure 4 shows both the result of an RBF network trained with the DDA technique and the same problem solved with a Multi-Layer Perceptron (2-20-20-1) trained using a modified Error Back Propagation algorithm (Rprop, see [9]). Note that in both cases all training points are classified correctly. 
\n\nFigure 4: The (quadratic) \"two spirals problem\" solved by an MLP (left) using Error Back Propagation (after 40000 epochs) and an RBF network (right) trained with the proposed DDA algorithm (after 4 epochs). Note that all training patterns (indicated by squares vs. crosses) are classified correctly.\n\nIn addition to these tasks, the BDG database was used to compare the DDA algorithm to other approaches. This database was used by Waibel et al. (see [11]) to introduce the Time Delay Neural Network (TDNN). It has previously been shown that RBF networks perform equivalently (when using a similar architecture, [1], [2]) with the DDA technique used for training of the RBF units. The BDG task involves distinguishing the three stop consonants \"B\", \"D\" and \"G\". While 783 training sets were used, 749 data sets were used for testing. Each of these contains 15 frames of melscale coefficients, computed from a 10kHz, 12bit converted signal. The final frame frequency was 100Hz.\n\nalgorithm | performance | #epochs\nTDNN | 98.5% | ~50\nTDRBF (P-RCE) | 85.2% | 5\nTDRBF (DDA) | 98.3% | 6\n\n4 Conclusions\n\nIt has been shown that Radial Basis Function networks can boost their performance by using the dynamic decay adjustment technique. The algorithm necessary to construct RBF networks based on the RCE method was described, and a method to distinguish between conflicting and matching prototypes during the training phase was proposed. An increase in performance was noted, especially in areas of conflict, where standard (P-)RCE did not commit new prototypes.\n\nFour different datasets were used to show the performance of the proposed DDA algorithm. In three of the cases, RBF networks trained with dynamic decay adjustment outperformed known RBF training methods and MLPs. 
For the fourth task, the BDG recognition dataset, the TDRBF was able to reach the same level of performance as a TDNN.\n\nIn addition, the new algorithm trains very quickly. Fewer than 6 epochs were sufficient to reach stability for all problems presented.\n\nAcknowledgements\n\nThanks go to our supervisors Prof. D. Schmid and Mark Holler for their support and the opportunity to work on this project.\n\nReferences\n\n[1] M. R. Berthold: \"A Time Delay Radial Basis Function Network for Phoneme Recognition\" in Proc. of the IEEE International Conference on Neural Networks, 7, p.4470-4473, 1994.\n\n[2] M. R. Berthold: \"The TDRBF: A Shift Invariant Radial Basis Function Network\" in Proc. of the Irish Neural Network Conference, p.7-12, 1994.\n\n[3] D. Deterding: \"Speaker Normalization for Automatic Speech Recognition\", PhD Thesis, University of Cambridge, 1989.\n\n[4] R. Gorman, T. Sejnowski: \"Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets\" in Neural Networks 1, p.75.\n\n[5] K. Lang, M. Witbrock: \"Learning to Tell Two Spirals Apart\" in Proc. of the Connectionist Models Summer School, 1988.\n\n[6] J. Moody, C. J. Darken: \"Fast Learning in Networks of Locally-Tuned Processing Units\" in Neural Computation 1, p.281-294, 1989.\n\n[7] M. J. Hudak: \"RCE Classifiers: Theory and Practice\" in Cybernetics and Systems 23, p.483-515, 1992.\n\n[8] D. L. Reilly, L. N. Cooper, C. Elbaum: \"A Neural Model for Category Learning\" in Biol. Cybernet. 45, p.35-41, 1982.\n\n[9] M. Riedmiller, H. Braun: \"A Direct Adaptive Method for Faster Backpropagation Learning: The Rprop Algorithm\" in Proc. of the IEEE International Conference on Neural Networks, 1, p.586-591, 1993.\n\n[10] D. F. Specht: \"Probabilistic Neural Networks\" in Neural Networks 3, p.109-118, 1990.\n\n[11] A. Waibel, T. 
Hanazawa, G. Hinton, K. Shikano, K. Lang: \"Phoneme Recognition Using Time-Delay Neural Networks\" in IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 37, No. 3, 1989.\n\n[12] D. Wettschereck, T. Dietterich: \"Improving the Performance of Radial Basis Function Networks by Learning Center Locations\" in Advances in Neural Information Processing Systems 4, p.1133-1140, 1991.\n\n[13] S. Fahlman, M. White: \"The Carnegie Mellon University Collection of Neural Net Benchmarks\" from ftp.cs.cmu.edu in /afs/cs/project/connect/bench.\n", "award": [], "sourceid": 946, "authors": [{"given_name": "Michael", "family_name": "Berthold", "institution": null}, {"given_name": "Jay", "family_name": "Diamond", "institution": null}]}