{"title": "Supervised Learning with Growing Cell Structures", "book": "Advances in Neural Information Processing Systems", "page_first": 255, "page_last": 262, "abstract": null, "full_text": "Supervised Learning with Growing Cell \n\nStructures \n\nBernd Fritzke \n\nInstitut fiir Neuroinformatik \n\nRuhr-U niversitat Bochum \n\nGermany \n\nAbstract \n\nWe present a new incremental radial basis function network suit(cid:173)\nable for classification and regression problems. Center positions \nare continuously updated through soft competitive learning. The \nwidth of the radial basis functions is derived from the distance \nto topological neighbors. During the training the observed error \nis accumulated locally and used to determine where to insert the \nnext unit. This leads (in case of classification problems) to the \nplacement of units near class borders rather than near frequency \npeaks as is done by most existing methods. The resulting networks \nneed few training epochs and seem to generalize very well. This is \ndemonstrated by examples. \n\n1 \n\nINTRODUCTION \n\nFeed-forward networks of localized (e.g., Gaussian) units are an interesting alter(cid:173)\nnative to the more frequently used networks of global (e.g., sigmoidal) units. It \nhas been shown that with localized units one hidden layer suffices in principle to \napproximate any continuous function, whereas with sigmoidal units two layers are \nnecessary. \n\nIn the following we are considering radial basis function networks similar to those \nproposed by Moody & Darken (1989) or Poggio & Girosi (1990). Such networks \nconsist of one layer L of Gaussian units. Each unit eEL has an associated vector \nWe E Rn indicating the position of the Gaussian in input vector space and a standard \n\n255 \n\n\f256 \n\nFritzke \n\ndeviation U c . 
For a given input datum ξ ∈ R^n the activation of unit c is described by

D_c(ξ) = exp(−‖ξ − w_c‖² / σ_c²)    (1)

On top of the layer L of Gaussian units there are m single layer perceptrons. Thereby, m is the output dimensionality of the problem, which is given by a number of input/output pairs¹ (ξ, ζ) ∈ R^n × R^m. Each of the single layer perceptrons computes a weighted sum of the activations in L:

O_i(ξ) = Σ_{j∈L} w_ij D_j(ξ),   i ∈ {1, …, m}    (2)

With w_ij we denote the weighted connection from local unit j to output unit i. Training a single layer perceptron to minimize square error is a very well understood problem which can be solved incrementally by the delta rule or directly by linear algebra techniques (Moore-Penrose inverse). Therefore, the only (but severe) difficulty when using radial basis function networks is choosing the number of local units and their respective parameters, namely center position w and width σ.

One extreme approach is to use one unit per data point and to position the units directly at the data points. If one chooses the width of the Gaussians sufficiently small it is possible to construct a network which correctly classifies the training data, no matter how complicated the task is (Fritzke, 1994). However, the network size is very large and might even be infinite in the case of a continuous stream of non-repeating stochastic input data. Moreover, such a network can be expected to generalize poorly.

Moody & Darken (1989), in contrast, propose to use a fixed number of local units (which is usually considerably smaller than the total number of data points). These units are first distributed by an unsupervised clustering method (e.g., k-means). Thereafter, the weights to the output units are determined by gradient descent.
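As a concrete illustration of equations (1) and (2), and of the pseudoinverse option for the output layer, here is a minimal NumPy sketch; the function names and the batch layout are our own assumptions, not the paper's API:

```python
import numpy as np

# Sketch of the RBF forward pass (eqns 1 and 2) and of fitting the linear
# output layer directly with the Moore-Penrose inverse, as mentioned above.
def rbf_activations(X, centers, sigmas):
    # D_c(xi) = exp(-||xi - w_c||^2 / sigma_c^2), for every row xi of X
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigmas[None, :] ** 2)

def fit_output_weights(X, targets, centers, sigmas):
    # solve the least-squares problem for the output weights in one step
    return np.linalg.pinv(rbf_activations(X, centers, sigmas)) @ targets

def predict(X, centers, sigmas, W):
    # O_i(xi) = sum_j w_ij D_j(xi) -> one output row per input row
    return rbf_activations(X, centers, sigmas) @ W
```

With one unit per data point and a small width this reproduces the exact-classification construction described in the text.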
Although good results are reported for this method it is rather easy to come up with examples where it would not perform well: k-means positions the units based on the density of the training data, specifically near density peaks. However, to approximate the optimal Bayesian a posteriori classifier it would be better to position units near class borders. Class borders, however, often lie in regions with a particularly low data density. Therefore, all methods based on k-means-like unsupervised placement of the Gaussians are in danger of performing poorly with a fixed number of units or, similarly undesirable, of needing a huge number of units to achieve decent performance.

From this one can conclude that, in the case of radial basis function networks, it is essential to use the class labels not only for the training of the connection weights but also for the placement of the local units. Doing this forms the core of the method proposed below.

¹Throughout this article we assume a classification problem and use the corresponding terminology. However, the described method is suitable for regression problems as well.

2 SUPERVISED GROWING CELL STRUCTURES

In the following we present an incremental radial basis function network which is able to simultaneously determine a suitable number of local units, their center positions and widths, as well as the connection weights to the output units. The basic idea is a very simple one:

0. Start with a very small radial basis function network.
1. Train the current network with some I/O-pairs from the training data.
2. Use the observed accumulated error to determine where in input vector space to insert new units.
3. If the network does not perform well enough, go to 1.

One should note that during the training phase (Step 1.)
error is accumulated over several data items and this accumulated error is used to determine where to insert new units (Step 2.). This is different from the approach of Platt (1991), where insertions are based on single poorly mapped patterns. In both cases, however, the goal is to position new units in regions where the current network does not perform well rather than in regions where many data items stem from.

In our model the center positions of new units are interpolated from the positions of existing units. Specifically, after some adaptation steps we determine the unit q which has accumulated the maximum error and insert a new unit in between q and one of its neighbors in input vector space. The interpolation procedure makes it necessary to allow the center positions of existing units to change. Otherwise, all new units would be restricted to the convex hull of the centers of the initial network.

We do not necessarily insert a new unit in between q and its nearest neighbor. Rather, we would like to choose one of the units with adjacent Voronoi regions². In the two-dimensional case these are the direct neighbors of q in the Delaunay triangulation (Delaunay-neighbors) induced by all center positions. In higher-dimensional spaces there exists an equivalent based on hypertetrahedrons which, however, is very hard to compute. For this reason, we arrange our units in a certain topological structure (see below) which has the property that if two units are direct neighbors in that structure they are mostly Delaunay-neighbors. In this way we obtain, with very little computational effort, an approximate subset of the Delaunay-neighbors which seems to be sufficient for practical purposes.

2.1 NETWORK STRUCTURE

The structure of our network is very similar to standard radial basis function networks.
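The grow-and-train schedule of steps 0-3 above can be sketched as a small driver loop; `train_growing_net`, the attribute names, and the two callables are hypothetical stand-ins for the adaptation and insertion procedures detailed in the following subsections, not the paper's API:

```python
import random

# Minimal sketch of the grow-and-train schedule (steps 0-3).
# adapt_step and insert_unit are supplied by the caller; the network
# representation is whatever those two callables agree on.
def train_growing_net(net, data, adapt_step, insert_unit,
                      lam=100, max_units=50, error_goal=0.0):
    while True:
        for _ in range(lam):                 # step 1: lambda adaptation steps
            adapt_step(net, random.choice(data))
        if net.n_units >= max_units or net.error <= error_goal:
            return net                       # step 3: stop when good enough
        insert_unit(net)                     # step 2: grow where error peaked
```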
The only difference is that we arrange the local units in a k-dimensional topological structure consisting of connected simplices³ (lines for k = 1, triangles for k = 2, tetrahedrons for k = 3 and hypertetrahedrons for larger k). This arrangement is done to facilitate the interpolation and adaptation steps described below. The initial network consists of one k-dimensional simplex (k + 1 local units fully connected with each other). The neighborhood connections are not weighted and do not directly influence the behavior of the network. They are, however, used to determine the width of the Gaussian functions associated with the units. Let N_c denote, for each Gaussian unit c, the set of direct topological neighbors in the topological structure. Then the width of c is defined as

σ_c = (1/|N_c|) Σ_{d∈N_c} ‖w_c − w_d‖    (3)

which is the mean distance to the topological neighbors. If topological neighbors have similar center positions (which will be ensured by the way adaptation and insertion are done) then this leads to a covering of the input vector space with partially overlapping Gaussian functions.

²The Voronoi region of a unit c denotes the part of the input vector space which consists of points for which c is the nearest unit.
³A historical reason for this specific approach is the fact that the model was developed from an unsupervised network (see Fritzke, 1993) where the k-dimensional neighborhood was needed to reduce dimensionality. We currently investigate an alternative (and more

2.2 ADAPTATION

It was mentioned above that several adaptation steps are done before a new unit is inserted. One single adaptation step is done as follows (see fig. 1):

• Choose an I/O-pair (ξ, ζ), ξ ∈ R^n, ζ ∈ R^m, from the training data.
• Determine the unit s closest to ξ (the so-called best-matching unit).
• Move the centers of s and its direct topological neighbors towards ξ:

Δw_s = ε_b (ξ − w_s)
Δw_c = ε_n (ξ − w_c)   for all c ∈ N_s

ε_b and ε_n are small constants with ε_b ≫ ε_n.
• Compute for each local unit c ∈ L the activation D_c(ξ) (see eqn. 1).
• Compute for each output unit i the activation O_i (see eqn. 2).
• Compute the square error:

SE = Σ_{i=1}^{m} (ζ_i − O_i)²

• Accumulate error at the best-matching unit s:

Δerr_s = SE

• Make a delta-rule step for the weights (α denotes the learning rate):

Δw_ij = α (ζ_i − O_i) D_j(ξ),   i ∈ {1, …, m}, j ∈ L

Since the direct topological neighbors are always adapted together with the best-matching unit, neighboring units tend to have similar center positions. This property can be used to determine suitable center positions for new units, as will be demonstrated in the following.

Figure 1: One adaptation step (a: before, b: during, c: after adaptation). The center positions of the current network are shown, together with the change caused by a single input signal. The observed error SE for this pattern is added to the local error variable of the best-matching unit.

Figure 2: Insertion of a new unit (a: before, b: after insertion). The dotted lines indicate the Voronoi fields. The unit q has accumulated the most error and, therefore, a new unit is inserted between q and one of its direct neighbors.

2.3 INSERTION OF NEW UNITS

After a constant number λ of adaptation steps a new unit is inserted. For this purpose the unit q with maximum accumulated error is determined. Obviously, q lies in a region of the input vector space where many misclassifications occur. One possible reason for this is that the gradient descent procedure is unable to find suitable weights for the current network.
This again might be caused by the coarse resolution at this region of the input vector space: if data items from different classes are covered by the same local unit and activate this unit to about the same degree, then it might be the case that their vectors of local unit activations are nearly identical, which makes it hard for the following single layer perceptrons to distinguish among them. Moreover, even if the activation vectors are sufficiently different they still might not be linearly separable.

³(cont.) accurate) approximation of the Delaunay triangulation which is based on the \"Neural-Gas\" method proposed by Martinetz & Schulten (1991).

Figure 3: Two spiral problem and learning results of a constructive network. (a) Two spiral problem: 194 points in two classes. (b) Decision regions for Cascade-Correlation (reprinted with permission from Fahlman & Lebiere, 1990).

The insertion of a new local unit near q is likely to improve the situation: this unit will probably be activated to a different degree by the data items in this region and will, therefore, make the problem easier for the single layer perceptrons.

What exactly are we doing? We choose one of the direct topological neighbors of q, say a unit f (see also fig. 2). Currently this is the neighbor with the maximum accumulated error. Other choices, however, have shown good results as well, e.g., the neighbor with the most distant center position or even a randomly picked neighbor.
We insert a new unit r in between q and f and initialize its center by

w_r = (w_q + w_f) / 2    (4)

We connect the new unit with q and f and with all common neighbors of q and f. The original connection between q and f is removed. By this we get a structure of k-dimensional simplices again. The new unit gets weights to the output units which are interpolated from the weights of its neighbors. The same is done for the initial error variable, which is linearly interpolated from the variables of the neighbors of r. After the interpolation all the weights of r and its neighbors and the error variables of these units are multiplied by a factor |N_r|/(|N_r| + 1). This is done to disturb the output of the network as little as possible⁴. However, the by far most important decision seems to be to insert the new unit near the unit with maximum error. The weights and the error variables adjust quickly after some learning steps.

⁴The redistribution of the error variable is again a relict from the unsupervised version (Fritzke, 1993). There we count signals rather than accumulate error. An elaborate scheme for redistributing the signal counters is necessary to get good local estimates of the probability density. For the supervised version this redistribution is harder to justify since the insertion of a new unit in general makes previous error information void. However, even though there is still some room for simplification, the described scheme does work very well already in its present form.

Figure 4: Performance of the Growing Cell Structures on the two spiral benchmark. (a) Final network with 145 cells. (b) Decision regions.

2.4 SIMULATION RESULTS

Simulations with the two spiral problem (fig. 3a) have been performed.
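The insertion mechanics described in Section 2.3 might be sketched as follows; this is a simplified illustration for the k = 1 (chain) case with invented names, and it omits the |N_r|/(|N_r| + 1) rescaling discussed there:

```python
import numpy as np

def insert_between(centers, errors, W_out, neighbors, q, f):
    # New unit r halfway between q and f (eq. 4); its output weights and
    # error variable are interpolated from those of q and f. Restricted to
    # k = 1, so only the q--f edge is rewired (no common neighbors exist).
    r = len(centers)
    centers = np.vstack([centers, 0.5 * (centers[q] + centers[f])])
    W_out = np.hstack([W_out, 0.5 * (W_out[:, [q]] + W_out[:, [f]])])
    errors = np.append(errors, 0.5 * (errors[q] + errors[f]))
    neighbors[q].discard(f)          # the original q--f connection is removed
    neighbors[f].discard(q)
    neighbors[r] = {q, f}            # r is connected to both q and f
    neighbors[q].add(r)
    neighbors[f].add(r)
    return centers, errors, W_out, neighbors
```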
This classification benchmark has been widely used before, so that results for comparison are readily available. Figure 3b shows the result of another constructive algorithm. The data consist of 194 points arranged on two interlaced spirals in the plane. Each spiral corresponds to one class. Due to the high nonlinearity of the task it is particularly difficult for networks consisting of global units (e.g., multi-layer perceptrons). However, the varying density of data points (which is higher in the center of the spirals) makes it also a challenge for networks of local units.

As for most learning problems the interesting aspect is not learning the training examples but rather the performance on new data, which is often denoted as generalization. Baum & Lang (1991) defined a test set of 576 points for this problem consisting of three equidistant test points between each pair of adjacent same-class training points. They reported for their best network 29 errors on the test set in the mean.

In figure 4 a typical network generated by our method can be seen, as well as the corresponding decision regions. No errors on the test set of Baum and Lang are made. Table 1 shows the necessary training cycles for several algorithms. The new growing network uses far fewer cycles than the other networks.

Other experiments have been performed with a vowel recognition problem (Fritzke, 1993). In all simulations we obtained significantly better generalization results than Robinson (1989), who in his thesis investigated the performance of several connectionist and conventional algorithms on the same problem.
The necessary number of training cycles for our method was lower by a factor of about 37 than the numbers reported by Robinson (1993, personal communication).

Table 1: Training epochs necessary for the two spiral problem

network model           | epochs | test error | reported in
Backpropagation         | 20000  | yes        | Lang & Witbrock (1989)
Cross Entropy BP        | 10000  | yes        | Lang & Witbrock (1989)
Cascade-Correlation     | 1700   | yes        | Fahlman & Lebiere (1990)
Growing Cell Structures | 180    | no         | Fritzke (1993)

REFERENCES

Baum, E. B. & K. E. Lang [1991], \"Constructing hidden units using examples and queries,\" in Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody & D. S. Touretzky, eds., Morgan Kaufmann Publishers, San Mateo, 904-910.

Fahlman, S. E. & C. Lebiere [1990], \"The Cascade-Correlation Learning Architecture,\" in Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., Morgan Kaufmann Publishers, San Mateo, 524-532.

Fritzke, B. [1993], \"Growing Cell Structures - a self-organizing network for unsupervised and supervised learning,\" International Computer Science Institute, TR-93-026, Berkeley.

Fritzke, B. [1994], \"Making hard problems linearly separable - incremental radial basis function approaches,\" (submitted to ICANN'94: International Conference on Artificial Neural Networks), Sorrento, Italy.

Lang, K. J. & M. J. Witbrock [1989], \"Learning to tell two spirals apart,\" in Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton & T. Sejnowski, eds., Morgan Kaufmann, San Mateo, 52-59.

Martinetz, T. M. & K. J. Schulten [1991], \"A \"neural-gas\" network learns topologies,\" in Artificial Neural Networks, T. Kohonen, K. Makisara, O. Simula & J. Kangas, eds., North-Holland, Amsterdam, 397-402.

Moody, J. & C.
Darken [1989], \"Learning with Localized Receptive Fields,\" in Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton & T. Sejnowski, eds., Morgan Kaufmann, San Mateo, 133-143.

Platt, J. C. [1991], \"A Resource-Allocating Network for Function Interpolation,\" Neural Computation 3, 213-225.

Poggio, T. & F. Girosi [1990], \"Regularization Algorithms for Learning That Are Equivalent to Multilayer Networks,\" Science 247, 978-982.

Robinson, A. J. [1989], \"Dynamic Error Propagation Networks,\" PhD Thesis, Cambridge University, Cambridge.
", "award": [], "sourceid": 791, "authors": [{"given_name": "Bernd", "family_name": "Fritzke", "institution": null}]}