{"title": "Metamorphosis Networks: An Alternative to Constructive Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 131, "page_last": 138, "abstract": null, "full_text": "Metamorphosis Networks: An Alternative to Constructive Methods \n\nBrian V. Bonnlander \nMichael C. Mozer \nDepartment of Computer Science & Institute of Cognitive Science \nUniversity of Colorado \nBoulder, CO 80309-0430 \n\nAbstract \n\nGiven a set of training examples, determining the appropriate number of free parameters is a challenging problem. Constructive learning algorithms attempt to solve this problem automatically by adding hidden units, and therefore free parameters, during learning. We explore an alternative class of algorithms-called metamorphosis algorithms-in which the number of units is fixed, but the number of free parameters gradually increases during learning. The architecture we investigate is composed of RBF units on a lattice, which imposes flexible constraints on the parameters of the network. Virtues of this approach include variable subset selection, robust parameter selection, multiresolution processing, and interpolation of sparse training data. \n\n1 INTRODUCTION \n\nGeneralization performance on a fixed-size training set is closely related to the number of free parameters in a network. Selecting either too many or too few parameters can lead to poor generalization. Geman et al. (1992) refer to this problem as the bias/variance dilemma: introducing too many free parameters incurs high variance in the set of possible solutions, and restricting the network to too few free parameters incurs high bias in the set of possible solutions. 
\n\nConstructive learning algorithms (e.g., Fahlman & Lebiere, 1990; Platt, 1991) have been proposed as a way of automatically selecting the number of free parameters in the network during learning. In these approaches, the learning algorithm gradually increases the number of free parameters by adding hidden units to the network. The algorithm stops adding hidden units when some validation criterion indicates that network performance is good enough. \n\nFigure 1: Architecture of an RBF network (input layer, RBF unit layer, output layer). \n\nWe explore an alternative class of algorithms-called metamorphosis algorithms-for which the number of units is fixed, but heavy initial constraints are placed on the unit response properties. During learning, the constraints are gradually relaxed, increasing the flexibility of the network. Within this general framework, we develop a learning algorithm that builds the virtues of recursive partitioning strategies (Breiman et al., 1984; Friedman, 1991) into a Radial Basis Function (RBF) network architecture. We argue that this framework offers two primary advantages over constructive RBF networks: for problems with low input variable interaction, it can find solutions with far fewer free parameters, and it is less susceptible to noise in the training data. Other virtues include multiresolution processing and built-in interpolation of sparse training data. \n\nSection 2 introduces notation for RBF networks and reviews the advantages of using these networks in constructive learning. Section 3 describes the idea behind metamorphosis algorithms and how they can be combined with RBF networks. Section 4 describes the advantages of this class of algorithm. The final section suggests directions for further research. 
\n\n2 RBF NETWORKS \n\nRBF networks have been used successfully for learning difficult input-output mappings such as phoneme recognition (Wettschereck & Dietterich, 1991), digit classification (Nowlan, 1990), and time series prediction (Moody & Darken, 1989; Platt, 1991). The basic architecture is shown in Figure 1. The response properties of each RBF unit are determined by a set of parameter values, which we'll call a pset. The pset for unit i, denoted ri, includes: the center location of the RBF unit in the input space, μi; the width of the unit, σi; and the strength of the connection(s) from the RBF unit to the output unit(s), hi. \n\nOne reason why RBF networks work well with constructive algorithms is that the hidden units have the property of noninterference: the nature of their activation functions, typically Gaussian, allows new RBF units to be added without changing the global input-output mapping already learned by the network. \n\nHowever, the advantages of constructive learning with RBF networks diminish for problems with high-dimensional input spaces (Hartman & Keeler, 1991). For these problems, a large number of RBF units is needed to cover the input space, even when the number of input dimensions relevant to the problem is small. The relevant input dimensions can be different for different parts of the input space, which limits the usefulness of a global estimate of input dimension relevance, as in Poggio and Girosi (1990). Metamorphosis algorithms, on the other hand, allow RBF networks to solve problems such as these without introducing a large number of free parameters. \n\n3 METAMORPHOSIS ALGORITHMS \n\nMetamorphosis networks contrast with constructive learning algorithms in that the number of units in the network remains fixed, but degrees of freedom are gradually added during learning. 
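To make the pset notation of Section 2 concrete, here is a minimal sketch (our own illustrative code, not the paper's) of the response of a Gaussian RBF network, where each unit i contributes according to its pset ri = (μi, σi, hi):

```python
import numpy as np

def rbf_output(x, centers, widths, heights):
    """Response of a Gaussian RBF network at input x.

    Each hidden unit i is described by its pset (centers[i], widths[i],
    heights[i]): the center mu_i in input space, the width sigma_i, and
    the connection strength h_i to a single output unit.
    """
    sq_dists = np.sum((centers - x) ** 2, axis=1)        # ||x - mu_i||^2
    activations = np.exp(-sq_dists / (2 * widths ** 2))  # Gaussian response
    return float(np.dot(heights, activations))           # weighted sum
```

Because each Gaussian response decays rapidly away from its center, a unit placed far from existing data barely perturbs the mapping already learned, which is the noninterference property noted above.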
While metamorphosis networks have not been explored in the context of supervised learning, there is at least one instance of a metamorphosis network in unsupervised learning: a Kohonen net. Units in a Kohonen net are arranged on a lattice; updating the weights of a unit causes weight updates of the unit's neighbors. Units nearby on the lattice are thereby forced to have similar responses, reducing the effective number of free parameters in the network. In one variant of Kohonen net learning, the neighborhood of each unit gradually shrinks, increasing the degrees of freedom in the network. \n\n3.1 MRBF NETWORKS \n\nWe have applied the concept of metamorphosis algorithms to ordinary RBF networks in supervised learning, yielding MRBF networks. Units are arranged on an n-dimensional lattice, where n is picked ahead of time and is unrelated to the dimensionality of the input space. The response of RBF unit i is constrained by deriving its pset, ri, from a collection of underlying psets, each denoted Uj, that also reside on the lattice. The elements of Uj correspond to those of ri: Uj = (μj, σj, hj). Due to the orderly arrangement of the Uj, the lattice is divided into nonoverlapping hyperrectangular regions, each bounded by 2^n of the Uj. Consequently, each ri is enclosed by 2^n of the Uj. The pset ri can then be derived by linear interpolation of the enclosing underlying psets Uj, as shown in Figure 2 for a one-dimensional lattice. \n\nLearning in MRBF networks proceeds by minimizing an error function E in the Uj components via gradient descent: \n\nΔμjk = -ε Σ_{i ∈ NEIGHj} (∂E/∂μik)(∂μik/∂μjk), \n\nwhere NEIGHj is the set of RBF units whose psets are affected by underlying pset j, and k indexes the input units of the network. The update expression is similar for σj and hj. To better condition the search space, instead of optimizing the \n\nFigure 2: Constrained RBF units. 
(a) Four RBF units with psets r1-r4 are arranged on a one-dimensional lattice, enclosed by underlying psets U1 and U2. (b) An input space representation of the constrained RBF units. RBF center locations, widths, and heights are linearly interpolated. \n\nσi directly, we follow Nowlan and Hinton's (1991) suggestion of computing each RBF unit width according to the transformation σi = exp(γi/2) and searching for the optimum value of γi. This forces RBF widths to remain positive and makes it difficult for a width to approach zero. \n\nWhen a local optimum is reached, either learning is stopped or additional underlying psets are placed on the lattice in a process called metamorphosis. \n\n3.2 METAMORPHOSIS \n\nMetamorphosis is the process that gradually adds new degrees of freedom to the network during learning. For the MRBF network explored in this paper, introducing new free parameters corresponds to placing additional underlying psets on the lattice. The new psets split one hyperrectangular region-an n-dimensional sublattice bounded by 2^n underlying psets-into two nonoverlapping hyperrectangular regions. To achieve this, 2^(n-1) additional underlying psets, which we call the split group, are required (Figure 3). The splitting process implements a recursive partitioning strategy similar to the strategies employed in the CART (Breiman et al., 1984) and MARS (Friedman, 1991) statistical learning algorithms. \n\nMany possible rules for region splitting exist. In the simulations presented later, we consider every possible region and every possible split of the region into two subregions. For each split group k, we compute the tension of the split, defined as \n\ntension(k) = Σ_{j ∈ SPLITGROUPk} ||∂E/∂Uj||. \n\nWe then select the split group that has the greatest tension. 
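The data structures behind this selection step are not spelled out in the paper; the following sketch assumes each candidate split group is a list of underlying-pset parameter vectors and that a routine grad_E returning ∂E/∂Uj is available (both names are ours):

```python
import numpy as np

def split_tension(split_group, grad_E):
    """Tension of a candidate split: summed magnitude of the error
    gradient with respect to each underlying pset in the split group."""
    return sum(np.linalg.norm(grad_E(u)) for u in split_group)

def best_split(candidate_splits, grad_E):
    """Select the candidate split group with the greatest tension."""
    return max(candidate_splits, key=lambda group: split_tension(group, grad_E))
```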
This heuristic is based on the assumption that the error gradient at the point in weight space where a split would take place reflects the long-term benefit of that split. \n\nIt may appear that this splitting process is computationally expensive, but it can be implemented quite efficiently; the cost of computing all possible splits and choosing the best one is linear in the number of RBF units on the lattice. \n\nFigure 3: Computing the tension of a split group. Arrows represent derivatives of the corresponding pset components; the figure's key distinguishes RBF unit psets, split group psets, underlying psets, their derivatives, and lattice region boundaries. \n\n4 VIRTUES OF METAMORPHOSIS NETS \n\n4.1 VARIABLE SUBSET SELECTION \n\nOne advantage of MRBF networks is that they can perform variable subset selection; that is, they can select a subset of input dimensions more relevant to the problem and ignore the other input dimensions. This is also a property of other recursive partitioning algorithms such as CART and MARS. In MRBF networks, however, region splitting occurs on a lattice structure, rather than in the input space. Consequently, the learning algorithm can orient a small number of regions to fit data that is not aligned with the lattice to begin with. CART and MARS have to create many regions to fit this kind of data (Friedman, 1991). \n\nTo see if this style of learning algorithm could learn to solve a difficult problem, we trained an MRBF network on the Mackey-Glass chaotic time series. Figure 4(a) compares normalized RMS error on the test set with Platt's (1991) RAN algorithm as the number of parameters increases during learning. 
Although RAN eventually finds a superior solution, the MRBF network requires a much smaller number of free parameters to find a reasonably accurate solution. This result agrees with the idea that ordinary RBF networks must use many free parameters to cover an input space with RBF units, whereas MRBF networks may use far fewer by concentrating resources on only the most relevant input dimensions. \n\n[Figure 4: two plots. (a) Normalized RMS error versus degrees of freedom (0-900) for RAN and MRBF. (b) Average degrees of freedom versus level of noise (0-0.7) for RAN and MRBF.] \n\nFigure 4: (a) Comparison on the Mackey-Glass chaotic time series. The curves for RAN and MRBF represent an average over ten and three simulation runs, respectively. The simulations used 300 training patterns and 500 test patterns as described in Platt (1991). Simulation parameters for RAN match those reported in Platt (1991) with ε = 0.02. (b) Gaussian noise was added to the function y = sin(8πx), 0 < x < 1, where the task was to predict y given x. The horizontal axis represents the standard deviation of the Gaussian distribution. For both algorithms, 20 simulations were run at each noise level. The number of degrees of freedom (DOF) needed to achieve a fixed error level was averaged. \n\n4.2 ROBUST PARAMETER SELECTION \n\nIn RBF networks, the local response of a hidden unit makes it difficult for back propagation to move RBF centers far from where they are originally placed. Consequently, the choice of initial RBF center locations is critical for constructive algorithms. Poor choices could result in the allocation of more RBF units than are necessary. 
One apparent weakness of the RAN algorithm is that it chooses RBF center locations based on individual examples, which makes it susceptible to noise. Metamorphosis in MRBF networks, on the other hand, is based on the more global measure of tension. \n\nFigure 4(b) shows the average number of degrees of freedom allocated by RAN and an MRBF network on a simple, one-dimensional function approximation task. Gaussian noise was added to the target output values in the training and test sets. As the amount of noise increases, the average number of free parameters allocated by RAN also increases, whereas for the MRBF network, the average remains low. \n\nOne interesting property of RAN is that allocating many extra RBF units does not necessarily hurt generalization performance. This is true when RAN starts with wide RBF units and decreases the widths of candidate RBF units slowly. The main disadvantage of this approach is wasted computational resources. \n\n4.3 MULTIRESOLUTION PROCESSING \n\nOur approach has the property of initially finding solutions sensitive to coarse problem features and using these solutions to find refinements more sensitive to finer features (Figure 5). This idea of multiresolution processing has been studied in the context of computer vision relaxation algorithms and is a property of algorithms proposed by other authors (e.g., Moody, 1989; Platt, 1991). \n\nFigure 5: Example of multiresolution processing, with (a) two, (b) three, and (c) five underlying psets. The figure shows performance on a two-dimensional classification task, where the goal is to classify all inputs inside the U-shape as belonging to the same category. An MRBF network is constrained using a one-dimensional lattice. Circles represent RBF widths, and squares represent the height of each RBF. 
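The coarse-to-fine behavior in Figure 5 follows from the constraint of Section 3.1: each RBF pset is derived by linear interpolation of the underlying psets that enclose it. On a one-dimensional lattice this interpolation can be sketched as follows (illustrative code; the names and tuple layout are our assumptions):

```python
def interpolate_pset(t, u_left, u_right):
    """Derive a constrained RBF unit's pset by linear interpolation
    between the two underlying psets that enclose it on a 1-D lattice.

    t is the unit's fractional lattice position in [0, 1]; each pset
    is a (center, width, height) tuple.
    """
    return tuple((1.0 - t) * a + t * b for a, b in zip(u_left, u_right))
```

Moving a single underlying pset therefore moves every RBF unit in its region coherently, which is what yields coarse solutions early in learning and finer refinements after splits.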
\n\n4.4 INTERPOLATION OF SPARSE TRAINING DATA \n\nFor a problem with sparse training data, it is often necessary to make assumptions about the appropriate response at points in the input space far away from the training data. Like nearest-neighbor algorithms, MRBF networks have such an assumption built in. The constrained RBF units in the network serve to interpolate the values of underlying psets (Figure 6). Although ordinary RBF networks can, in principle, interpolate between sparse data points, the local response of an RBF unit makes it difficult to find this sort of solution by back propagation. \n\n[Figure 6: three panels comparing MRBF network output, the 1-nearest-neighbor assumption, and plain RBF network output.] \n\nFigure 6: Assumptions made for sparse training data on a task with a one-dimensional input space and one-dimensional output space. Target output values are marked with an 'x'. Like nearest-neighbor algorithms, the assumption made by MRBF networks causes network response to interpolate between sparse data points. This assumption is not built into ordinary RBF networks. \n\n5 DIRECTIONS FOR FURTHER RESEARCH \n\nIn our simulations to date, we have not observed astonishingly better generalization performance with metamorphosis nets than with alternative approaches, such as Platt's RAN algorithm. Nonetheless, we believe the approach worthy of further exploration. We have examined but one type of metamorphosis net, and in only a few domains. The sorts of investigations we are considering next include: substituting finite-element basis functions for RBFs, implementing a \"soft\" version of the RBF pset constraint using regularization techniques, and using a supervised learning algorithm similar to Kohonen networks, where updating the weights of a unit causes weight updates of the unit's neighbors. 
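The last suggestion, supervised learning with Kohonen-style neighborhood coupling, might look roughly like the following sketch (entirely illustrative; the Gaussian neighborhood kernel, learning rate, and update form are our assumptions, not a method from the paper):

```python
import numpy as np

def neighborhood_update(weights, grad, winner, lattice_pos, lr=0.1, radius=1.0):
    """Kohonen-style coupled update: the gradient computed for unit
    `winner` also moves its lattice neighbors, scaled by a Gaussian
    kernel over lattice distance."""
    for i in range(len(weights)):
        d = np.linalg.norm(lattice_pos[i] - lattice_pos[winner])
        weights[i] -= lr * np.exp(-d ** 2 / (2 * radius ** 2)) * grad
    return weights
```

Shrinking `radius` over the course of learning would gradually decouple the units, adding degrees of freedom in the same spirit as metamorphosis.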
\n\nAcknowledgements \n\nThis research was supported by NSF PYI award IRI-9058450 and grant 90-21 from the James S. McDonnell Foundation. We thank John Platt for providing the Mackey-Glass time series data, and Chris Williams, Paul Smolensky, and the members of the Boulder Connectionist Research Group for helpful discussions. \n\nReferences \n\nL. Breiman, J. Friedman, R. A. Olshen & C. J. Stone. (1984) Classification and Regression Trees. Belmont, CA: Wadsworth. \nS. E. Fahlman & C. Lebiere. (1990) The cascade-correlation learning architecture. In D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 2, 524-532. San Mateo, CA: Morgan Kaufmann. \nJ. Friedman. (1991) Multivariate adaptive regression splines. Annals of Statistics 19:1-141. \nS. Geman, E. Bienenstock & R. Doursat. (1992) Neural networks and the bias/variance dilemma. Neural Computation 4(1):1-58. \nE. Hartman & J. D. Keeler. (1991) Predicting the future: advantages of semilocal units. Neural Computation 3(4):566-578. \nT. Kohonen. (1982) Self-organized formation of topologically correct feature maps. Biological Cybernetics 43:59-69. \nJ. Moody & C. Darken. (1989) Fast learning in networks of locally-tuned processing units. Neural Computation 1(2):281-294. \nJ. Moody. (1989) Fast learning in multi-resolution hierarchies. In D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 1, 29-39. San Mateo, CA: Morgan Kaufmann. \nS. J. Nowlan. (1990) Maximum likelihood competition in RBF networks. Tech. Rep. CRG-TR-90-2, Department of Computer Science, University of Toronto, Toronto, Canada. \nS. J. Nowlan & G. Hinton. (1991) Adaptive soft weight-tying using Gaussian mixtures. In Moody, Hanson, & Lippmann (eds.), Advances in Neural Information Processing Systems 4, 993-1000. San Mateo, CA: Morgan Kaufmann. \nJ. Platt. (1991) A resource-allocating network for function interpolation. Neural Computation 3(2):213-225. \nT. Poggio & F. Girosi. 
(1990) Regularization algorithms for learning that are equivalent to multilayer networks. Science 247:978-982. \nD. Wettschereck & T. Dietterich. (1991) Improving the performance of radial basis function networks by learning center locations. In Moody, Hanson, & Lippmann (eds.), Advances in Neural Information Processing Systems 4, 1133-1140. San Mateo, CA: Morgan Kaufmann. \n", "award": [], "sourceid": 604, "authors": [{"given_name": "Brian", "family_name": "Bonnlander", "institution": null}, {"given_name": "Michael", "family_name": "Mozer", "institution": null}]}