{"title": "Hierarchies of adaptive experts", "book": "Advances in Neural Information Processing Systems", "page_first": 985, "page_last": 992, "abstract": null, "full_text": "Hierarchies of adaptive experts \n\nMichael I. Jordan \n\nRobert A. Jacobs \n\nDepartment of Brain and Cognitive Sciences \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\nAbstract \n\nIn this paper we present a neural network architecture that discovers a \nrecursive decomposition of its input space. Based on a generalization of the \nmodular architecture of Jacobs, Jordan, Nowlan, and Hinton (1991), the \narchitecture uses competition among networks to recursively split the input \nspace into nested regions and to learn separate associative mappings within \neach region. The learning algorithm is shown to perform gradient ascent \nin a log likelihood function that captures the architecture's hierarchical \nstructure. \n\n1 \n\nINTRODUCTION \n\nNeural network learning architectures such as the multilayer perceptron and adap(cid:173)\ntive radial basis function (RBF) networks are a natural nonlinear generalization \nof classical statistical techniques such as linear regression, logistic regression and \nadditive modeling. Another class of nonlinear algorithms, exemplified by CART \n(Breiman, Friedman, Olshen, & Stone, 1984) and MARS (Friedman, 1990), gen(cid:173)\neralizes classical techniques by partitioning the training data into non-overlapping \nregions and fitting separate models in each of the regions. These two classes of algo(cid:173)\nrithms extend linear techniques in essentially independent directions, thus it seems \nworthwhile to investigate algorithms that incorporate aspects of both approaches \nto model estimation. Such algorithms would be related to CART and MARS as \nmultilayer neural networks are related to linear statistical techniques. In this pa(cid:173)\nper we present a candidate for such an algorithm. 
The algorithm that we present partitions its training data in the manner of CART or MARS, but it does so in a parallel, on-line manner that can be described as the stochastic optimization of an appropriate cost functional.\n\nWhy is it sensible to partition the training data and to fit separate models within each of the partitions? Essentially, this approach enhances the flexibility of the learner and allows the data to influence the choice between local and global representations. For example, if the data suggest a discontinuity in the function being approximated, then it may be more sensible to fit separate models on both sides of the discontinuity than to adapt a global model across the discontinuity. Similarly, if the data suggest a simple functional form in some region, then it may be more sensible to fit a global model in that region than to approximate the function locally with a large number of local models. Although global algorithms such as backpropagation and local algorithms such as adaptive RBF networks have some degree of flexibility in the tradeoff that they realize between global and local representation, they do not have the flexibility of adaptive partitioning schemes such as CART and MARS.\n\nIn a previous paper we presented a modular neural network architecture in which a number of \"expert networks\" compete to learn a set of training data (Jacobs, Jordan, Nowlan, & Hinton, 1991). As a result of the competition, the architecture adaptively splits the input space into regions and learns separate associative mappings within each region. The architecture that we discuss here is a generalization of the earlier work and arises from considering what would be an appropriate internal structure for the expert networks in the competing experts architecture. 
In our earlier work, the expert networks were multilayer perceptrons or radial basis function networks. If the arguments in support of data partitioning are valid, however, then they apply as well to a region of the input space as they do to the entire input space, and therefore each expert should itself be composed of competing sub-experts. Thus we are led to consider recursively-defined hierarchies of adaptive experts.\n\n2 THE ARCHITECTURE\n\nFigure 1 shows two hierarchical levels of the architecture. (We restrict ourselves to two levels throughout the paper to simplify the exposition; the algorithm that we develop, however, generalizes readily to trees of arbitrary depth.) The architecture has a number of expert networks that map from the input vector x to output vectors y_{ij}. There are also a number of gating networks that define the hierarchical structure of the architecture. There is a gating network for each cluster of expert networks and a gating network that serves to combine the outputs of the clusters. The output of the ith cluster is given by\n\ny_i = \\sum_j g_{j|i} y_{ij}    (1)\n\nwhere g_{j|i} is the activation of the jth output unit of the gating network in the ith cluster. The output of the architecture as a whole is given by\n\ny = \\sum_i g_i y_i    (2)\n\nwhere g_i is the activation of the ith output unit of the top-level gating network.\n\n[Figure 1: Two hierarchical levels of adaptive experts. All of the expert networks and all of the gating networks have the same input vector.]\n\nWe assume that the outputs of the gating networks are given by the normalizing softmax function (Bridle, 1989):\n\ng_i = \\frac{e^{s_i}}{\\sum_j e^{s_j}}    (3)\n\nand\n\ng_{j|i} = \\frac{e^{s_{j|i}}}{\\sum_k e^{s_{k|i}}}    (4)
\n\nwhere s_i and s_{j|i} are the weighted sums arriving at the output units of the corresponding gating networks.\n\nThe gating networks in the architecture are essentially classifiers that are responsible for partitioning the input space. Their choice of partition is based on the ability of the expert networks to model the input-output functions within their respective regions (as quantified by their posterior probabilities; see below). The nested arrangement of gating networks in the architecture (cf. Figure 1) yields a nested partitioning much like that found in CART or MARS. The architecture is a more general mathematical object than a CART or MARS tree, however, given that the gating networks have non-binary outputs and given that they may form nonlinear decision surfaces.\n\n3 THE LEARNING ALGORITHM\n\nWe derive a learning algorithm for our architecture by developing a probabilistic model of a tree-structured estimation problem. The environment is assumed to be characterized by a finite number of stochastic processes that map input vectors x into output vectors y^*. These processes are partitioned into nested collections of processes that have commonalities in their input-output parameterizations. Data are assumed to be generated by the model in the following way. For any given x, collection i is chosen with probability g_i, and a particular process j is then chosen with conditional probability g_{j|i}. The selected process produces an output vector y^* according to the probability density f(y^* | x; y_{ij}), where y_{ij} is a vector of parameters. The total probability of generating y^* is:\n\nP(y^* | x) = \\sum_i g_i \\sum_j g_{j|i} f(y^* | x; y_{ij}),    (5)\n\nwhere g_i, g_{j|i}, and y_{ij} are unknown nonlinear functions of x. 
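As a concrete illustration of the generative architecture just described, the two-level forward pass (output mixing as in Equations 1 and 2, softmax gating as in Equations 3 and 4) can be sketched in Python. This is a minimal sketch of our own, not the paper's implementation; the affine parameterization of the experts and gating networks, and all names and toy dimensions, are illustrative assumptions:

```python
import math

def softmax(s):
    # Normalizing softmax (Equations 3 and 4): g_k = exp(s_k) / sum_l exp(s_l).
    m = max(s)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]

def affine(params, x):
    # params is (weights, bias); every network sees the same input vector x.
    w, b = params
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def forward(x, experts, top_gate, sub_gates):
    """Two-level forward pass: y = sum_i g_i sum_j g_{j|i} y_ij (Eqs. 1-2).

    experts[i][j] is the (i, j)th expert, top_gate the top-level gating
    network, sub_gates[i] the gating network of the ith cluster; all are
    illustrative affine (w, b) parameter pairs.
    """
    g = softmax([affine(p, x) for p in top_gate])             # top-level g_i
    y = 0.0
    for i, cluster in enumerate(experts):
        g_sub = softmax([affine(p, x) for p in sub_gates[i]])  # g_{j|i}
        y_i = sum(gj * affine(p, x) for gj, p in zip(g_sub, cluster))  # Eq. 1
        y += g[i] * y_i                                        # Eq. 2
    return y
```

The output is always a convex combination of the expert outputs, since each softmax produces nonnegative weights that sum to one.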
\nTreating the probability P(y^* | x) as a likelihood function in the unknown parameters g_i, g_{j|i}, and y_{ij}, we obtain a learning algorithm by using gradient ascent to maximize the log likelihood. Let us assume that the probability density associated with the residual vector (y^* - y_{ij}) is the multivariate normal density, where y_{ij} is the mean of the jth process of the ith cluster (or the (i, j)th expert network) and \\Sigma_{ij} is its covariance matrix. Ignoring the constant terms in the normal density, the log likelihood is:\n\n\\ln L = \\ln \\sum_i g_i \\sum_j g_{j|i} |\\Sigma_{ij}|^{-\\frac{1}{2}} e^{-\\frac{1}{2}(y^* - y_{ij})^T \\Sigma_{ij}^{-1} (y^* - y_{ij})}.    (6)\n\nWe define the posterior probability:\n\nh_i = \\frac{g_i \\sum_j g_{j|i} |\\Sigma_{ij}|^{-\\frac{1}{2}} e^{-\\frac{1}{2}(y^* - y_{ij})^T \\Sigma_{ij}^{-1} (y^* - y_{ij})}}{\\sum_i g_i \\sum_j g_{j|i} |\\Sigma_{ij}|^{-\\frac{1}{2}} e^{-\\frac{1}{2}(y^* - y_{ij})^T \\Sigma_{ij}^{-1} (y^* - y_{ij})}},    (7)\n\nwhich is the posterior probability that a process in the ith cluster generates a particular target vector y^*. We also define the conditional posterior probability:\n\nh_{j|i} = \\frac{g_{j|i} |\\Sigma_{ij}|^{-\\frac{1}{2}} e^{-\\frac{1}{2}(y^* - y_{ij})^T \\Sigma_{ij}^{-1} (y^* - y_{ij})}}{\\sum_j g_{j|i} |\\Sigma_{ij}|^{-\\frac{1}{2}} e^{-\\frac{1}{2}(y^* - y_{ij})^T \\Sigma_{ij}^{-1} (y^* - y_{ij})}},    (8)\n\nwhich is the conditional posterior probability that the jth expert in the ith cluster generates a particular target vector y^*. Differentiating Equation 6, and using Equations 3, 4, 7, and 8, we obtain the partial derivative of the log likelihood with respect to the output of the (i, j)th expert network:\n\n\\frac{\\partial \\ln L}{\\partial y_{ij}} = h_i h_{j|i} (y^* - y_{ij}).    (9)\n\nThis partial derivative is a supervised error term modulated by the appropriate posterior probabilities. Similarly, the partial derivatives of the log likelihood with respect to the weighted sums at the output units of the gating networks are given by:\n\n\\frac{\\partial \\ln L}{\\partial s_i} = h_i - g_i    (10)\n\nand\n\n\\frac{\\partial \\ln L}{\\partial s_{j|i}} = h_i (h_{j|i} - g_{j|i}).    (11)\n\nThese derivatives move the prior probabilities associated with the gating networks toward the corresponding posterior probabilities. 
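The posterior probabilities and the three gradients above can be made concrete in a short Python sketch. This is our own illustration, not the paper's code; for simplicity it assumes scalar expert outputs with unit variance, so each Gaussian factor reduces to exp(-(y* - y_ij)^2 / 2), and all names are illustrative:

```python
import math

def posteriors(g, g_sub, y_star, y_exp):
    """Posterior h_i (Eq. 7) and conditional posterior h_{j|i} (Eq. 8),
    for scalar expert outputs y_exp[i][j] with unit variance."""
    # Gaussian kernel for each expert's residual
    k = [[math.exp(-0.5 * (y_star - yij) ** 2) for yij in row] for row in y_exp]
    # numerator of Eq. 7 for each cluster i
    num = [g[i] * sum(gj * kj for gj, kj in zip(g_sub[i], k[i]))
           for i in range(len(g))]
    total = sum(num)
    h = [n / total for n in num]                                   # Eq. 7
    h_sub = []
    for i in range(len(g)):
        z = sum(gj * kj for gj, kj in zip(g_sub[i], k[i]))
        h_sub.append([g_sub[i][j] * k[i][j] / z
                      for j in range(len(k[i]))])                  # Eq. 8
    return h, h_sub

def gradients(g, g_sub, h, h_sub, y_star, y_exp):
    """Log-likelihood gradients for the expert outputs (Eq. 9) and for the
    weighted sums feeding the gating networks' softmax units (Eqs. 10-11)."""
    d_y = [[h[i] * h_sub[i][j] * (y_star - y_exp[i][j])
            for j in range(len(y_exp[i]))] for i in range(len(y_exp))]  # Eq. 9
    d_s = [h[i] - g[i] for i in range(len(g))]                          # Eq. 10
    d_s_sub = [[h[i] * (h_sub[i][j] - g_sub[i][j])
                for j in range(len(g_sub[i]))] for i in range(len(g_sub))]  # Eq. 11
    return d_y, d_s, d_s_sub
```

Note that the gradients for the top-level gating sums always total zero, since both the posteriors h_i and the priors g_i sum to one.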
\n\nIt is interesting to note that the posterior probability h_i appears both in the gradient for the experts in the ith cluster (Equation 9) and in the gradient for the gating network in the ith cluster (Equation 11). This ties the experts within a cluster to each other and implies that experts within a cluster tend to learn similar mappings early in the training process. They differentiate later in training as the probabilities associated with the cluster to which they belong become larger. Thus the architecture tends to acquire coarse structure before acquiring fine structure. This feature of the architecture is significant because it implies a natural robustness to problems with overfitting in deep hierarchies.\n\nWe have also found it useful in practice to obtain an additional degree of control over the coarse-to-fine development of the algorithm. This is achieved with a heuristic that adjusts the learning rate at a given level of the tree as a function of the time-averaged entropy of the gating network at the next higher level of the tree:\n\n\\mu_{j|i}(t+1) = \\alpha \\mu_{j|i}(t) + \\beta (M_i + \\sum_j g_{j|i} \\ln g_{j|i}),\n\nwhere M_i is the maximum possible entropy at level i of the tree. This equation has the effect that the networks at level i+1 are less inclined to diversify if the superordinate cluster at level i has yet to diversify (where diversification is quantified by the entropy of the gating network).\n\n4 SIMULATIONS\n\nWe present simulation results from an unsupervised learning task and two supervised learning tasks.\n\nIn the unsupervised learning task, the problem was to extract regularities from a set of measurements of leaf morphology. Two hundred examples of maple, poplar, oak, and birch leaves were generated from the data shown in Table 1. 
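Before turning to the tasks, one update step of the entropy-based learning-rate heuristic described above can be sketched as follows. This is our own reading of the heuristic; the values of alpha and beta and the guard against zero gating outputs are illustrative assumptions, not from the paper:

```python
import math

def update_rate(mu, g_parent, alpha=0.9, beta=0.1):
    """One step of mu(t+1) = alpha * mu(t) + beta * (M + sum_j g_j ln g_j),
    where g_parent holds the outputs of the gating network one level up and
    M = ln(len(g_parent)) is its maximum possible entropy."""
    M = math.log(len(g_parent))
    # sum_j g_j ln g_j is the negative entropy; skip zero outputs (0 ln 0 -> 0)
    neg_entropy = sum(gj * math.log(gj) for gj in g_parent if gj > 0.0)
    return alpha * mu + beta * (M + neg_entropy)
```

When the parent gate is still uniform (undiversified), the entropy term M + sum_j g_j ln g_j vanishes and the rate simply decays; as the parent diversifies, the term grows positive and the lower level's rate is pushed back up.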
Feature | Maple | Poplar | Oak | Birch\nLength | 3,4,5,6 | 1,2,3 | 5,6,7,8,9 | 2,3,4,5\nWidth | 3,4,5 | 1,2 | 2,3,4,5 | 1,2,3\nFlare | 0 | 0,1 | 0 | 1\nLobes | 5 | 1 | 7,9 | 1\nMargin | Entire | Crenate, Serrate | Entire | Doubly-Serrate\nApex | Acute | Acute | Rounded | Acute\nBase | Truncate | Rounded | Cuneate | Rounded\nColor | Light | Yellow | Light | Dark\n\nTable 1: Data used to generate examples of leaves from four types of trees. The columns correspond to the type of tree; the rows correspond to the features of a tree's leaf. The table's entries give the possible values for each feature for each type of leaf. See Preston (1976).\n\nThe architecture that we used had two hierarchical levels, two clusters of experts, and two experts within each cluster. Each expert network was an auto-associator that maps forty-eight input units into forty-eight output units through a bottleneck of two hidden units. Within the experts, backpropagation was used to convert the derivatives in Equation 9 into changes to the weights. The gating networks at both levels were affine. We found that the hierarchical architecture consistently discovers the decomposition of the data that preserves the natural classes of tree species (cf. Preston, 1976). That is, within one cluster of expert networks, one expert learns the maple training patterns and the other expert learns the oak patterns. Within the other cluster, one expert learns the poplar patterns and the other expert learns the birch patterns. Moreover, due to the use of the auto-associator experts, the hidden unit representations within each expert are principal component decompositions that are specific to a particular species of leaf. 
\n\nWe have also studied a supervised learning problem in which the learner must predict the grayscale pixel values in noisy images of human faces from the values of the pixels in surrounding 5x5 masks. There were 5000 masks in the training set. We used a four-level binary tree, with affine experts (each expert mapped from twenty-five input units to a single output unit) and affine gating networks. We compared the performance of the hierarchical architecture to CART and to backpropagation. (Fifty hidden units were used in the backpropagation network, making the number of parameters in the backpropagation network and the hierarchical network roughly comparable.) In the case of backpropagation and the hierarchical architecture, we utilized cross-validation (using a test set of 5000 masks) to stop the iterative training procedure. As shown in Figure 2, the performance of the hierarchical architecture is comparable to backpropagation and better than CART.\n\n[Figure 2: The results on the image restoration task. The dependent measure is relative error on the test set (cf. Breiman et al., 1984).]\n\nFinally, we also studied a system identification problem involving learning the simulated forward dynamics of a four-joint, three-dimensional robot arm. The task was to predict the joint accelerations from the joint positions, sines and cosines of joint positions, joint velocities, and torques. There were 6000 data items in the training set. We used a four-level tree with trinary splits at the top two levels and binary splits at lower levels. The tree had affine experts (each expert mapped from twenty input units to four output units) and affine gating networks. 
We once again compared the performance of the hierarchical architecture to CART and to backpropagation. In the case of backpropagation and the hierarchical architecture, we utilized a conjugate gradient technique and halted the training process after 1000 iterations. In the case of CART, we ran the algorithm four separate times on the four output variables. Two of these runs produced 100 percent relative error, a third produced 75 percent relative error, and the fourth (the most proximal joint acceleration) yielded 46 percent relative error, which is the value we report in Figure 3. As shown in the figure, the hierarchical architecture and backpropagation achieve comparable levels of performance.\n\n[Figure 3: The results on the system identification task.]\n\n5 DISCUSSION\n\nIn this paper we have presented a neural network learning algorithm that captures aspects of the recursive approach to function approximation exemplified by algorithms such as CART and MARS. The results obtained thus far suggest that the algorithm is computationally viable, comparing favorably to backpropagation in terms of generalization performance on a set of small and medium-sized tasks. The algorithm also has a number of appealing theoretical properties when compared to backpropagation: in the affine case, it is possible to show that (1) no backward propagation of error terms is required to adjust parameters in multi-level trees (cf. the activation-dependence of the multiplicative terms in Equations 9 and 11), and (2) all of the parameters in the tree are maximum likelihood estimators. The latter property suggests that the affine architecture may be a particularly suitable architecture in which to explore the effects of priors on the parameter space (cf. Nowlan & Hinton, this volume). 
\n\nAcknowledgements\n\nThis project was supported by grant IRI-9013991 awarded by the National Science Foundation, by a grant from Siemens Corporation, by a grant from ATR Auditory and Visual Perception Research Laboratories, by a grant from the Human Frontier Science Program, and by an NSF Presidential Young Investigator Award to the first author.\n\nReferences\n\nBreiman, L., Friedman, J.H., Olshen, R.A., & Stone, C.J. (1984) Classification and Regression Trees. Belmont, CA: Wadsworth International Group.\n\nBridle, J. (1989) Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fogelman-Soulie & J. Herault (Eds.), Neuro-computing: Algorithms, Architectures, and Applications. New York: Springer-Verlag.\n\nFriedman, J.H. (1990) Multivariate adaptive regression splines. The Annals of Statistics, 19, 1-141.\n\nJacobs, R.A., Jordan, M.I., Nowlan, S.J., & Hinton, G.E. (1991) Adaptive mixtures of local experts. Neural Computation, 3, 79-87.\n\nPreston, R.J. (1976) North American Trees (Third Edition). Ames, IA: Iowa State University Press.", "award": [], "sourceid": 514, "authors": [{"given_name": "Michael", "family_name": "Jordan", "institution": null}, {"given_name": "Robert", "family_name": "Jacobs", "institution": null}]}