{"title": "Adaptively Growing Hierarchical Mixtures of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 459, "page_last": 465, "abstract": null, "full_text": "Adaptively Growing Hierarchical \n\nMixtures of Experts \n\nJiirgen Fritsch, Michael Finke, Alex Waibel \n\n{fritsch+,finkem, waibel }@cs.cmu.edu \n\nInteractive Systems Laboratories \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nAbstract \n\nWe propose a novel approach to automatically growing and pruning \nHierarchical Mixtures of Experts. The constructive algorithm pro(cid:173)\nposed here enables large hierarchies consisting of several hundred \nexperts to be trained effectively. We show that HME's trained by \nour automatic growing procedure yield better generalization per(cid:173)\nformance than traditional static and balanced hierarchies. Eval(cid:173)\nuation of the algorithm is performed (1) on vowel classification \nand (2) within a hybrid version of the JANUS r9] speech recog(cid:173)\nnition system using a subset of the Switchboard large-vocabulary \nspeaker-independent continuous speech recognition database. \n\nINTRODUCTION \n\nThe Hierarchical Mixtures of Experts (HME) architecture [2,3,4] has proven use(cid:173)\nful for classification and regression tasks in small to medium sized applications \nwith convergence times several orders of magnitude lower than comparable neu(cid:173)\nral networks such as the multi-layer perceptron. The HME is best understood as a \nprobabilistic decision tree, making use of soft splits of the input feature space at the \ninternal nodes, to divide a given task into smaller, overlapping tasks that are solved \nby expert networks at the terminals of the tree. Training of the hierarchy is based \non a generative model using the Expectation Maximisation (EM) [1,3] algorithm as \na powerful and efficient tool for estimating the network parameters. 
\nIn [3], the architecture of the HME is considered pre-determined and remains fixed \nduring training. This requires choice of structural parameters such as tree depth \nand branching factor in advance. As with other classification and regression tech(cid:173)\nniques, it may be advantageous to have some sort of data-driven model-selection \nmechanism to (1) overcome false initialisations (2) speed-up training time and (3) \nadapt model size to task complexity for optimal generalization performance. In \n[11], a constructive algorithm for the HME is presented and evaluated on two small \nclassification tasks: the two spirals and the 8-bit parity problems. However, this \n\n\f460 \n\n1. Fritsch, M. Finke and A. Waibel \n\nalgorithm requires the evaluation of the increase in the overall log-likelihood for all \npotential splits (all terminal nodes) in an existing tree for each generation. This \nmethod is computationally too expensive when applied to the large HME's neces(cid:173)\nsary in tasks with several million training vectors, as in speech recognition, where \nwe can not afford to train all potential splits to eventually determine the single best \nsplit and discard all others. We have developed an alternative approach to growing \nHME trees which allows the fast training of even large HME's, when combined with \na path pruning technique. Our algorithm monitors the performance of the hierar(cid:173)\nchy in terms of scaled log-likelihoods, assigning penalties to the expert networks, \nto determine the expert that performs worst in its local partition. This expert will \nthen be expanded into a new subtree consisting of a new gating network and several \nnew expert networks. \n\nHIERARCHICAL MIXTURES OF EXPERTS \n\nWe restrict the presentation of the HME to the case of classification, although it was \noriginally introduced in the context of regression. 
The architecture is a tree with gating networks at the non-terminal nodes and expert networks at the leaves. The gating networks receive the input vectors and divide the input space into a nested set of regions that correspond to the leaves of the tree. The expert networks also receive the input vectors and produce estimates of the a-posteriori class probabilities, which are then blended by the gating network outputs. All networks in the tree are linear, with a softmax non-linearity as their activation function. Such networks are known in statistics as multinomial logit models, a special case of Generalized Linear Models (GLIM) [5] in which the probabilistic component is the multinomial density. This allows for a probabilistic interpretation of the hierarchy in terms of a generative likelihood-based model. For each input vector x, the outputs of the gating networks are interpreted as the input-dependent multinomial probabilities for the decisions about which child nodes are responsible for the generation of the actual target vector y. After a sequence of these decisions, a particular expert network is chosen as the current classifier and computes multinomial probabilities for the output classes. The overall output of the hierarchy is \n\nP(y|x, \\Theta) = \\sum_{i=1}^{N} g_i(x, v_i) \\sum_{j=1}^{N} g_{j|i}(x, v_{ij}) P(y|x, \\theta_{ij}) \n\nwhere the g_i and g_{j|i} are the outputs of the gating networks. The HME is trained using the EM algorithm [1] (see [3] for the application of EM to the HME architecture). The E-step requires the computation of posterior node probabilities as expected values for the unknown decision indicators: \n\nh_i = \\frac{g_i \\sum_j g_{j|i} P_{ij}(y)}{\\sum_i g_i \\sum_j g_{j|i} P_{ij}(y)} \n\nThe M-step then leads to independent maximum-likelihood equations such as \n\nv_{ij} = \\arg\\max_{v_{ij}} \\sum_t \\sum_k h_k^{(t)} \\sum_l h_{l|k}^{(t)} \\log g_{l|k}^{(t)} \n\nwhere the \\theta_{ij} are the parameters of the expert networks and the v_i and v_{ij} are the parameters of the gating networks. In the case of a multinomial logit model, P_{ij}(y) = y_c, where y_c is the output of the node associated with the correct class. The above maximum-likelihood equations may be solved by gradient ascent, weighted least squares or Newton methods. In our implementation, we use a variant of Jordan & Jacobs' [3] least squares approach. \n\nGROWING MIXTURES \n\nIn order to grow an HME, we have to define an evaluation criterion to score the experts' performance on the training data. This in turn allows us to select and split the worst expert into a new subtree, providing additional parameters which can help to overcome the errors made by this expert. Viewing the HME as a probabilistic model of the observed data, we partition the input-dependent likelihood using expert selection probabilities provided by the gating networks: \n\nl(\\Theta; X) = \\sum_t \\log P(y^{(t)}|x^{(t)}, \\Theta) = \\sum_t \\sum_k g_k \\log P(y^{(t)}|x^{(t)}, \\Theta) = \\sum_k \\sum_t \\log [P(y^{(t)}|x^{(t)}, \\Theta)]^{g_k} = \\sum_k l_k(\\Theta; X) \n\nwhere the g_k are the products of the gating probabilities along the path from the root node to the k-th expert. g_k is the probability that expert k is responsible for generating the observed data (note that the g_k sum to one). The expert-dependent scaled likelihoods l_k(\\Theta; X) can be used as a measure of the performance of an expert within its region of responsibility. We use this measure as the basis of our tree growing algorithm: \n\n1. Initialize and train a simple HME consisting of only one gate and several experts. \n2. Compute the expert-dependent scaled likelihoods l_k(\\Theta; X) for each expert in one additional pass through the training data. \n3. 
Find the expert k with minimum l_k and expand the tree, replacing this expert by a new gate with random weights and new experts that copy the weights of the old expert, with additional small random perturbations. \n4. Train the architecture to a local minimum of the classification error, using a cross-validation set. \n5. Continue with step (2) until the desired tree size is reached. \n\nThe number of tree growing phases may either be pre-determined or based on the difference in the likelihoods before and after splitting a node. In contrast to the growing algorithm in [11], our algorithm does not hypothesize all possible node splits, but determines the expansion node(s) directly, which is much faster, especially when dealing with large hierarchies. Furthermore, we implemented a path pruning technique similar to the one proposed in [11], which speeds up training and testing times significantly. During the recursive depth-first traversal of the tree (needed for forward evaluation, posterior probability computation and accumulation of node statistics), a path is pruned temporarily if the current node's probability of activation falls below a certain threshold. Additionally, we prune subtrees permanently if the sum of a node's activation probabilities over the whole training set falls below a certain threshold. This technique is consistent with the growing algorithm and also helps prevent instabilities and singularities in the parameter updates: nodes that accumulate too little training information are automatically pruned and therefore not considered for a parameter update. \n\nFigure 1: Histogram trees for a standard and a grown HME \n\nVOWEL CLASSIFICATION \n\nIn initial experiments, we investigated the usefulness of the proposed tree growing algorithm on Peterson and Barney's [6] vowel classification data, which uses formant frequencies as features. We chose this data set since it is small, non-artificial and low-dimensional, which allows for visualization and understanding of the way the growing HME tree performs classification tasks. \n\nThe vowel data set contains 1520 samples, each consisting of the formants F0, F1, F2 and F3 and a class label indicating one of 10 different vowels. Experiments were carried out on the 4-dimensional feature space; however, in this paper graphical representations are restricted to the F1-F2 plane. [Figure: the data set plotted in the F1-F2 plane, with formant frequencies normalized to the range [0,1].] \n\nIn the following experiments, we use binary branching HMEs exclusively, but in general the growing algorithm poses no restrictions on the tree branching factor. We compare a standard, balanced HME of depth 3 with an HME that grows from a two-expert tree to a tree with the same number of experts (eight) as the standard HME. The size of the standard HME was chosen based on a number of experiments with different sized HMEs to find an optimal one. Fig. 1 shows the topology of the standard and the fully grown HME, together with histograms of the gating probability distributions at the internal nodes. \n\nFig. 2 shows results on 4-dimensional feature vectors in terms of correct classification rate and log-likelihood. The growing HME achieved a slightly better (1.6% absolute) classification rate than the fixed HME. Note also that the growing HME outperforms the fixed HME even before it reaches its full size. 
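As an implementation aside, the expert-dependent scaled log-likelihoods l_k = Σ_t g_k^(t) log P(y^(t)|x^(t), Θ) that drive the growing step above can be accumulated in a single pass. A minimal sketch in plain Python with hypothetical names, not the authors' code:

```python
import math

def expert_scaled_loglik(g, p_model):
    # g[t][k]   : path probability of expert k for sample t (product of
    #             gating outputs from root to expert; sums to 1 over k)
    # p_model[t]: overall model likelihood P(y_t | x_t, Theta) for sample t
    # Returns the scaled log-likelihood l_k for each expert.
    n_experts = len(g[0])
    lks = [0.0] * n_experts
    for gt, pt in zip(g, p_model):
        logp = math.log(pt)
        for k in range(n_experts):
            lks[k] += gt[k] * logp
    return lks

def worst_expert(lks):
    # The expert with the minimum scaled log-likelihood is the one split next.
    return min(range(len(lks)), key=lambda k: lks[k])
```

Since the g_k sum to one over the experts for each sample, the l_k exactly partition the overall training log-likelihood, which is what makes them comparable across experts.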
The growing HME was expanded every 4 iterations, which explains the bumpiness of the curves. Fig. 3 shows the impact of path pruning during training on the final classification rate of the grown HMEs. The pruning factor ranges from no pruning to full pruning (i.e. only the most likely path survives). \n\nFigure 2: Classification rate and log-likelihood for standard and growing HME \n\nFigure 3: Impact of path pruning during training of growing HMEs \n\nFig. 4 shows how the gating networks partition the feature space. It contains plots of the activation regions of all 8 experts of the standard HME in the 2-dimensional range [-0.1, 1.1]^2. Activation probabilities (the product of gating probabilities from root to expert) are colored in shades of gray from black to white. Fig. 5 shows the same kind of plot for all 8 experts of the grown HME. 
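The activation probability of an expert (the product of gating outputs from the root down to that expert) falls out of the same recursive depth-first traversal used for path pruning. A sketch under an assumed tuple-based tree representation and a user-supplied gate function, purely for illustration:

```python
def expert_activations(node, gate_fn, x, prob=1.0, threshold=1e-3):
    # node is either ('expert', name) or ('gate', params, [children]).
    # gate_fn(params, x) returns one probability per child (summing to 1).
    # Paths whose accumulated probability falls below `threshold` are
    # skipped, mirroring the temporary path pruning described in the text.
    if node[0] == 'expert':
        return {node[1]: prob}
    _, params, children = node
    acts = {}
    for g_i, child in zip(gate_fn(params, x), children):
        p = prob * g_i
        if p < threshold:       # prune this path temporarily
            continue
        acts.update(expert_activations(child, gate_fn, x, p, threshold))
    return acts
```

With no pruning, the returned activations sum to one over the experts, which is why they can be rendered directly as gray-scale responsibility maps.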
The plots in the upper right corner illustrate the class boundaries obtained by each HME. \n\nFigure 4: Expert activations for standard HME \n\nFig. 4 reveals a weakness of standard HMEs: gating networks at high levels in the tree can pinch off whole branches, rendering all the experts in the subtree useless. In our case, half of the experts of the standard HME do not contribute to the final decision at all (black boxes). The growing HMEs are able to overcome this effect. All the experts of the grown HME (Fig. 5) have non-zero activation patterns, and the overlap between experts is much higher in the growing case, which indicates a higher degree of cooperation among experts. This can also be seen in the histogram trees in Fig. 1, where gating networks in lower levels of the grown tree tend to average the experts' outputs. The splits formed by the gating networks also have implications for the way class boundaries are formed by the HME. There are strong dependencies visible between the class boundaries and some of the experts' activation regions. \n\nFigure 5: Expert activations for grown HME \n\nEXPERIMENTS ON SWITCHBOARD \n\nWe recently started experiments using standard and growing HMEs as estimators of posterior phone probabilities in a hybrid version of the JANUS [9] speech recognizer. Following the work in [12], we use different HMEs for each state of a phonetic HMM. The posteriors for 52 phonemes computed by the HMEs are converted into scaled likelihoods by dividing by prior probabilities, to account for the likelihood-based training and decoding of HMMs. During training, targets for the HMEs are generated by forced alignment using a baseline mixture-of-Gaussians HMM system. We evaluate the system on the Switchboard spontaneous telephone speech corpus. 
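The posterior-to-scaled-likelihood conversion mentioned above is the standard hybrid-HMM step: dividing each phone posterior P(phone|x) by its prior P(phone) yields P(x|phone)/P(x), which an HMM decoder can use in place of a likelihood. A generic sketch with a hypothetical function name, not the JANUS code:

```python
import math

def scaled_log_likelihoods(posteriors, priors):
    # Convert network posteriors P(phone|x) into scaled likelihoods
    # P(phone|x) / P(phone) = P(x|phone) / P(x), returned in the log
    # domain as typically consumed by a Viterbi decoder.
    return [math.log(post) - math.log(prior)
            for post, prior in zip(posteriors, priors)]
```

The common P(x) term cancels out of the Viterbi path comparison, which is why dividing by the priors alone is sufficient.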
Our best current mixture-of-Gaussians based context-dependent HMM system achieves a word accuracy of 61.4% on this task, which is among the best current systems [7]. We started by using phonetic context-independent (CI) HMEs for 3-state HMMs. We restricted the training set to all dialogues involving speakers from one dialect region (New York City), since the whole training set contains over 140 hours of speech. Our aim here was to reduce training time (the subset contains only about 5% of the data) in order to be able to compare different HME architectures. \n\nFigure 6: Preliminary results on Switchboard telephone data \n\nContext | # experts | Word Acc. \nCI (standard HME) | 64 | 33.8% \nCI (growing HME) | 64 | 35.1% \n\nTo improve performance, we then built context-dependent (CD) models consisting of a separate HME for each biphone context and state. The CD HMEs' output is smoothed with the CI models based on prior context probabilities. Current work focuses on improving context modeling (e.g. larger contexts and decision-tree based clustering). \n\nFig. 6 summarizes the results so far, showing consistently that growing HMEs outperform equally sized standard HMEs. The results are not directly comparable with our best Gaussian mixture system, since we restricted context modeling to biphones and used only a small subset of the Switchboard database for training. \n\nCONCLUSIONS \n\nIn this paper, we presented a method for adaptively growing Hierarchical Mixtures of Experts. We showed that the algorithm allows the HME to use its resources (experts) more efficiently than a standard pre-determined HME architecture. The tree growing algorithm leads to better classification performance compared to standard HMEs with equal numbers of parameters. 
Using growing instead of fixed HMEs as continuous density estimators in a hybrid speech recognition system also improves performance. \n\nReferences \n\n[1] Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1-38. \n[2] Jacobs, R.A., Jordan, M.I., Nowlan, S.J. & Hinton, G.E. (1991) Adaptive mixtures of local experts. In Neural Computation 3, pp. 79-87. MIT Press. \n[3] Jordan, M.I. & Jacobs, R.A. (1994) Hierarchical Mixtures of Experts and the EM Algorithm. In Neural Computation 6, pp. 181-214. MIT Press. \n[4] Jordan, M.I. & Jacobs, R.A. (1992) Hierarchies of adaptive experts. In Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, eds., pp. 985-993. Morgan Kaufmann, San Mateo, CA. \n[5] McCullagh, P. & Nelder, J.A. (1983) Generalized Linear Models. Chapman and Hall, London. \n[6] Peterson, G.E. & Barney, H.L. (1952) Control methods used in a study of the vowels. Journal of the Acoustical Society of America 24, 175-184. \n[7] Proceedings of LVCSR Hub 5 workshop, Apr. 29 - May 1 (1996) MITAGS, Linthicum Heights, Maryland. \n[8] Syrdal, A.K. & Gopal, H.S. (1986) A perceptual model of vowel recognition based on the auditory representation of American English vowels. Journal of the Acoustical Society of America, 79 (4):1086-1100. \n[9] Zeppenfeld, T., Finke, M., Ries, K., Westphal, M. & Waibel, A. (1997) Recognition of Conversational Telephone Speech using the Janus Speech Engine. Proceedings of ICASSP 97, Muenchen, Germany. \n[10] Waterhouse, S.R. & Robinson, A.J. (1994) Classification using Hierarchical Mixtures of Experts. In Proc. 1994 IEEE Workshop on Neural Networks for Signal Processing IV, pp. 177-186. \n[11] Waterhouse, S.R. & Robinson, A.J. (1995) Constructive Algorithms for Hierarchical Mixtures of Experts. In Advances in Neural Information Processing Systems 8. 
\n[12] Zhao, Y., Schwartz, R, Sroka, J. & Makhoul, J. (1995) Hierarchical Mixtures of Ex(cid:173)\nperts Methodology Applied to Continuous Speech Recognition. In ICASSP 1995, volume \n5, pp. 3443-6, May 1995. \n\n\f", "award": [], "sourceid": 1279, "authors": [{"given_name": "J\u00fcrgen", "family_name": "Fritsch", "institution": null}, {"given_name": "Michael", "family_name": "Finke", "institution": null}, {"given_name": "Alex", "family_name": "Waibel", "institution": null}]}