{"title": "Constructive Algorithms for Hierarchical Mixtures of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 584, "page_last": 590, "abstract": null, "full_text": "Constructive Algorithms for Hierarchical Mixtures of Experts\n\nS. R. Waterhouse, A. J. Robinson\n\nCambridge University Engineering Department, Trumpington St., Cambridge, CB2 1PZ, England. Tel: [+44] 1223 332754, Fax: [+44] 1223 332662, Email: srw1001.ajr@eng.cam.ac.uk\n\nAbstract\n\nWe present two additions to the hierarchical mixture of experts (HME) architecture. First, by applying a likelihood splitting criterion to each expert in the HME we \"grow\" the tree adaptively during training. Secondly, by considering only the most probable path through the tree we may \"prune\" branches away, either temporarily, or permanently if they become redundant. We demonstrate results for the growing and path pruning algorithms which show significant speed-ups and more efficient use of parameters over the standard fixed structure in discriminating between two interlocking spirals and classifying 8-bit parity patterns.\n\nINTRODUCTION\n\nThe HME (Jordan & Jacobs 1994) is a tree-structured network whose terminal nodes are simple function approximators in the case of regression, or classifiers in the case of classification. The outputs of the terminal nodes, or \"experts\", are recursively combined upwards towards the root node, to form the overall output of the network, by \"gates\" which are situated at the non-terminal nodes.\n\nThe HME has clear similarities with tree-based statistical methods such as Classification and Regression Trees (CART) (Breiman, Friedman, Olshen & Stone 1984). We may consider the gate as replacing the set of \"questions\" which are asked at each branch of CART. From this analogy, we may consider the application of the splitting rules used to build CART. 
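The recursive, bottom-up combination of expert outputs by gates described above can be sketched in a few lines of code (an illustrative fragment only, not the authors' implementation; the gate distributions here are given directly rather than computed from the input, and two-class output vectors are assumed):

```python
# Minimal sketch of bottom-up output combination in an HME.
# A node is either a terminal "expert" holding an output vector,
# or a non-terminal node holding a gate distribution over children.
# Two-dimensional outputs are assumed for brevity.

def hme_output(node):
    """Return the combined output at the root of this (sub)tree."""
    if 'output' in node:          # terminal node: an expert
        return node['output']
    mixed = [0.0, 0.0]            # gate mixes the children's outputs
    for g, child in zip(node['gate'], node['children']):
        out = hme_output(child)
        mixed = [m + g * o for m, o in zip(mixed, out)]
    return mixed

# A one-gate, two-expert tree (the simplest mixture of experts):
tree = {'gate': [0.7, 0.3],
        'children': [{'output': [1.0, 0.0]},
                     {'output': [0.0, 1.0]}]}
```

Here the root output is the gate-weighted mixture of the two experts' output vectors, and the same recursion handles arbitrarily deep trees.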
We start with a simple tree consisting of two experts and one gate. After partially training this simple tree we apply the splitting criterion to each terminal node. This evaluates the log-likelihood increase obtained by splitting each expert into two experts and a gate. The split which yields the best increase in log-likelihood is then added permanently to the tree. This process of training followed by growing continues until the desired modelling power is reached.\n\nFigure 1: A simple mixture of experts.\n\nThis approach is reminiscent of Cascade Correlation (Fahlman & Lebiere 1990), in which new hidden nodes are added to a multi-layer perceptron and trained while the rest of the network is kept fixed.\n\nThe HME also has similarities with model merging techniques such as stacked regression (Wolpert 1993), in which explicit partitions of the training set are combined. However, the HME differs from model merging in that each expert considers the whole input space in forming its output. Whilst this allows the network more flexibility, since each gate may implicitly partition the whole input space in a \"soft\" manner, it leads to unnecessarily long computation in the case of near-optimally trained models. At any one time only a few paths through a large network may have high probability. In order to overcome this drawback, we introduce the idea of \"path pruning\", which considers only those paths from the root node which have probability greater than a certain threshold.\n\nCLASSIFICATION USING HIERARCHICAL MIXTURES OF EXPERTS\n\nThe mixture of experts, shown in Figure 1, consists of a set of \"experts\" which perform local function approximation. The expert outputs are combined by a gate to form the overall output. 
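The growing procedure above admits a compact sketch (illustrative only: `split_gain` is a toy stand-in for the partial training and log-likelihood evaluation of each candidate split, which the actual algorithm performs by EM):

```python
# Sketch of one grow step: split the terminal node whose replacement
# by two experts and a gate gives the best log-likelihood increase.
# Leaf labels and the gain function are toy stand-ins.

def grow_step(leaves, split_gain):
    """Return the new leaf set after making the single best split."""
    best = max(leaves, key=split_gain)
    if split_gain(best) <= 0.0:          # no split improves the fit
        return leaves
    children = [(best, 0), (best, 1)]    # expert -> two experts + gate
    return [leaf for leaf in leaves if leaf != best] + children

# Toy gains: pretend splitting expert 'a' raises the log-likelihood most.
gains = {'a': 3.2, 'b': 1.1}
leaves = grow_step(['a', 'b'], lambda leaf: gains.get(leaf, 0.0))
```

In a full implementation this step alternates with further training of the whole tree, continuing until the desired modelling power is reached, as described above.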
In the hierarchical case, the experts are themselves mixtures of further experts, thus extending the architecture in a tree-structured fashion. Each terminal node or \"expert\" may take on a variety of forms, depending on the application. In the case of multi-way classification, each expert outputs a vector y_j in which element m is the conditional probability of class m (m = 1 ... M), computed using the softmax function:\n\nP(C_m | x^(n), W_j) = exp(w_mj' x^(n)) / sum_{k=1}^{M} exp(w_kj' x^(n))\n\nwhere W_j = [w_1j w_2j ... w_Mj] is the parameter matrix for expert j and C_m denotes class m.\n\nThe outputs of the experts are combined using a \"gate\" which sits at the non-terminal nodes. The gate outputs are estimates of the conditional probability of selecting the daughters of the non-terminal node given the input and the path taken to that node from the root node. This is once again computed using the softmax function, here with gate parameter vectors v_j:\n\nP(z_j | x^(n), v) = exp(v_j' x^(n)) / sum_k exp(v_k' x^(n))\n\nFigure 6: The effect of pruning on the two-spirals classification problem by an 8-deep binary-branching HME: (a) log-likelihood vs. time (CPU seconds), with log pruning thresholds for experts and gates ε: (i) ε = -5, (ii) ε = -10, (iii) ε = -15, (iv) no pruning; (b) training set for the two-spirals task; the two classes are indicated by crosses and circles; (c) solution to the two-spirals problem.\n\nReferences\n\nBreiman, L., Friedman, J., Olshen, R. & Stone, C. J. (1984), Classification and Regression Trees, Wadsworth and Brooks/Cole.\n\nCun, Y. L., Denker, J. S. & Solla, S. A. (1990), Optimal brain damage, in D. S. 
Touretzky, ed., 'Advances in Neural Information Processing Systems 2', Morgan Kaufmann, pp. 598-605.\n\nDempster, A. P., Laird, N. M. & Rubin, D. B. (1977), 'Maximum likelihood from incomplete data via the EM algorithm', Journal of the Royal Statistical Society, Series B 39, 1-38.\n\nFahlman, S. E. & Lebiere, C. (1990), The Cascade-Correlation learning architecture, Technical Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213.\n\nJordan, M. I. & Jacobs, R. A. (1994), 'Hierarchical Mixtures of Experts and the EM algorithm', Neural Computation 6, 181-214.\n\nMoody, J. E. (1992), The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems, in J. E. Moody, S. J. Hanson & R. P. Lippmann, eds, 'Advances in Neural Information Processing Systems 4', Morgan Kaufmann, San Mateo, California, pp. 847-854.\n\nWaterhouse, S. R. & Robinson, A. J. (1994), Classification using hierarchical mixtures of experts, in 'IEEE Workshop on Neural Networks for Signal Processing', pp. 177-186.\n\nWaterhouse, S. R., MacKay, D. J. C. & Robinson, A. J. (1995), Bayesian methods for mixtures of experts, in D. S. Touretzky, M. C. Mozer & M. E. Hasselmo, eds, 'Advances in Neural Information Processing Systems 8', MIT Press.\n\nWolpert, D. H. (1993), Stacked generalization, Technical Report LA-UR-90-3460, The Santa Fe Institute, 1660 Old Pecos Trail, Suite A, Santa Fe, NM, 87501.\n", "award": [], "sourceid": 1165, "authors": [{"given_name": "Steve", "family_name": "Waterhouse", "institution": null}, {"given_name": "Anthony", "family_name": "Robinson", "institution": null}]}