{"title": "Statistical Mechanics of the Mixture of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 183, "page_last": 189, "abstract": null, "full_text": "Statistical Mechanics of the Mixture of Experts \n\nKukjin Kang and Jong-Hoon Oh \n\nDepartment of Physics \n\nPohang University of Science and Technology \n\nHyoja San 31, Pohang, Kyongbuk 790-784, Korea \n\nE-mail: kkj.jhoh@galaxy.postech.ac.kr \n\nAbstract \n\nWe study the generalization capability of the mixture of experts learning from examples generated by another network with the same architecture. When the number of examples is smaller than a critical value, the network shows a symmetric phase where the roles of the experts are not specialized. Upon crossing the critical point, the system undergoes a continuous phase transition to a symmetry-breaking phase where the gating network partitions the input space effectively and each expert is assigned to an appropriate subspace. We also find that the mixture of experts with multiple levels of hierarchy shows multiple phase transitions. \n\n1 Introduction \n\nRecently there has been considerable interest in the neural network community in techniques that integrate the collective predictions of a set of networks[1, 2, 3, 4]. The mixture of experts [1, 2] is a well-known example which implements the philosophy of divide-and-conquer elegantly. While this model is gaining popularity in various applications, there has been little effort to evaluate the generalization capability of these modular approaches theoretically. Here we present the first analytic study of generalization in the mixture of experts from the statistical physics perspective. \n\n\f184 \n\nK. Kang and J. Oh \n\n
Use of the statistical mechanics formulation has so far focused on feedforward architectures close to the multilayer perceptron[5, 6], together with the VC theory[8]. We expect that the statistical mechanics approach can also be used effectively to evaluate more advanced architectures, including mixture models. \n\nIn this letter we study generalization in the mixture of experts[1] and its variant with a two-level hierarchy[2]. The network is trained on examples given by a teacher network with the same architecture. We find an interesting phase transition driven by symmetry breaking among the experts. This phase transition is closely related to the divide-and-conquer mechanism which this mixture model was originally designed to accomplish. \n\n2 Statistical Mechanics Formulation for the Mixture of Experts \n\nThe mixture of experts[2] is a tree consisting of expert networks and gating networks which assign weights to the outputs of the experts. The expert networks sit at the leaves of the tree and the gating networks sit at its branching points. For the sake of simplicity, we consider a network with one gating network and two experts. Each expert produces its output μ_j as a generalized linear function of the N-dimensional input x: \n\nμ_j = f(W_j · x), j = 1, 2, (1) \n\nwhere W_j is the weight vector of the jth expert, subject to a spherical constraint[5]. We consider the transfer function f(x) = sgn(x), which produces binary outputs. The principle of divide-and-conquer is implemented by assigning each expert to a subspace of the input space with a different local rule. A gating network makes partitions in the input space and assigns each expert a weighting factor: \n\ng_j(x) = Θ(V_j · x), j = 1, 2, (2) \n\nwhere the gating function Θ(x) is the Heaviside step function. 
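As a concrete illustration of the expert outputs of Eq. (1) and the hard Heaviside gating of Eq. (2), the two-expert network can be sketched in a few lines of code. This sketch is ours, not from the paper; all names are illustrative, and the spherical constraint is imposed by rescaling random weights to norm sqrt(N):

```python
import numpy as np

def expert_output(W, x):
    """Expert outputs mu_j = sgn(W_j . x), Eq. (1) with f = sgn."""
    return np.sign(W @ x)

def gate(V, x):
    """Hard gating weights g_j(x) = Theta(V_j . x) with V_1 = -V_2 = V,
    so exactly one expert is selected for each input (Eq. (2))."""
    s = V @ x
    return np.array([1.0, 0.0]) if s > 0 else np.array([0.0, 1.0])

# Tiny usage example with random weights under the spherical constraint |W_j|^2 = N.
rng = np.random.default_rng(0)
N = 10
V = rng.standard_normal(N)
V *= np.sqrt(N) / np.linalg.norm(V)
W = rng.standard_normal((2, N))
W *= np.sqrt(N) / np.linalg.norm(W, axis=1, keepdims=True)
x = rng.standard_normal(N)
g = gate(V, x)
mu = float(g @ expert_output(W, x))  # weighted binary output, +1 or -1
assert mu in (-1.0, 1.0)
```

Because the gate is a hard step function rather than a softmax, the gating weights are always (1, 0) or (0, 1), and the combined output is itself binary.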
For two experts, this gating function defines a sharp boundary between the two subspaces, perpendicular to the vector V_1 = -V_2 = V, whereas the softmax function used in the original literature [2] yields a soft boundary. The weighted output of the mixture of experts is then written: \n\nμ(V, W; x) = Σ_{j=1,2} g_j(x) μ_j(x). (3) \n\nThe whole network, like the individual experts, generates binary outputs; therefore, it can learn only dichotomy rules. The training examples are generated by a teacher with the same architecture: \n\nσ(x) = Σ_{j=1,2} Θ(V_j^0 · x) sgn(W_j^0 · x), (4) \n\nwhere V_j^0 and W_j^0 are the weights of the gating network and of the jth expert of the teacher. \n\nThe learning of the mixture of experts is usually interpreted probabilistically, hence the learning algorithm is considered as a maximum likelihood estimation. Learning algorithms originating from statistical methods, such as the EM algorithm, are often used. Here we consider the Gibbs algorithm with noise level T (= 1/β), which after a long time leads to a Gibbs distribution of the weights: \n\nP(V, W_j) = Z^{-1} exp(-β E(V, W_j)), (5) \n\nwhere Z = ∫ dV dW exp(-β E(V, W_j)) is the partition function. Training both the experts and the gating network is necessary for a good generalization performance. The energy E of the system is defined as a sum of errors over the P examples: \n\nE(V, W_j) = Σ_{l=1}^{P} ε(V, W_j; x^l), (6) \n\nε(V, W_j; x) = Θ(-σ(x) μ(V, W; x)). (7) \n\nThe performance of the network is measured by the generalization function ε(V, W_j) = ∫ dx ε(V, W_j; x), where ∫ dx represents an average over the whole input space. The generalization error ε_g is defined by ε_g = ⟨⟨⟨ε(V, W_j)⟩_T⟩⟩, where ⟨⟨···⟩⟩ denotes the quenched average over the examples and ⟨···⟩_T denotes the thermal average over the probability distribution of Eq. (5). 
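The teacher rule of Eq. (4) and the error-counting energy of Eq. (6) lend themselves to a direct Monte Carlo check. The sketch below is ours, with illustrative names; it assumes a 0/1 error per example (student and teacher outputs disagree or not) and estimates the generalization function by sampling random Gaussian inputs:

```python
import numpy as np

def moe_output(V, W, x):
    """Network output: the hard gate Theta(+/- V . x) selects one of the two
    experts, whose output is sgn(W_j . x) (cf. Eqs. (3)-(4))."""
    j = 0 if V @ x > 0 else 1
    return np.sign(W[j] @ x)

def training_energy(V, W, V0, W0, X):
    """E = number of examples the student (V, W) misclassifies relative to
    the teacher (V0, W0); a 0/1 error per example, cf. Eq. (6)."""
    return float(sum(moe_output(V, W, x) != moe_output(V0, W0, x) for x in X))

def generalization_error(V, W, V0, W0, n_test=20000, seed=1):
    """Monte Carlo estimate of the input-space average of the 0/1 error."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_test, len(V)))
    return training_energy(V, W, V0, W0, X) / n_test

# Sanity check: a student identical to its teacher generalizes perfectly.
rng = np.random.default_rng(0)
N = 20
V0 = rng.standard_normal(N)
W0 = rng.standard_normal((2, N))
eg = generalization_error(V0, W0, V0, W0)
assert eg == 0.0
```

Replacing the exact input-space integral by a sample average is only a numerical stand-in for the analytic averages used in the text.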
\n\nSince the replica calculation turns out to be intractable, we use the annealed approximation: \n\n⟨⟨log Z⟩⟩ ≈ log⟨⟨Z⟩⟩. (8) \n\nThe annealed approximation is exact only in the high-temperature limit, but it is known to give qualitatively good results for the learning of realizable rules[5, 6]. \n\n3 Generalization Curve and the Phase Transition \n\nThe generalization function ε(V, W_j) can be written as a function of the overlaps between the weight vectors of the teacher and the student: \n\nε(V, W_j) = Σ_{i=1,2} Σ_{j=1,2} p_ij ε_ij, (9) \n\nwhere p_ij (Eq. (10)) and ε_ij (Eq. (11)) are defined below, and \n\nR^V_ij = (1/N) V_i · V_j^0, (12) \n\nR_ij = (1/N) W_i · W_j^0 (13) \n\nare the overlap order parameters. Here, p_ij is the probability that the ith expert of the student learns from examples generated by the jth expert of the teacher; it is the volume fraction of the input space where V_i · x and V_j^0 · x are both positive. For those examples, the ith expert of the student gives a wrong answer with probability ε_ij with respect to the jth expert of the teacher. We assume that the weight vectors of the teacher, V^0, W_1^0 and W_2^0, are orthogonal to each other; then the overlap order parameters other than the ones shown above vanish. We use the symmetry properties of the network, R_V = R^V_11 = R^V_22 = -R^V_12, R = R_11 = R_22, and r = R_12 = R_21. \n\nThe free energy can also be written as a function of the three order parameters R_V, R, and r. We now consider the thermodynamic limit where the dimension of the input space N and the number of examples P go to infinity, keeping the ratio α = P/N finite. By minimizing the free energy with respect to the order parameters, we find the most probable values of the order parameters as well as the generalization error. \n\nFig. 1(a) plots the overlap order parameters R_V, R and r versus α at temperature T = 5. 
Examining the plot, we find an interesting phase transition driven by symmetry breaking among the experts. Below the phase transition point α_c = 51.5, the overlap between the gating networks of the teacher and the student is zero (R_V = 0) and the overlaps between the experts are symmetric (R = r). In the symmetric phase, the gating network does not have enough examples to learn a proper partitioning, so its performance is not much better than a random partitioning. Consequently, no expert of the student can specialize in the subspace governed by a particular local rule of an expert of the teacher. Each expert has to approximate several of the teacher's local rules with a single linear structure, which leads to a poor generalization performance. Unless more than a critical amount of examples is provided, the divide-and-conquer strategy does not work. \n\nUpon crossing the critical point α_c, the system undergoes a continuous phase transition to the symmetry-breaking phase. The order parameter R_V, related to the goodness of the partition, begins to increase abruptly and approaches 1 with increasing α. The gating network now provides a better partition, which is close to that of the teacher. The order parameters R and r, the overlaps between the experts of the teacher and the student, branch at α_c and approach 1 and 0, respectively. This means that each expert specializes by pairing up with a particular expert of the teacher. Fig. 1(b) plots the generalization curve (ε_g versus α) on the same scale. Though the generalization curve is continuous, the slope of the curve changes discontinuously at the transition point, so that the generalization curve has a cusp. \n\nFigure 1: (a) The overlap order parameters R_V, R, r versus α at T = 5. For α < α_c = 51.5, we find R_V = 0 (solid line that follows the x axis) and R = r (dashed line). At the transition point, R_V begins to increase abruptly, and R (dotted line) and r (dashed line) branch, approaching 1 and 0 respectively. (b) The generalization curve (ε_g versus α) for the mixture of experts on the same scale. A cusp at the transition point α_c is shown. \n\nThe asymptotic behavior of ε_g at large α is given by: \n\nε_g ≈ 3 / ((1 - e^{-β}) α), (14) \n\nwhere the 1/α decay is often observed in the learning of other feedforward networks. \n\n4 The Mixture of Experts with Two-Level Hierarchy \n\nWe also study generalization in the hierarchical mixture of experts [2]. Consider a two-level hierarchical mixture of experts consisting of three gating networks and four experts. At the top level the tree is divided into two branches, which are in turn divided into two branches each at the lower level. The experts sit at the four leaves of the tree, and the three gating networks sit at the top- and lower-level branching points. The network again learns from training examples drawn from a teacher network with the same architecture. \n\nFigure 2: A typical generalization error curve for the hierarchical mixture of experts (HME) with continuous weights, at T = 5. \n\nFig. 2 shows the corresponding learning curve, which has two cusps related to the phase transitions. 
For α < α_c1, the system is in the fully symmetric phase. The gating networks do not provide a correct partition for the experts at either level of the hierarchy, and the experts cannot specialize at all. All the overlaps with the weights of the teacher experts have the same value. The first phase transition, at the smaller α_c1, is related to the symmetry breaking of the top-level gating network. For α_c1 < α < α_c2, the top-level gating network partitions the input space into two parts, but the lower-level gating networks are not yet functioning properly: the overlap between the gating networks at the lower level of the tree and those of the teacher is still zero. The experts partially specialize into two groups, but specialization within each group is not yet accomplished. The overlap order parameter R_ij can take two distinct values: the larger is the overlap with the two experts of the teacher for which the group is specializing, and the smaller is with the teacher's experts belonging to the other group. At the second transition point α_c2, the symmetry related to the lower level of the hierarchy breaks. For α > α_c2, all the gating networks work properly and the input space is divided into four parts. Each expert pairs up with an appropriate expert of the teacher. The overlap order parameters can now take three distinct values: the largest is the overlap with the matching expert of the teacher, the next largest is the overlap with the neighboring teacher expert in the tree hierarchy, and the smallest is with the experts of the other group. The two phase transitions result in the two cusps of the learning curve. 
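The two-level tree described above can be sketched as code. This is a minimal illustration of ours, with hypothetical names; it assumes hard Θ gates at both levels of the hierarchy, as in the two-expert case, so that exactly one of the four experts answers each input:

```python
import numpy as np

def hme_output(V_top, V_low, W, x):
    """Two-level hierarchical mixture with hard gates: the top-level gate
    picks a branch, that branch's lower-level gate picks one of its two
    experts, and the selected leaf expert outputs sgn(W_j . x)."""
    branch = 0 if V_top @ x > 0 else 1                   # top-level partition
    leaf_in_branch = 0 if V_low[branch] @ x > 0 else 1   # lower-level partition
    j = 2 * branch + leaf_in_branch                      # expert index 0..3
    return np.sign(W[j] @ x)

# Usage with random weights: three gating vectors, four expert vectors.
rng = np.random.default_rng(0)
N = 16
V_top = rng.standard_normal(N)
V_low = rng.standard_normal((2, N))   # one lower-level gate per branch
W = rng.standard_normal((4, N))       # four expert weight vectors at the leaves
x = rng.standard_normal(N)
out = hme_output(V_top, V_low, W, x)
assert out in (-1.0, 1.0)
```

The nesting makes the two symmetries explicit: permuting the two branches, and permuting the two experts within a branch, which break at α_c1 and α_c2 respectively.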
\n\n5 Conclusion \n\nWhereas the phase transition of the mixture of experts can be interpreted as a symmetry-breaking phenomenon similar to the one already observed in the committee machine and the multi-layer perceptron[6, 7], the transition is novel in that it is continuous. This means that symmetry breaking is easier for the mixture of experts than for the multi-layer perceptron. This can be a big advantage in the learning of highly nonlinear rules, as we do not have to worry about the existence of local minima. We find that the hierarchical mixture of experts can have multiple phase transitions, which are related to symmetry breaking at different levels. Note that symmetry breaking comes first at the higher-level branch, which is a desirable property of the model. \n\nWe thank M. I. Jordan, L. K. Saul, H. Sompolinsky, H. S. Seung, H. Yoon and C. Kwon for useful discussions and comments. This work was partially supported by the Basic Science Special Program of the POSTECH Basic Science Research Institute. \n\nReferences \n\n[1] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, Neural Computation 3, 79 (1991). \n\n[2] M. I. Jordan and R. A. Jacobs, Neural Computation 6, 181 (1994). \n\n[3] M. P. Perrone and L. N. Cooper, in Neural Networks for Speech and Image Processing, R. J. Mammone, Ed., Chapman-Hall, London, 1993. \n\n[4] D. Wolpert, Neural Networks 5, 241 (1992). \n\n[5] H. S. Seung, H. Sompolinsky, and N. Tishby, Phys. Rev. A 45, 6056 (1992). \n\n[6] K. Kang, J.-H. Oh, C. Kwon and Y. Park, Phys. Rev. E 48, 4805 (1993); K. Kang, J.-H. Oh, C. Kwon and Y. Park, Phys. Rev. E 54, 1816 (1996). \n\n[7] E. Baum and D. Haussler, Neural Computation 1, 151 (1989). \n\n\f", "award": [], "sourceid": 1176, "authors": [{"given_name": "Kukjin", "family_name": "Kang", "institution": null}, {"given_name": "Jong-Hoon", "family_name": "Oh", "institution": null}]}