{"title": "Discontinuous Generalization in Large Committee Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 399, "page_last": 406, "abstract": null, "full_text": "Discontinuous Generalization in Large \n\nCommittee Machines \n\nH. Schwarze \n\nDept. of Theoretical Physics \n\nJ. Hertz \nNordita \n\nLund University \nSolvegatan 14A \n\n223 62 Lund \n\nSweden \n\nBlegdamsvej 17 \n\n2100 Copenhagen 0 \n\nDenmark \n\nAbstract \n\nThe problem of learning from examples in multilayer networks is \nstudied within the framework of statistical mechanics. Using the \nreplica formalism we calculate the average generalization error of a \nfully connected committee machine in the limit of a large number \nof hidden units. If the number of training examples is proportional \nto the number of inputs in the network, the generalization error \nas a function of the training set size approaches a finite value. If \nthe number of training examples is proportional to the number of \nweights in the network we find first-order phase transitions with a \ndiscontinuous drop in the generalization error for both binary and \ncontinuous weights. \n\n1 \n\nINTRODUCTION \n\nFeedforward neural networks are widely used as nonlinear, parametric models for the \nsolution of classification tasks and function approximation. Trained from examples \nof a given task, they are able to generalize, i.e. to compute the correct output for \nnew, unknown inputs. Since the seminal work of Gardner (Gardner, 1988) much \neffort has been made to study the properties of feedforward networks within the \nframework of statistical mechanics; for reviews see e.g. (Hertz et al., 1989; Watkin et \nal., 1993). Most of this work has concentrated on the simplest feedforward network, \nthe simple perceptron with only one layer of weights connecting the inputs with a \n\n399 \n\n\f400 \n\nSchwarze and Hertz \n\nsingle output. 
However, most applications have to utilize architectures with hidden layers, for which only a few general theoretical results are known, e.g. (Levin et al., 1989; Krogh and Hertz, 1992; Seung et al., 1992).

As an example of a two-layer network we study the committee machine (Nilsson, 1965). This architecture has only one layer of adjustable weights, while the hidden-to-output weights are fixed to +1 so as to implement a majority decision of the hidden units. For binary weights this may already be regarded as the most general two-layer architecture, because any other combination of hidden-to-output weights can be gauged to +1 by flipping the signs of the corresponding input-to-hidden weights. Previous work has been concerned with some restricted versions of this model, such as learning geometrical tasks in machines with local input-to-hidden connectivity (Sompolinsky and Tishby, 1990) and learning in committee machines with nonoverlapping receptive fields (Schwarze and Hertz, 1992; Mato and Parga, 1992). In this tree-like architecture there are no correlations between hidden units, and its behavior was found to be qualitatively similar to that of the simple perceptron.

Recently, learning in fully connected committee machines has been studied within the annealed approximation (Schwarze and Hertz, 1993a,b; Kang et al., 1993), revealing properties which are qualitatively different from those of the tree model. However, the annealed approximation (AA) is only valid at high temperatures, and a correct description of learning at low temperatures requires the solution of the quenched theory. The purpose of this paper is to extend previous work towards a better understanding of the learning properties of multilayer networks. We present results for the average generalization error of a fully connected committee machine within the replica formalism and compare them to results obtained within the AA.
In particular we consider a large-net limit in which both the number of inputs N and the number of hidden units K go to infinity, but with K << N. The target rule is defined by another fully connected committee machine and is therefore realizable by the learning network.

2 THE MODEL

We consider a network with N inputs, K hidden units and a single output unit σ. Each hidden unit σ_l, l ∈ {1, ..., K}, is connected to the inputs S = (S_1, ..., S_N) through the weight vector W_l and performs the mapping

σ_l(W_l, S) = sign( N^{-1/2} W_l · S ).   (1)

The hidden units may be regarded as outputs of simple perceptrons and will be referred to as students. The factor N^{-1/2} in (1) is included for convenience; it ensures that in the limit N → ∞ and for iid inputs the argument of the sign function is of order 1. The overall network output is defined as the majority vote of the student committee, given by

σ({W_l}, S) = sign( Σ_{l=1}^{K} σ_l(W_l, S) ).   (2)

This network is trained from P = αKN input-output examples (ξ^μ, τ(ξ^μ)), μ ∈ {1, ..., P}, of the desired mapping τ, where the components ξ_i^μ of the training inputs are independently drawn from a distribution with zero mean and unit variance. We study a realizable task defined by another committee machine with weight vectors V_l (the teachers), hidden units τ_l and an overall output τ(S) of the form (2). We will discuss both the binary version of this model, with W_l, V_l ∈ {±1}^N, and the continuous version, in which the W_l's and V_l's are normalized to √N.

The goal of learning is to find a network that performs well on unknown examples, which are not included in the training set. The network quality can be measured by the generalization error

ε({W_l}) = ⟨ Θ[ −σ({W_l}, S) τ(S) ] ⟩_S,   (3)

the probability that a randomly chosen input is misclassified.
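As an illustration of (1)-(3), the following sketch (our own, not code from the paper; function names are our choices) builds a binary student and teacher committee machine and estimates the generalization error by sampling random inputs:

```python
import numpy as np

def committee_output(W, S):
    """Majority vote (2) over the K hidden units (1)."""
    N = W.shape[1]
    hidden = np.sign(W @ S / np.sqrt(N))    # hidden-unit outputs, eq. (1)
    return np.sign(hidden.sum())            # majority decision, eq. (2)

def generalization_error(W, V, n_samples=10000, rng=None):
    """Monte Carlo estimate of eq. (3): the student/teacher disagreement rate."""
    if rng is None:
        rng = np.random.default_rng(0)
    N = W.shape[1]
    disagree = 0
    for _ in range(n_samples):
        S = rng.standard_normal(N)          # iid inputs, zero mean, unit variance
        disagree += committee_output(W, S) != committee_output(V, S)
    return disagree / n_samples

rng = np.random.default_rng(1)
K, N = 5, 99                                # sizes used in the simulations of figure 1
V = rng.choice([-1.0, 1.0], size=(K, N))    # binary teacher weights
W = rng.choice([-1.0, 1.0], size=(K, N))    # unrelated random student
print(generalization_error(W, V))           # close to 1/2 for an unrelated student
```

With odd K the majority vote in (2) is always well defined, since the sum of K hidden outputs ±1 cannot vanish.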
\n\n(3) \n\nFollowing the statistical mechanics approach we consider a stochastic learning al(cid:173)\ngorithm that for long training times yields a Gibbs distribution of networks with \nthe corresponding partition function \n\nZ = J dpo({W, }) e-f1Et ({W,}) , \n\n(4) \n\n(5) \n\nwhere \n\nis the training error, {3 = liT is a formal temperature parameter, and po( {W,}) \nincludes a priori constraints on the weights. The average generalization and train(cid:173)\ning errors at thermal equilibrium, averaged over all representations of the training \nexamples, are given by \n\n\" \n\n(( (\u20ac({W,}))T)) \n1 \nP (( (Et({~}))T )), \n\n(6) \nwhere (( ... )) denotes a quenched average over the training examples and ( ... )T a \nthermal average. These quantities may be obtained from the average free energy \nF = - T (( In Z )), which can be calculated within the standard replica formalism \n(Gardner, 1988; Gyorgyi and Tishby, 1990). \n\nFollowing this approach, we introduce order parameters and make symmetry as(cid:173)\nsumptions for their values at the saddle point of the free energy; for details of the \ncalculation see (Schwarze, 1993). We assume replica symmetry (RS) and a par(cid:173)\ntial committee symmetry allowing for a specialization of the hidden units on their \nrespective teachers. Furthermore, a self-consistent solution of the saddle-point \nequations requires scaling assumptions for the order parameters. Hence, we are left \nwith the ansatz \n\n1 \n\nR'k = N (( ( ~)T . V k )) \n1 \n\nD,k = N(((W,)T,(((Wk)T)) \n\n1 \n\nC'k= N(((W\"Wk)T)) \n\n(7) \n\n\f402 \n\nSchwarze and Hertz \n\nwhere p, ~, d, q and c are of order 1. For ~ = q = 0 this solution is symmetric \nunder permutations of hidden units in the student network, while nonvanishing ~ \nand q indicate a specialization of hidden units that breaks this symmetry. 
The values of the order parameters at the saddle point of the replica free energy finally allow the calculation of the average generalization and training errors.

3 THEORETICAL RESULTS

In the limit of small training set sizes, α ∼ O(1/K), we find a committee-symmetric solution in which each student weight vector has the same overlap with all the teacher vectors, corresponding to Δ = q = 0. For both binary and continuous weights the generalization error of this solution approaches a nonvanishing residual value, as shown in figure 1. Note that the asymptotic generalization ability of the committee-symmetric solution improves with increasing noise level.

Figure 1: a) Generalization (upper curve) and training (lower curve) error as functions of ᾱ = Kα. The results of Monte Carlo simulations for the generalization (open symbols) and training (closed symbols) errors are shown for K = 5 (circles) and K = 15 (triangles), with T = 0.5 and N = 99. The vertical lines indicate the predictions of the large-K theory for the location of the phase transition ᾱ_c = Kα_c in the binary model for K = 5 and K = 15, respectively. b) Temperature dependence of the asymptotic generalization and training errors for the committee-symmetric solution.

Only if the number of training examples is sufficiently large, α ∼ O(1), can the committee symmetry be broken in favor of a specialization of hidden units.
We find first-order phase transitions to solutions with Δ, q > 0 in both the continuous and the binary model. While in the binary model the transition is accompanied by a perfect alignment of the hidden-unit weight vectors with their respective teachers (Δ = 1), this is not possible in the continuous model. Instead, in the continuous model we find a close approach of each student vector to one of the teachers: at a critical value α_s(T) of the load parameter a second minimum of the free energy appears, corresponding to the specialized solution with Δ, q > 0. This solution becomes the global minimum at α_c(T) > α_s(T), and its generalization error decays algebraically. In both models the symmetric, poorly generalizing state remains metastable for arbitrarily large α. For increasing system sizes it will take exponentially long times for a stochastic training algorithm to escape from this local minimum (see figure 1a). Figure 2 shows the qualitative behavior of the generalization error for the continuous model, and the phase diagrams in figure 3 show the locations of the transitions for both models.

Figure 2: Schematic behavior of the generalization error in the large-K committee machine with continuous weights.

In the binary model a region of negative thermodynamic entropy (below the dashed line in figure 3a) suggests that replica symmetry has to be broken to correctly describe the metastable, symmetric solution at large α.
\nA comparison of the RS solution with the results previously obtained within the \nAA (Schwarze and Hertz, 1993a,b) shows that the AA gives a qualitatively correct \ndescription of the main features of the learning curve. However, it fails to predict the \ntemperature dependence of the residual generalization error (figure 1 b) and gives an \nincorrect description of the approach to this value. Furthermore, the quantitative \npredictions for the locations of the phase transitions differ considerably (figure 3). \n\n4 SIMULATIONS \n\nWe have performed Monte Carlo simulations to check our analytical findings for the \nbinary model (see figure 1a). The influence of the metastable, poorly generalizing \nstate is reflected by the fact that at low temperatures the simulations do not follow \nthe predicted phase transition but get trapped in the metastable state. Only at \nhigher temperatures do the simulations follow the first order transition (Schwarze, \n1993). Furthermore, the deviation of the training error from the theoretical result \nindicates the existence of replica symmetry breaking for finite Q. However, the gen(cid:173)\neralization error of the symmetric state is in good quantitative agreement with the \n\n\f404 \n\nSchwarze and Hertz \n\n0.8 \n\n0.6 \n\nE-< 0.4 \n\n0.2 \n\n; \n.I \n;,l' \n/ ; \n\n; \n; \nl , \n; \n; \n; \n; \nI ,. \ni \nI \ni \ni \nj \nj \n\ni , \ni \n\n1.0 \n\n0.8 \n\n0.6 \n\n0.4 \n\n0.2 \n\n/ .......\u2022.\u2022. / \n\n... --.. // ..... \nl./\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7\u00b7 \nr \n\nO.O~~~~~~! ~~~~~~~~ O.O~~~~~~~~~!~-u~~ \nb) 1.0 1.5 2.0 2.5 3.0 3.5 4.0 \n\n30 \n\n25 \n\na) \n\n5 \n\n10 \n\n15 \n20 \n0: = P/KN \n\n0: = P /KN \n\nFigure 3: Phase diagrams of the large-K committee machine. 
\na) continuous weights: The two left lines show the RS results for the spinodal \nline (--), where the specialized solution appears, and the location of the phase \ntransition (-). These results are compared to the predictions of the AA for the \nspinodal line (- . -) and the phase transition ( ... ). \nb) binary weights: The RS result for the location of the phase transition (-) and \nits zero-entropy line (--) are compared to the prediction of the AA for the phase \ntransition ( ... ) and its zero-entropy line (-\n\n. -). \n\ntheoretical results. \nIn order to investigate whether our analytical results for a Gibbs ensemble of com(cid:173)\nmittee machines carries over to other learning scenarios we have studied a variation \nof this model allowing the use of backpropagation. We have considered a 'soft(cid:173)\ncommittee' whose output is given by \n\nq( {W,}. S) = tanh (t. tanh (J\u00a3, . S\u00bb. \n\n(8) \n\nThe first-layer weights W, of this network were trained on examples (el', r(el'\u00bb, \nJ.\u00a3 E {l, ... , P}, defined by another soft-committee with weight vectors V, using \non-line backpropagation with the error function \n\n\u00a3(S) = (1/2)[0'({~}, S) - r(S)]2. \n\n(9) \nIn general this procedure is not guaranteed to yield a Gibbs distribution of weights \n(Hansen et al., 1993) and therefore the above analysis does not apply to this case. \nHowever, the generalization error for a network with N = 45 inputs and K = \n3 hidden units, averaged over 50 independent runs, shows the same qualitative \nbehavior as predicted for the Gibbs ensemble of committee machines (see figure 4). \nAfter an initial approach to a nonvanishing value, the average generalization error \ndecreases rather smoothly to zero. This smooth decrease of the average error is \ndue to the fact that some runs got trapped in a poorly-generalizing, committee(cid:173)\nsymmetric solution while others found a specialized solution with a close approach \nto the teacher. 
\n\n\fDiscontinuous Generalization in Large Committee Machines \n\n405 \n\n0.18 r--.....,.----r--.....,.---r--.....,.------r'1 \n\n0.16 \n0.1. i \n0.12 \n\n0.06 \n\n0.0. \n\n0.02 \n\n200 \n\n600 \n\n800 \n\n1000 \n\n1200 \n\nP \n\nFigure 4: Generalization error and training error of the 'soft-committee' with N = \n45 and K = 3. We have used standard on-line backpropagation for the first-layer \nweights with a learning rate 11 = 0.01 for 1000 epochs. the results are averaged over \n50 runs with different teacher networks and different training sets. \n\n5 CONCLUSION \n\nWe have presented the results of a calculation of the generalization error of a multi(cid:173)\nlayer network within the statistical mechanics approach. We have found nontrivial \nbehavior for networks with both continuous and binary weights. \nels, phase transitions from a symmetric, poorly-generalizing solution to one with \nspecialized hidden units occur, accompanied by a discontinuous drop of the gener(cid:173)\nalization error. However, the existence of a metastable, poorly generalizing solution \nbeyond the phase transition implies the possibility of getting trapped in a local \nminimum during the training process. Although these results were obtained for a \nGibbs distribution of networks, numerical experiments indicate that some of the \ngeneral results carryover to other learning scenarios. \n\nIn both mod(cid:173)\n\nAcknowledgements \n\nThe authors would like to thank M. Biehl and S. Solla for fruitful discussions. HS \nacknowledges support from the EC under the SCIENCE programme (under grant \nnumber B/SCl * /915125) and by the Danish Natural Science Council and the Danish \nTechnical Research Council through CONNECT. \n\nReferences \n\nE. Gardner (1988), J. Phys. A 21, 257. \nG. Gyorgyi and N. Tishby (1990), in Neural Networks and Spin Glasses, edited by \nK. Thuemann and R. Koberle, (World scientific, Singapore). \nL.K. Hansen, R. Pathria, and P. Salamon (1993), J. Phys. A 26, 63. \nJ. 
Hertz, A. Krogh, and R.G. Palmer (1989), Introduction to the Theory of Neural Computation (Addison-Wesley, Redwood City, CA).
K. Kang, J.-H. Oh, C. Kwon, and Y. Park (1993), preprint, Pohang Institute of Science and Technology, Korea.
A. Krogh and J. Hertz (1992), in Advances in Neural Information Processing Systems IV, eds. J.E. Moody, S.J. Hanson, and R.P. Lippmann (Morgan Kaufmann, San Mateo).
E. Levin, N. Tishby, and S.A. Solla (1989), in Proc. 2nd Workshop on Computational Learning Theory (Morgan Kaufmann, San Mateo).
G. Mato and N. Parga (1992), J. Phys. A 25, 5047.
N.J. Nilsson (1965), Learning Machines (McGraw-Hill, New York).
H. Schwarze (1993), J. Phys. A 26, 5781.
H. Schwarze and J. Hertz (1992), Europhys. Lett. 20, 375.
H. Schwarze and J. Hertz (1993a), J. Phys. A 26, 4919.
H. Schwarze and J. Hertz (1993b), in Advances in Neural Information Processing Systems V (Morgan Kaufmann, San Mateo).
H.S. Seung, H. Sompolinsky, and N. Tishby (1992), Phys. Rev. A 45, 6056.
H. Sompolinsky and N. Tishby (1990), Europhys. Lett. 13, 567.
T. Watkin, A. Rau, and M. Biehl (1993), Rev. Mod. Phys. 65, 499.
", "award": [], "sourceid": 839, "authors": [{"given_name": "H.", "family_name": "Schwarze", "institution": null}, {"given_name": "J.", "family_name": "Hertz", "institution": null}]}