{"title": "Competition Among Networks Improves Committee Performance", "book": "Advances in Neural Information Processing Systems", "page_first": 592, "page_last": 598, "abstract": null, "full_text": "Competition Among Networks \nImproves Committee Performance \n\nPaul W. Munro \n\nBambang Parman to \n\nDepartment of Infonnation Science \n\nDepartment of Health Infonnation \n\nand Telecommunications \nUniversity of Pittsburgh \nPittsburgh PA 15260 \n\nmunro@sis.pitt.edu \n\nManagement \n\nUniversity of Pittsburgh \nPittsburgh PA 15260 \n\nparmanto+@pitt.edu \n\nABSTRACT \n\nThe separation of generalization error into two types, bias and variance \n(Geman, Bienenstock, Doursat, 1992), leads to the notion of error \nreduction by averaging over a \"committee\" of classifiers (Perrone, \n1993). Committee perfonnance decreases with both the average error of \nthe constituent classifiers and increases with the degree to which the \nmisclassifications are correlated across the committee. Here, a method \nfor reducing correlations is introduced, that uses a winner-take-all \nprocedure similar to competitive learning to drive the individual \nnetworks to different minima in weight space with respect to the \ntraining set, such that correlations in generalization perfonnance will be \nreduced, thereby reducing committee error. \n\n1 INTRODUCTION \n\nThe problem of constructing a predictor can generally be viewed as finding the right \ncombination of bias and variance (Geman, Bienenstock, Doursat, 1992) to reduce the \nexpected error. Since a neural network predictor inherently has an excessive number of \nparameters, reducing the prediction error is usually done by reducing variance. Methods \nfor reducing neural network complexity can be viewed as a regularization technique to \nreduce this variance. Examples of such methods are Optimal Brain Damage (Le Cun et. \nal., 1991), weight decay (Chauvin, 1989), and early stopping (Morgan & Boulard, 1990). 
\n\nThe idea of combining several predictors to fonn a single, better predictor (Bates & \nGranger, 1969) has been applied using neural networks in recent years (Wolpert, 1992; \nPerrone, 1993; Hashem, 1994). \n\n\fCompetition Among Networks Improves Committee Performance \n\n593 \n\n2 REDUCING MISCLASSIFICATION CORRELATION \n\nSince committee errors occur when too many individual predictors are in error, committee \nperformance improves as the correlation of network misclassifications decreases. Error \ncorrelations can be handled by using a weighted sum to generate a committee prediction; \nthe weights can be estimated by using ordinary least squares (OLS) estimators (Hashem, \n1994) or by using Lagrange multipliers (Perrone, 1993). \n\nAnother approach (Parmanto et al., 1994) is to reduce error correlation directly by \nattempting to drive the networks to different minima in weight space, that will \npresumably have different generalization syndromes, or patterns of error with respect to a \ntest set (or better yet, the entire stimulus space). \n\n2.1 Data Manipulations \n\nTraining the networks using nonidentical data has been shown to improve committee \nperformance, both when the data sets are from mutually exclusive continuous regions (eg, \nJacobs et al.,1991), or when the training subsets are arbitrarily chosen (Breiman, 1992; \nParmanto, Munro, and Doyle, 1995). Networks tend to converge to different weight \nstates, because the error surface itself depends on the training set; hence changing the data \nchanges the error surface. \n\n2.2 Auxiliary tasks \n\nAnother way to influence the networks to disagree is to introduce a second output unit \nwith a different task to each network in the committee. Thus, each network has two \noutputs, a primary unit which is trained to predict the class of the input, and a secondary \nunit, with some other task that is different than the tasks assigned to the secondary units \nof the other committee members. 
The success of this approach rests on the assumption \nthat the decorrelation of the network errors will more than compensate for any degradation \nof performance induced on the primary task by the auxiliary task. The presence of a \nhidden layer in each network guarantees that the two output response functions share \nsome weight parameters (i.e., the input-hidden weights), and so the learning of the \nsecondary task influences the function learned by the primary output unit. \n\nParmanto et al. (1994) achieved significant decorrelation and improved performance on a \nvariety of tasks using one of the input variables as the training signal for the secondary \nunit. Interestingly, the secondary task does not necessarily degrade performance on the \nprimary task. Our studies, as well as those of Caruana (1995), show that extra tasks can \nimprove learning time and generalization performance in an individual network. On the \nother hand, certain auxiliary tasks interfere with the primary task. We have found, \nhowever, that even when individual performance is degraded, committee performance \nis nevertheless enhanced (relative to a committee of single-output networks) due to the \nmagnitude of error decorrelation. \n\n3 THE COMPETITIVE COMMITTEE \n\nAn alternative to using a stationary task per se, such as replicating an input variable or \nprojecting onto principal components (as was done in Parmanto et al., 1994), is to use a \nsignal that depends on the other networks, in such a manner that the functions computed \nby the secondary units are negatively correlated after training. This notion is reminiscent \nof competitive learning (Rumelhart and Zipser, 1986); that is, the functions computed by \nthe secondary units will partition the stimulus space. \n\nThus, a Competitive Committee Machine (CCM) is defined as a committee of neural \nnetwork classifiers, each with two output units: a primary unit trained according to the \nclassification task, and a secondary unit participating in a competitive process with the \nsecondary units of the other networks in the committee; let the outputs of network i be \ndenoted Pi and Si, respectively (see Figure 1). The network weights are modified \naccording to the following variant of the backpropagation procedure. \n\nWhen data item α from the training set is presented to the committee during training, \nwith input vector x^α and known (binary) output classification value y^α, the networks \neach process x^α simultaneously, and the P and S output units of each network respond. \nEach P-unit receives the identical training signal, y^α, corresponding to the input item; \nthe training signal to the S-units is zero for all networks except the network with the \ngreatest S-unit response in the committee: the maximum Si among the networks \nreceives a training signal of 1, and the others receive a training signal of 0. The weight \nadjustments are driven by the errors \n\nδi^P = y^α - Pi    and    δi^S = Ti - Si ,    where Ti = 1 if Si = maxj Sj and Ti = 0 otherwise, \n\nwhich are the errors attributed to the primary and secondary units respectively, used \nto adjust network weights with backpropagation1. During the course of training, the S-\nunit's response is explicitly trained to become sensitive to a unique region (relative to the \nother networks' S-units) of the stimulus space. This training signal is different from \ntypical \"tasks\" that are used to train neural networks in that it is not a static function of \nthe input; instead, since it depends on the other networks in the committee, it has a \ndynamic quality. \n\n4 RESULTS \n\nSome experiments have been run using the sine wave classification task (Figure 2) of \nGeman, Bienenstock, and Doursat (1992). \n\nComparisons of CCM performance versus the baseline performance of a committee with \na simple average, over a range of architectures (as indicated by the number of hidden \nunits), are favorable (Figure 3). 
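As a rough sketch (ours, not the authors' implementation; the function and variable names are invented for illustration), the winner-take-all error assignment described in Section 3 can be written as:

```python
def ccm_errors(y, p_outputs, s_outputs):
    # Error assignment for one training item in a Competitive Committee
    # Machine (sketch). Every primary unit gets the same target y; the
    # secondary unit with the largest response gets target 1, the rest 0.
    winner = max(range(len(s_outputs)), key=lambda i: s_outputs[i])
    delta_p = [y - p for p in p_outputs]
    delta_s = [(1.0 if i == winner else 0.0) - s
               for i, s in enumerate(s_outputs)]
    return delta_p, delta_s
```

In a full implementation these deltas would be backpropagated through each member network; scaling delta_s by the 0.1 factor mentioned in the Discussion would simply multiply each secondary entry before the weight update.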
Also, note that the improvement is primarily attributable to \ndecreased correlation, since the average individual performance is not significantly \naffected. \n\nVisualization of the response of the individual networks over the entire stimulus space \ngives a complete picture of how the networks generalize and shows the effect of the \ncompetition (Figure 4). For this particular data set, the classes are easily separated in the \ncentral region (note that all the networks do well here). But at the edges, there is much \nmore variance among the networks trained with competitive secondary units (Figure 5). \n\n5 DISCUSSION \n\nCaruana (1995) has demonstrated significant improvement on \"target\" classification tasks \nin individual networks by adding one or more supplementary output units trained to \ncompute tasks related to the target task. The additional output unit added to each network \nin the CCM merges a variant of Rumelhart and Zipser's (1986) competitive learning \nprocedure with backpropagation, to form a novel hybrid of a supervised training technique \nwith an unsupervised method. The training signal delivered to the secondary unit under \nCCM is more direct than an arbitrary task, in that it is defined explicitly in terms of \ndissociating response properties. \n\n1 For notational convenience, the derivative factor sometimes included in the definition of \nδ is not included in this description of δP and δS. \n\n[Figure 1 diagram: K networks receive the same input variables; each produces a primary \noutput P and a secondary output S; the P responses are combined by average or vote into \nthe committee output; the training signal y^α is compared with each P response.] \n\nFigure 1: A Competitive Committee Machine. Each of the K networks receives the \nsame input and produces two outputs, P and S. The P responses of all the networks are \ncompared to a common training signal to compute an error value for backpropagation \n(dark dashed arrows); the P responses are combined (by vote or by sum) to determine the \ncommittee response. The S-unit responses are compared with each other, with the \n\"winner\" (highest response) receiving a training signal of 1, and the others receiving a \ntraining signal of 0. Thus the training signal for network i is computed by comparing all \nS-unit responses, and then fed back to the S-units, hence the two-way arrows (gray). \n\nNote that the training signals for the S-units differ from the P-unit training signals in \ntwo important respects: \n1. Not static: The signal depends on the S-unit responses from the other networks and \nhence changes during the course of training. \n2. Not uniform: It is not constant across the committee (whereas the P-unit training \nsignal is). \n\nFigure 2. A classification task. Training data (bottom) is sampled from a classification \ntask defined by a sinusoid (top) corrupted by noise (middle). \n\n[Figure 3 panels: Committee Performance, Individual Performance, Correlation, and \nPercent Improvement, each plotted against the number of hidden units (4-16) for the \nbaseline and CCM committees.] \n\nFigure 3. Performance of CCM. 
Committees of 5 networks were trained with competitive \nlearning (CCM) and without (baseline). Each data point is an average over 5 \nsimulations with different initial weights. \n\n[Figure 4 panels: Network #1 (Error: 10.20%); Network #2 (Error: 15.59%); Network #3 \n(Error: 15.25%); Network #4 (Error: 15.54%); Network #5 (Error: 12.65%); Committee \nOutput, Thresholded (Error: 11.64%).] \n\nFigure 4. Generalization plots for a committee. The level of gray indicates the response \nof each network of a committee trained without competition. The panel on the lower \nright shows the (thresholded) committee output. The average pairwise correlation of the \ncommittee is 0.91. \n\n[Figure 5 panels: Network #1 (Error: 10.21%); Network #2 (Error: 9.83%); Network #3 \n(Error: 16.83%); Network #4 (Error: 14.88%).] \n\nFigure 5. Generalization plots for a CCM committee. Comparison with Figure 4 shows \nmuch more variance among the committee at the edges. Note that the committee \nperforms much better near the right and left ends of the stimulus space than does any \nindividual network. This committee had an error rate of 8.11% (cf. 11.64% in the \nbaseline case). \n\nThe weighting of δS relative to δP is an important consideration; in the simulations \nabove, the signal from the secondary unit was arbitrarily multiplied by a factor of 0.1. \nWhile we have not yet examined this systematically, it is assumed that this factor will \nmodulate the tradeoff between degradation of the primary task and reduction of error \ncorrelation. \n\nReferences \n\nBates, J.M., and Granger, C.W. (1969) \"The combination of forecasts,\" Operational \nResearch Quarterly, 20(4), 451-468. \n\nBreiman, L. (1992) \"Stacked Regressions,\" TR 367, Dept. of Statistics, Univ. of Cal., \nBerkeley. 
\n\nCaruana, R. (1995) \"Learning many related tasks at the same time with backpropagation,\" \nIn: D. S. Touretzky (ed.), Advances in Neural Information Processing Systems 7. Morgan \nKaufmann. \n\nChauvin, Y. (1989) \"A backpropagation algorithm with optimal use of hidden units,\" In: \nD. Touretzky (ed.), Advances in Neural Information Processing Systems 1, Denver, 1988. \nMorgan Kaufmann. \n\nGeman, S., Bienenstock, E., and Doursat, R. (1992) \"Neural networks and the \nbias/variance dilemma,\" Neural Computation, 4, 1-58. \n\nHashem, S. (1994) Optimal Linear Combinations of Neural Networks, PhD Thesis, \nPurdue University. \n\nJacobs, R.A., Jordan, M.I., Nowlan, S.J., and Hinton, G.E. (1991) \"Adaptive mixtures \nof local experts,\" Neural Computation, 3, 79-87. \n\nLe Cun, Y., Denker, J., and Solla, S. (1990) \"Optimal Brain Damage,\" In: D. Touretzky \n(ed.), Advances in Neural Information Processing Systems 2, San Mateo: Morgan \nKaufmann, 598-605. \n\nMorgan, N., and Bourlard, H. (1990) \"Generalization and parameter estimation in \nfeedforward nets: some experiments,\" In: D. Touretzky (ed.), Advances in Neural \nInformation Processing Systems 2, San Mateo: Morgan Kaufmann. \n\nParmanto, B., Munro, P.W., Doyle, H.R., Doria, C., Aldrighetti, L., Marino, I.R., \nMitchel, S., and Fung, J.J. (1994) \"Neural network classifier for hepatoma detection,\" \nProceedings of the World Congress on Neural Networks. \n\nParmanto, B., Munro, P.W., and Doyle, H.R. (1996) \"Improving committee diagnosis \nwith resampling techniques,\" In: D. S. Touretzky, M. C. Mozer, M. E. Hasselmo (eds.), \nAdvances in Neural Information Processing Systems 8. MIT Press: Cambridge, MA. \n\nPerrone, M.P. (1993) \"Improving Regression Estimation: Averaging Methods for \nVariance Reduction with Extension to General Convex Measure Optimization,\" PhD \nThesis, Department of Physics, Brown University. \n\nRumelhart, D.E., and Zipser, D. (1986) \"Feature discovery by competitive learning,\" In: \nRumelhart, D.E., and McClelland, J.L. 
(Eds.), Parallel Distributed Processing: \nExplorations in the Microstructure of Cognition. MIT Press, Cambridge, MA. \n\nWolpert, D. (1992) \"Stacked generalization,\" Neural Networks, 5, 241-259. \n", "award": [], "sourceid": 1196, "authors": [{"given_name": "Paul", "family_name": "Munro", "institution": null}, {"given_name": "Bambang", "family_name": "Parmanto", "institution": null}]}