{"title": "Selecting Weighting Factors in Logarithmic Opinion Pools", "book": "Advances in Neural Information Processing Systems", "page_first": 266, "page_last": 272, "abstract": null, "full_text": "Selecting weighting factors in logarithmic \n\nopinion pools \n\nTom Heskes \n\nFoundation for Neural Networks, University of Nijmegen \n\nGeert Grooteplein 21, 6525 EZ Nijmegen, The Netherlands \n\ntom@mbfys.kun.nl \n\nAbstract \n\nA simple linear averaging of the outputs of several networks as \ne.g. in bagging [3], seems to follow naturally from a bias/variance \ndecomposition of the sum-squared error. The sum-squared error of \nthe average model is a quadratic function of the weighting factors \nassigned to the networks in the ensemble [7], suggesting a quadratic \nprogramming algorithm for finding the \"optimal\" weighting factors. \nIf we interpret the output of a network as a probability statement, \nthe sum-squared error corresponds to minus the loglikelihood or \nthe Kullback-Leibler divergence, and linear averaging of the out(cid:173)\nputs to logarithmic averaging of the probability statements: the \nlogarithmic opinion pool. \nThe crux of this paper is that this whole story about model aver(cid:173)\naging, bias/variance decompositions, and quadratic programming \nto find the optimal weighting factors, is not specific for the sum(cid:173)\nsquared error, but applies to the combination of probability state(cid:173)\nments of any kind in a logarithmic opinion pool, as long as the \nKullback-Leibler divergence plays the role of the error measure. As \nexamples we treat model averaging for classification models under \na cross-entropy error measure and models for estimating variances. \n\n1 \n\nINTRODUCTION \n\nIn many simulation studies it has been shown that combining the outputs of several \ntrained neural networks yields better results than relying on a single model. 
For regression problems, the most obvious combination seems to be a simple linear averaging of the network outputs. From a bias/variance decomposition of the sum-squared error it follows that the error of the resulting average model is always smaller than or equal to the average error of the individual models. In [7] simple linear averaging is generalized to weighted linear averaging, with different weighting factors for the different networks in the ensemble. A slightly more involved bias/variance decomposition suggests a rather straightforward procedure for finding \"optimal\" weighting factors. \n\nMinimizing the sum-squared error is equivalent to maximizing the loglikelihood of the training data under the assumption that a network output can be interpreted as an estimate of the mean of a Gaussian distribution with fixed variance. In these probabilistic terms, a linear averaging of network outputs corresponds to a logarithmic rather than linear averaging of probability statements. \n\nIn this paper, we generalize the regression case to the combination of probability statements of any kind. Using the Kullback-Leibler divergence as the error measure, we naturally arrive at the so-called logarithmic opinion pool. A bias/variance decomposition similar to the one for the sum-squared error then leads to an objective method for selecting weighting factors. \n\nSelecting weighting factors in any combination of probability statements is known to be a difficult problem for which several suggestions have been made. These suggestions range from rather involved supra-Bayesian methods to simple heuristics (see e.g. [1, 6] and references therein). The method that follows from our analysis is probably somewhere in the middle: easier to compute than the supra-Bayesian methods and more elegant than simple heuristics. 
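Before the formal treatment, the Gaussian correspondence sketched above can be checked numerically. The following minimal sketch is ours, not part of the paper; the means, variance, and weights are invented. It pools two fixed-variance Gaussian experts logarithmically (normalized weighted geometric mean of the densities) and confirms that the pooled mean is the weighted linear average of the expert means:

```python
import numpy as np

# Two Gaussian "experts" with the same variance but different means f_alpha.
# The logarithmic opinion pool is the normalized weighted geometric mean of
# the densities; for fixed-variance Gaussians it should again be a Gaussian
# whose mean is the weighted linear average of the expert means.
sigma = 0.7
means = np.array([1.0, 3.0])          # f_alpha(x) at some fixed input x (invented)
w = np.array([0.3, 0.7])              # weighting factors, sum to 1 (invented)

y = np.linspace(-5.0, 9.0, 20001)     # integration grid
dy = y[1] - y[0]

def gauss(y, mu, sigma):
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

experts = np.array([gauss(y, m, sigma) for m in means])

# Logarithmic opinion pool: product of densities raised to w_alpha, normalized.
pool = np.exp(np.sum(w[:, None] * np.log(experts), axis=0))
pool /= np.sum(pool) * dy

pooled_mean = np.sum(y * pool) * dy
print(pooled_mean, np.dot(w, means))  # the two numbers should agree
```

So, for this family, averaging probability statements logarithmically and averaging network outputs linearly are the same operation.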
\n\nTo stress the generality of our results, the presentation in the next section will be rather formal. Some examples will be given in Section 3. Section 4 discusses how the theory can be transformed into a practical procedure. \n\n2 LOGARITHMIC OPINION POOLS \n\nLet us consider the general problem of building a probability model of a variable y given a particular input x. The \"output\" y may be continuous, as for example in regression analysis, or discrete, as for example in classification. In the latter case integrals over y should be replaced by summations over all possible values of y. Both x and y may be vectors of several elements; the one-dimensional notation is chosen for convenience. We suppose that there is a \"true\" conditional probability model q(y|x) and have a whole ensemble (also called pool or committee) of experts, each supplying a probability model p_α(y|x). p(x) is the unconditional probability distribution of inputs. An unsupervised scenario, as for example treated in [8], is obtained if we simply neglect the inputs x or consider them constant. \n\nWe define the distance between the true probability q(y|x) and an estimate p(y|x) to be the Kullback-Leibler divergence \n\nK(q, p) ≡ -∫dx p(x) ∫dy q(y|x) log[p(y|x)/q(y|x)] . \n\nIf the densities p(x) and q(y|x) correspond to a data set containing a finite number P of combinations {x^μ, y^μ}, minus the Kullback-Leibler divergence is, up to an irrelevant constant, equivalent to the loglikelihood defined as \n\nL(p, {x, y}) ≡ (1/P) Σ_μ log p(y^μ|x^μ) . \n\nThe more formal use of the Kullback-Leibler divergence instead of the loglikelihood is convenient in the derivations that follow. \n\nWeighting factors w_α are introduced to indicate the reliability of each of the experts α. In the following we will work with the constraints Σ_α w_α = 1, which is used in some of the proofs, and w_α ≥ 
0 for all experts α, which is not strictly necessary, but makes it easier to interpret the weighting factors and helps to prevent overfitting when weighting factors are optimized (see details below). \n\nWe define the average model p̄(y|x) to be the one that is closest to the given set of models: \n\np̄(y|x) ≡ argmin_{p(y|x)} Σ_α w_α K(p, p_α) . \n\nIntroducing a Lagrange multiplier for the constraint ∫dy p(y|x) = 1, we immediately find the solution \n\np̄(y|x) = (1/Z(x)) Π_α [p_α(y|x)]^{w_α} (1) \n\nwith normalization constant \n\nZ(x) = ∫dy Π_α [p_α(y|x)]^{w_α} . (2) \n\nThis is the logarithmic opinion pool, to be contrasted with the linear opinion pool, which is a linear average of the probabilities. In fact, logarithmic opinion pools have been proposed to overcome some of the weaknesses of the linear opinion pool. For example, the logarithmic opinion pool is \"externally Bayesian\", i.e., can be derived from joint probabilities using Bayes' rule [2]. A drawback of the logarithmic opinion pool is that if any of the experts assigns probability zero to a particular outcome, the complete pool assigns probability zero, no matter what the other experts claim. This property of the logarithmic opinion pool, however, is only a drawback if the individual density functions are not carefully estimated. The main problem for both linear and logarithmic opinion pools is how to choose the weighting factors w_α. \n\nThe Kullback-Leibler divergence of the opinion pool p̄(y|x) can be decomposed into a term containing the Kullback-Leibler divergences of individual models and an \"ambiguity\" term: \n\nK(q, p̄) = Σ_α w_α K(q, p_α) - A , with ambiguity A ≡ Σ_α w_α K(p̄, p_α) . (3) \n\nProof: The first term in (3) follows immediately from the numerator in (1); the second term is minus the logarithm of the normalization constant Z(x) in (2), which can, using (1), be rewritten as \n\n-log Z(x) = Σ_α w_α log[p̄(y'|x)/p_α(y'|x)] \n\nfor any choice of y' for which p̄(y'|x) is nonzero. 
Integration over y' with probability measure p̄(y'|x) then yields (3). \n\nSince the ambiguity A is always larger than or equal to zero, we conclude that the Kullback-Leibler divergence of the logarithmic opinion pool is never larger than the average of the Kullback-Leibler divergences of the individual experts. The larger the ambiguity, the larger the benefit of combining the experts' probability assessments. Note that by using Jensen's inequality, it is also possible to show that the Kullback-Leibler divergence of the linear opinion pool is smaller than or equal to the average of the Kullback-Leibler divergences of the individual experts. The expression for the corresponding ambiguity, defined as the difference between these two, is much more involved and more difficult to interpret (see e.g. [10]). \n\nThe ambiguity of the logarithmic opinion pool depends on the weighting factors w_α, not only directly as expressed in (3), but also through p̄(y|x). We can make this dependency somewhat more explicit by writing \n\nA = (1/2) Σ_{α,β} w_α w_β K(p_α, p_β) + (1/2) Σ_α w_α [K(p̄, p_α) - K(p_α, p̄)] . (4) \n\nProof: Equation (3) is valid for any choice of q(y|x). Substitute q(y|x) = p_β(y|x), multiply left- and righthand side by w_β, and sum over β. Simple manipulation of terms then yields the result. \n\nAlas, the Kullback-Leibler divergence is not necessarily symmetric, i.e., in general K(p₁, p₂) ≠ K(p₂, p₁). However, the difference K(p₁, p₂) - K(p₂, p₁) is an order of magnitude smaller than the divergence K(p₁, p₂) itself. More formally, writing p₁(y|x) = [1 + ε(y|x)] p₂(y|x) with ε(y|x) small, we can easily show that K(p₁, p₂) is of order (some integral over) ε²(y|x) whereas K(p₁, p₂) - K(p₂, p₁) is of order ε³(y|x). 
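This order argument is easy to illustrate with a small numeric experiment (our sketch, with an invented discrete distribution and perturbation, not part of the paper): halving ε should shrink K(p₁, p₂) by roughly a factor of four, and the asymmetry K(p₁, p₂) - K(p₂, p₁) by roughly a factor of eight:

```python
import numpy as np

# Write p1 = (1 + eps * v) * p2 with a perturbation v chosen so that p1
# stays normalized. Then K(p1, p2) should shrink like eps^2, while the
# asymmetry K(p1, p2) - K(p2, p1) should shrink like eps^3.
p2 = np.array([0.1, 0.2, 0.3, 0.4])   # invented base distribution
v = np.array([1.0, -1.0, 1.0, -0.5])  # invented perturbation direction
v -= np.dot(p2, v)                    # enforce sum(p2 * v) = 0, so p1 sums to 1

def kl(p, q):
    # Discrete Kullback-Leibler divergence K(p, q) = sum p log(p/q).
    return np.sum(p * np.log(p / q))

for eps in (0.05, 0.025):
    p1 = p2 * (1.0 + eps * v)
    print(eps, kl(p1, p2), kl(p1, p2) - kl(p2, p1))
```

The printed divergences drop by close to 4 between the two rows, the printed asymmetries by close to 8, confirming the ε² versus ε³ scaling (up to higher-order corrections).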
\nTherefore, if we have reason to assume that the different models are reasonably close together, we can, in a first approximation, and will, to make things tractable, neglect the second term in (4) to arrive at \n\nK(q, p̄) ≈ Σ_α w_α K(q, p_α) - (1/4) Σ_{α,β} w_α w_β [K(p_α, p_β) + K(p_β, p_α)] . (5) \n\nThe righthand side of this expression is quadratic in the weighting factors w_α, a property which will be very convenient later on. \n\n3 EXAMPLES \n\nRegression. The usual assumption in regression analysis is that the output functionally depends on the input x, but is blurred by Gaussian noise with standard deviation σ. In other words, the probability model of an expert α can be written \n\np_α(y|x) = (1/√(2πσ²)) exp[-(y - f_α(x))²/(2σ²)] . (6) \n\nThe function f_α(x) corresponds to the network's estimate of the \"true\" regression given input x. The logarithmic opinion pool (1) also leads to a normal distribution with the same standard deviation σ and with regression estimate \n\nf̄(x) = Σ_α w_α f_α(x) . \n\nIn this case the Kullback-Leibler divergence \n\nK(p_α, p_β) = (1/(2σ²)) ∫dx p(x) [f_α(x) - f_β(x)]² \n\nis symmetric, which makes (5) exact instead of an approximation. In [7], this has all been derived starting from a sum-squared error measure. \n\nVariance estimation. There has been some recent interest in using neural networks not only to estimate the mean of the target distribution, but also its variance (see e.g. [9] and references therein). In fact, one can use the probability density (6) with input-dependent σ(x). We will consider the simpler situation in which an input-dependent model is fitted to residuals y, after a regression model has been fitted to estimate the mean (see also [5]). The probability model of expert α can be written \n\np_α(y|x) = √(z_α(x)/(2π)) exp[-z_α(x) y²/2] , \n\nwhere 1/z_α(x) is the expert's estimate of the residual variance given input x. 
The logarithmic opinion pool is of the same form with z_α(x) replaced by \n\nz̄(x) = Σ_α w_α z_α(x) . \n\nHere the Kullback-Leibler divergence \n\nK(p_α, p̄) = (1/2) ∫dx p(x) [z̄(x)/z_α(x) - log(z̄(x)/z_α(x)) - 1] \n\nis asymmetric. We can use (3) to write the Kullback-Leibler divergence of the opinion pool explicitly in terms of the weighting factors w_α. The approximation (5), with \n\nK(p_α, p_β) + K(p_β, p_α) = (1/2) ∫dx p(x) [z_α(x) - z_β(x)]²/(z_α(x) z_β(x)) , \n\nis much more appealing and easier to handle. \n\nClassification. In a two-class classification problem, we can treat y as a discrete variable having two possible realizations, e.g., y ∈ {-1, 1}. A convenient representation for a properly normalized probability distribution is \n\np_α(y|x) = 1/(1 + exp[-2 h_α(x) y]) . \n\nIn this logistic representation, the logarithmic opinion pool has the same form with \n\nh̄(x) = Σ_α w_α h_α(x) . \n\nThe Kullback-Leibler divergence is asymmetric, but yields the simpler symmetrized form \n\nK(p_α, p_β) + K(p_β, p_α) = ∫dx p(x) [h_α(x) - h_β(x)] [tanh h_α(x) - tanh h_β(x)] \n\nto be used in the approximation (5). For a finite set of patterns, minus the loglikelihood yields the well-known cross-entropy error. \n\nThe probability models in these three examples are part of the exponential family. The mean f_α, inverse variance z_α, and logit h_α are the canonical parameters. It is straightforward to show that, with constant dispersion across the various experts, the canonical parameter of the logarithmic opinion pool is always a weighted average of the canonical parameters of the individual experts. Slightly more complicated expressions arise when the experts are allowed to have different estimates for the dispersion or for probability models that do not belong to the exponential family. 
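The canonical-parameter averaging above can also be verified numerically. The sketch below is ours (the precisions and weights are invented, not from the paper); it pools three zero-mean Gaussian experts from the variance-estimation example on a grid and confirms that the pooled density has precision z̄ = Σ_α w_α z_α, i.e. variance 1/z̄:

```python
import numpy as np

# Pool zero-mean Gaussians with precisions z_alpha logarithmically and
# check that the pooled variance equals 1 / (sum_alpha w_alpha * z_alpha).
z = np.array([0.5, 2.0, 4.0])         # expert precisions 1/variance (invented)
w = np.array([0.2, 0.5, 0.3])         # weighting factors, sum to 1 (invented)

y = np.linspace(-12.0, 12.0, 40001)   # integration grid
dy = y[1] - y[0]

densities = np.sqrt(z[:, None] / (2 * np.pi)) * np.exp(-z[:, None] * y ** 2 / 2)

# Logarithmic opinion pool: normalized weighted geometric mean of densities.
pool = np.exp(np.sum(w[:, None] * np.log(densities), axis=0))
pool /= np.sum(pool) * dy

pooled_var = np.sum(y ** 2 * pool) * dy
print(pooled_var, 1.0 / np.dot(w, z))  # the two numbers should agree
```

The same grid check works for the regression means and, with two-point sums instead of integrals, for the classification logits.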
\n\n4 SELECTING WEIGHTING FACTORS \n\nThe decomposition (3) and approximation (5) suggest an objective method for selecting weighting factors in logarithmic opinion pools. We will sketch this method for an ensemble of models belonging to the same class, say feedforward neural networks with a fixed number of hidden units, where each model is optimized on a different bootstrap replicate of the available data set. \n\nSuppose that we have available a data set consisting of P combinations {x^μ, y^μ}. As suggested in [3], we construct different models by training them on different bootstrap replicates of the available data set. Optimizing nonlinear models is often an unstable process: small differences in initial parameter settings or two almost equivalent bootstrap replicates can result in completely different models. Neural networks, for example, are notorious for local minima and plateaus in weight space where models might get stuck. Therefore, the incorporation of weighting factors, even when models are constructed using the same procedure, can yield a better generalizing opinion pool. In [4] good results have been reported on several regression problems. Balancing clearly outperformed bagging, which corresponds to w_α = 1/n with n the number of experts, and bumping, which proposes to keep a single expert. \n\nEach example in the available data set can be viewed as a realization of an unknown probability density characterized by p(x) and q(y|x). We would like to choose the weighting factors w_α such as to minimize the Kullback-Leibler divergence K(q, p̄) of the opinion pool. If we accept the approximation (5), we can compute the optimal weighting factors once we know the individual divergences K(q, p_α) and the divergences between different models K(p_α, p_β). Of course, both q(y|x) and p(x) are unknown, and thus we have to settle for estimates. 
\n\nIn an estimate for K(p_α, p_β) we can simply replace the average over p(x) by an average over all inputs x^μ observed in the data set: \n\nK̂(p_α, p_β) = (1/P) Σ_μ ∫dy p_α(y|x^μ) log[p_α(y|x^μ)/p_β(y|x^μ)] . \n\nA similar straightforward replacement for q(y|x) in an estimate for K(q, p_α) is biased, since each expert has, at least to some extent, been overfitted on the data set. In [4] we suggest how to remove this bias for regression models minimizing sum-squared errors. Similar compensations can be found for other probability models. \n\nHaving estimates for both the individual Kullback-Leibler divergences K(q, p_α) and the cross terms K(p_α, p_β), we can optimize for the weighting factors w_α. Under the constraints Σ_α w_α = 1 and w_α ≥ 0 the approximation (5) leads to a quadratic programming problem. Without this approximation, optimizing the weighting factors becomes a nasty exercise in nonlinear programming. \n\nThe solution of the quadratic programming problem usually ends up at the edge of the unit cube with many weighting factors equal to zero. On the one hand, this is a beneficial property, since it implies that we only have to keep a relatively small number of models for later processing. On the other hand, the obtained weighting factors may depend too strongly on our estimates of the individual divergences K(q, p_α). The following version prevents this type of overfitting. Using simple statistics, we obtain a rough indication of the accuracy of our estimates K(q, p_α). This we use to generate several, say on the order of 20, different samples with estimates {K(q, p₁), ..., K(q, p_n)}. For each of these samples we solve the corresponding quadratic programming problem and obtain a set of weighting factors. The final weighting factors are obtained by averaging. In the end, there are fewer experts with zero weighting factors, with the advantage of a more robust procedure. 
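The constrained minimization above can be sketched in a few lines. The following is our illustration, not the paper's implementation: the divergence estimates are invented numbers, and the quadratic program is solved with a simple projected-gradient loop in place of a dedicated QP solver:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

# Invented estimates for three experts:
#   k[a]    ~ estimate of K(q, p_a)
#   S[a, b] ~ estimate of K(p_a, p_b) + K(p_b, p_a)  (symmetric, zero diagonal)
k = np.array([0.30, 0.32, 0.45])
S = np.array([[0.00, 0.20, 0.10],
              [0.20, 0.00, 0.30],
              [0.10, 0.30, 0.00]])

def objective(w):
    # Approximation (5): sum_a w_a K(q, p_a) - (1/4) sum_ab w_a w_b S_ab
    return np.dot(w, k) - 0.25 * np.dot(w, S @ w)

w = np.full(len(k), 1.0 / len(k))   # start from bagging: equal weights
lr = 0.1
for _ in range(2000):
    grad = k - 0.5 * (S @ w)        # gradient of the quadratic objective
    w = project_simplex(w - lr * grad)

print(w, objective(w))
```

With these invented numbers the solution lands on an edge of the simplex: the third, weakest expert receives weight zero, illustrating the sparse edge solutions discussed above.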
\n\nAcknowledgements \n\nI would like to thank David Tax, Bert Kappen, Pierre van de Laar, Wim Wiegerinck, and the anonymous referees for helpful suggestions. This research was supported by the Technology Foundation STW, applied science division of NWO and the technology programme of the Ministry of Economic Affairs. \n\nReferences \n\n[1] J. Benediktsson and P. Swain. Consensus theoretic classification methods. IEEE Transactions on Systems, Man, and Cybernetics, 22:688-704, 1992. \n\n[2] R. Bordley. A multiplicative formula for aggregating probability assessments. Management Science, 28:1137-1148, 1982. \n\n[3] L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996. \n\n[4] T. Heskes. Balancing between bagging and bumping. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 466-472, Cambridge, 1997. MIT Press. \n\n[5] T. Heskes. Practical confidence and prediction intervals. In M. Mozer, M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 176-182, Cambridge, 1997. MIT Press. \n\n[6] R. Jacobs. Methods for combining experts' probability assessments. Neural Computation, 7:867-888, 1995. \n\n[7] A. Krogh and J. Vedelsby. Neural network ensembles, cross validation, and active learning. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, pages 231-238, Cambridge, 1995. MIT Press. \n\n[8] P. Smyth and D. Wolpert. Stacked density estimation. These proceedings, 1998. \n\n[9] P. Williams. Using neural networks to model conditional multivariate densities. Neural Computation, 8:843-854, 1996. \n\n[10] D. Wolpert. On bias plus variance. Neural Computation, 9:1211-1243, 1997. \n", "award": [], "sourceid": 1413, "authors": [{"given_name": "Tom", "family_name": "Heskes", "institution": null}]}