{"title": "Combining Estimators Using Non-Constant Weighting Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 419, "page_last": 426, "abstract": null, "full_text": "Combining Estimators Using \n\nNon-Constant Weighting Functions \n\nVolker Tresp*and Michiaki Taniguchi \n\nSiemens AG, Central Research \n\nOtto-Hahn-Ring 6 \n\n81730 Miinchen, Germany \n\nAbstract \n\nThis paper discusses the linearly weighted combination of estima(cid:173)\ntors in which the weighting functions are dependent on the input . \nWe show that the weighting functions can be derived either by \nevaluating the input dependent variance of each estimator or by \nestimating how likely it is that a given estimator has seen data in \nthe region of the input space close to the input pattern. The lat(cid:173)\nter solution is closely related to the mixture of experts approach \nand we show how learning rules for the mixture of experts can be \nderived from the theory about learning with missing features. The \npresented approaches are modular since the weighting functions \ncan easily be modified (no retraining) if more estimators are ad(cid:173)\nded. Furthermore, it is easy to incorporate estimators which were \nnot derived from data such as expert systems or algorithms. \n\n1 \n\nIntroduction \n\nInstead of modeling the global dependency between input x E ~D and output y E ~ \nusing a single estimator, it is often very useful to decompose a complex mapping \n\n-'\\.t the time of the research for this paper, a visiting researcher at the Center for \n\nBiological and Computational Learning, MIT. Volker.Tresp@zfe.siemens.de \n\n\f420 \n\nVolker Tresp, Michiaki Taniguchi \n\ninto simpler mappings in the form l \n\nM \n\nn(x) = L hi(X) \n\ni=l \n\n(1) \n\nThe weighting functions hi(X) act as soft switches for the modules N Ni(X). 
In the mixture of experts (Jacobs et al., 1991) the decomposition is learned in an unsupervised manner driven by the training data, and the main goal is a system which learns quickly. In other cases, the individual modules are trained individually and then combined using Equation 1. We can distinguish two motivations: first, in the work on averaging estimators (Perrone, 1993; Meir, 1994; Breiman, 1992) the modules are trained using identical data and the weighting functions are constant and, in the simplest case, all equal to one. The goal is to achieve improved estimates by averaging the errors of the individual modules. Second, a decomposition as described in Equation 1 might represent some "natural" decomposition of the problem, leading to more efficient representation and training (Hampshire and Waibel, 1989). A good example is a decomposition into analysis and action. h_i(x) might be the probability of disease i given the symptoms x, the latter consisting of a few dozen variables. The amount of medication the patient should take given disease i, on the other hand (represented by the output of module NN_i(x)), might only depend on a few inputs such as weight, gender and age.^2 Similarly, we might consider h_i(x) as the IF-part of a rule, evaluating the weight of the rule given x, and NN_i(x) as the conclusion or action which should be taken under rule i (compare Tresp, Hollatz and Ahmad, 1993). Equation 1 might also be the basis for biological models, considering for example the role of neural modulators in the brain. Nowlan and Sejnowski (1994) recently presented a biologically motivated filter selection model for visual motion in which modules provide estimates of the direction and amount of motion and weighting functions select the most reliable module.

In this paper we describe novel ways of designing the weighting functions.
Intuitively, the weighting functions should represent the competence or the certainty of a module, given the available information x. One possible measure is related to the number of training data that a module has seen in the neighborhood of x. Therefore, P(x|i), which is an estimate of the distribution of the input data which were used to train module i, is an obvious candidate as weighting function. Alternatively, the certainty a module assigns to its own prediction, represented by the inverse of the variance 1/var(NN_i(x)), is a plausible candidate for a weighting function. Both approaches seem to be flip-sides of the same coin, and indeed, we can show that both approaches are extremes of a unified approach.

^1 The hat stands for an estimated value.
^2 Note that we include the case that the weighting functions and the modules might explicitly only depend on different subsets of x.

Figure 1: (a) Two data sets (1: *, 2: o) and the underlying function (continuous). (b) The approximations of the two neural networks trained on the data sets (continuous: 1, dashed: 2). Note that the approximation of a network is only reliable in the regions of the input space in which it has "seen" data. (c) The weighting functions for variance-based weighting. (d) The approximation using variance-based weighting (continuous). The approximation is excellent, except to the very right. (e) The weighting functions for density-based weighting (Gaussian mixture approximation).
(f) The approximation using density-based weighting (continuous). In particular to the right, the extrapolation is better than in (d).

2 Variance-based Weighting

Here, we assume that the different modules NN_i(x) were trained with different data sets {(x_k^i, y_k^i)} but that they model identical input-output relationships (see Figure 1 a,b). To give a concrete example, this would correspond to the case that we trained two handwritten digit classifiers using different data sets and we want to use both for classifying new data.

If the errors of the individual modules are uncorrelated and unbiased,^3 the combined estimator is also unbiased and has the smallest variance if we select the weighting functions inversely proportional to the variance of the modules

h_i(x) = 1 / var(NN_i(x))    (2)

This can be shown using var(Σ_{i=1}^{M} g_i(x) NN_i(x)) = Σ_{i=1}^{M} g_i^2(x) var(NN_i(x)) and using Lagrange multipliers to enforce the constraint that Σ_i g_i(x) = 1. Intuitively, Equation 2 says that a module which is uncertain about its own prediction should also obtain a smaller weight. We estimate the variance from the training data as

var(NN_i(x)) ≈ (∂NN_i(x)/∂w)^T H_i^{-1} (∂NN_i(x)/∂w).

H_i is the Hessian, which can be approximated as (σ^2 is the output-noise variance, Tibshirani, 1994)

H_i ≈ (1/σ^2) Σ_k (∂NN_i(x_k)/∂w) (∂NN_i(x_k)/∂w)^T.

^3 The errors are uncorrelated since the modules were trained with different data; correlation and bias are discussed in Section 8.1.

3 Density-based Weighting

In particular if the different modules were trained with data sets from different regions of the input space, it might be a reasonable assumption that the different modules represent different input-output relationships. In terms of our example, this corresponds to the problem that we have two handwritten digit classifiers, one trained with American data and one with European data.
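As a concrete illustration of the variance-based rule of Equation 2, the combination at a single input x can be sketched in a few lines. This is a minimal sketch, not the authors' implementation; it assumes the per-module variance estimates var(NN_i(x)) are already available (e.g. from the Hessian-based estimate).

```python
import numpy as np

def combine_variance_based(preds, variances):
    """Combine module outputs at one input x with weights proportional
    to 1/var(NN_i(x)), normalized so the weights sum to one (Equation 2)."""
    preds = np.asarray(preds, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    w /= w.sum()                      # enforce sum_i g_i(x) = 1
    return float(np.dot(w, preds))
```

With equal variances this reduces to plain averaging; a module with larger variance is down-weighted accordingly.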
If the classifiers are used in an international setting, confusions are possible, since, for example, an American seven might be confused with a European one. Formally, we introduce an additional variable which is equal to zero if the writer is American and is equal to one if the writer is European. During recall, we don't know the state of that variable and we are formally faced with the problem of estimation with missing inputs. From previous work (Ahmad and Tresp, 1993) we know that we have to integrate over the unknown input weighted by the conditional probability of the unknown input given the known variables. In this case, this translates into Equation 1, where the weighting function is

h_i(x) = P(i|x).

In our example, P(i|x) would estimate the probability that the writer is American or European given the data.

Depending on the problem, P(i|x) might be estimated in different ways. If x represents continuous variables, we use a mixture of Gaussians model

P(x|i) = Σ_j κ_ij G(x; c_ij, Σ_ij)    (3)

where G(x; c_ij, Σ_ij) is our notation for a normal density centered at c_ij and with covariance Σ_ij.

Note that we have obtained a mixture of experts network with P(i|x) as gating network. A novel feature of our approach is that we maintain an estimate of the input data distribution (Equation 3), which is not modeled in the original mixture of experts network. This is advantageous if we have training data which are not assigned to a module (in the mixture of experts, no data are assigned), which corresponds to training with missing inputs (the missing input is the missing assignment), for which the solution is known (Tresp et al., 1994). If we use Gaussian mixtures to approximate P(x|i), we can use generalized EM learning rules for adaptation.
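The density-based gating computation can be sketched as follows. This is an illustrative sketch only, assuming one-dimensional inputs, scalar covariances, and per-module mixture components given as hypothetical (weight, center, variance) triples; P(i|x) follows from Bayes' rule applied to P(x|i) and the module priors P(i).

```python
import numpy as np

def gauss(x, mean, var):
    # 1-D normal density; stands in for the paper's G(x; c, Sigma)
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def density_weights(x, mixtures, priors):
    """P(i|x) proportional to P(x|i) P(i), where P(x|i) is a Gaussian
    mixture per module; mixtures[i] is a list of (kappa, center, var)."""
    px = np.array([sum(k * gauss(x, c, v) for k, c, v in comps)
                   for comps in mixtures])
    joint = px * np.asarray(priors, dtype=float)
    return joint / joint.sum()
```

A module whose training-input density is high near x receives nearly all of the weight, matching the intuition that it has "seen" data there.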
The adaptation of the parameters in the "gating network" which models P(x|i) is therefore somewhat simpler than in the original mixture of experts learning rules (see Section 8.2).

4 Unified Approach

In reality, the modules will often represent different mappings, but these mappings are not completely independent. Let's assume that we have an excellent American handwritten digit classifier but our European handwritten digit classifier is still very poor, since we only had few training data. We might want to take into account the results of the American classifier, even if we know that the writer was European. Mathematically, we can introduce a coupling between the modules. Let's assume that the prediction of the i-th module NN_i(x) = f_i(x) + ε_i is a noisy version of the true underlying relationship f_i(x) and that ε_i is independent Gaussian noise with variance var(NN_i(x)). Furthermore, we assume that the true underlying functions are coupled through a prior distribution (for simplicity we only assume two modules)

P(f_1(x), f_2(x)) ∝ exp(−(1/(2 var_c)) (f_1(x) − f_2(x))^2).

We obtain as best estimates

f̂_1(x) = (1/K(x)) [(var(NN_2(x)) + var_c) NN_1(x) + var(NN_1(x)) NN_2(x)]

f̂_2(x) = (1/K(x)) [var(NN_2(x)) NN_1(x) + (var(NN_1(x)) + var_c) NN_2(x)]

where

K(x) = var(NN_1(x)) + var(NN_2(x)) + var_c.

We use density-based weighting to combine the two estimates: ŷ(x) = P(1|x) f̂_1(x) + P(2|x) f̂_2(x). Note that if var_c → ∞ (no coupling) we obtain the density-based solution and for var_c → 0 (the mappings are forced to be identical) we obtain the variance-based solution. A generalization to more complex couplings can be found in Section 8.2.1.

5 Experiments

We tested our approaches using the Boston housing data set (13 inputs, one continuous output).
The training data set consisted of 170 samples which were divided into 20 groups using k-means clustering. The clusters were then divided randomly into two groups and two multi-layer perceptrons (MLP) were trained using those two data sets. Table 1 shows that the performances of the individual networks are pretty bad, which indicates that both networks have only acquired local knowledge with only limited extrapolation capability. Variance-based weighting gives considerably better performance, although density-based weighting and the unified approach are both slightly better. Considering the assumptions, variance-based weighting should be superior since the underlying mappings are identical. One problem might be that we assumed that the modules are unbiased, which might not be true in regions where a given module has seen no data.

Table 1: Generalization errors

NN_1     NN_2     variance-based   density-based   unified
0.6948   1.188    0.4821           0.4472          0.4235

6 Error-based Weighting

In most learning tasks only one data set is given and the task is to obtain optimal predictions. Perrone (1993) has shown that simply averaging the estimates of a small number (i.e. 10) of neural network estimators trained on the same training data set often gives better performance than the best estimator out of this ensemble. Alternatively, bootstrap samples of the original data set can be used for training (Breiman, personal communication). Instead of averaging, we propose that Equation 1, where

h_i(x) ∝ 1 / Res(NN_i(x)),

might give superior results (error-based weighting). Res(NN_i(x)) stands for an estimate of the input dependent residual squared error at x. As a simple approximation, Res(NN_i(x)) can be estimated by training a neural network with the residual squared errors of NN_i.
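Error-based weighting can be sketched end to end. This is a minimal sketch under stated assumptions: a Nadaraya-Watson kernel smoother stands in for the auxiliary network trained on squared residuals, and the modules are simple callables; all names here are hypothetical.

```python
import numpy as np

def fit_residual_model(xs, sq_errs, bandwidth=0.5):
    """Return a smoother estimating the squared residual Res(NN_i(x));
    a kernel smoother stands in for the auxiliary network of the text."""
    xs = np.asarray(xs, dtype=float)
    sq_errs = np.asarray(sq_errs, dtype=float)
    def res(x):
        k = np.exp(-0.5 * ((x - xs) / bandwidth) ** 2)
        return float((k * sq_errs).sum() / k.sum())
    return res

def error_based_combine(x, modules, res_models):
    """Equation 1 with weights inversely proportional to Res(NN_i(x))."""
    w = np.array([1.0 / max(r(x), 1e-12) for r in res_models])
    w /= w.sum()
    return float(sum(wi * m(x) for wi, m in zip(w, modules)))
```

A module whose estimated local residual error is small dominates the combination at that input, while a locally poor module is effectively switched off.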
Error-based weighting should be superior to simple averaging in particular if the estimators in the pool have different complexity. A more complex system would obtain larger weights in regions where the mapping is complex, since an estimator which is locally too simple has a large residual error, whereas in regions where the mapping is simple, both estimators have sufficient complexity, but the simpler one has less variance. In our experiments we only tried networks with the same complexity. Preliminary results indicate that variance-based weighting and error-based weighting are sometimes superior to simple averaging. The main reason seems to be that the local overfitting of a network is reflected in a large variance near that location in input space. The overfitting estimator therefore obtains a small weight in that region (compare the overfitting of network 1 in Figure 1b near x = 0 and the small weight of network 1 close to x = 0 in Figure 1c).

7 Conclusions

We have presented modular ways for combining estimators. The weighting functions of each module can be determined independently of the other modules such that additional modules can be added without retraining of the previous system. This can be a useful feature in the context of the problem of catastrophic forgetting: additional data can be used to train an additional module and the knowledge in the remaining modules is preserved. Also note that estimators which are not derived from data can be easily included if it is possible to estimate the input dependent certainty or competence of that estimator.

Acknowledgements: Valuable discussions with David Cohn, Michael Duff and Cesare Alippi are gratefully acknowledged.
The first author would like to thank the Center for Biological and Computational Learning (MIT) for providing an excellent research environment during the summer of 1994.

8 Appendix

8.1 Variance-based Weighting: Correlated Errors and Bias

We maintain that Σ_i h_i(x) = 1. In general (i.e. when the modules have seen the same data, or partially the same data), we cannot assume that the errors in the individual modules are independent. Let the M × M matrix Ω(x) be the covariance between the predictions of the modules NN_i(x). With h(x) = (h_1(x), ..., h_M(x))^T, the optimal weighting vector becomes

ĥ(x) = Ω^{-1}(x) u / n(x),    n(x) = u^T Ω^{-1}(x) u,

where u is the M-dimensional unit vector. If the individual modules are biased (bias_i(x) = E_D(NN_i(x)) − E(y|x)),^4 we form the M × M matrix B(x) with B_ij(x) = bias_i(x) bias_j(x), and the minimum variance solution is found for

ĥ(x) = (Ω(x) + B(x))^{-1} u / n(x),    n(x) = u^T (Ω(x) + B(x))^{-1} u.

8.2 Density-based Weighting: GEM-learning

Let's assume a training pattern (x_k, y_k) which is not associated with a particular module. If w^i is a parameter in network NN_i, the error gradient becomes

∂error_k / ∂w^i = −(y_k − NN_i(x_k)) P̂(i|x_k, y_k) ∂NN_i(x_k) / ∂w^i.

This equation can be derived from the solution to the problem of training with missing features (here: the true i is unknown; see Tresp, Ahmad and Neuneier, 1994). This corresponds also to the M-step in a generalized EM algorithm, where the E-step calculates

P̂(i|x_k, y_k) = P̂(y_k|x_k, i) P̂(x_k|i) P̂(i) / Σ_s P̂(y_k|x_k, s) P̂(x_k|s) P̂(s),

P̂(y_k|x_k, i) = G(y_k; NN_i(x_k), σ^2)

using the current parameters.

^4 E stands for the expected value; the expectation E_D is taken with respect to all data sets of the same size.
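The E-step posterior above can be sketched directly. This is an illustrative sketch assuming scalar outputs; the input densities P(x_k|i) and priors P(i) are taken as given rather than computed from the Gaussian mixtures.

```python
import numpy as np

def gauss(y, mean, var):
    # 1-D normal density G(y; mean, var)
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def e_step_posterior(yk, preds, px_given_i, priors, sigma2):
    """P(i|x_k, y_k) proportional to G(y_k; NN_i(x_k), sigma^2) P(x_k|i) P(i)."""
    joint = (np.array([gauss(yk, p, sigma2) for p in preds])
             * np.asarray(px_given_i, dtype=float)
             * np.asarray(priors, dtype=float))
    return joint / joint.sum()
```

The resulting posterior then scales each module's error gradient in the generalized M-step, so a module whose prediction explains (y_k) well in a region where it has seen data receives most of the credit.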
The M-step in the "gating network" P(x|i) is particularly simple using the well-known EM rules for Gaussian mixtures. Note that P(module = i, mixture component = j | x_k, y_k) needs to be calculated.

8.2.1 Unified Approach: Correlated Errors and General Coupling

Let's form the vectors NN(x) = (NN_1(x), ..., NN_M(x))^T and f(x) = (f_1(x), ..., f_M(x))^T. In a more general case, the prior coupling between the underlying functions is described by

P(f(x)) = G(f(x); g(x), Σ_g(x))

where g(x) = (g_1(x), ..., g_M(x))^T. Furthermore, in a more general case, the estimates are not independent,

P(NN(x)|f(x)) = G(NN(x); f(x), Σ_N(x)).

The minimum variance solution is now

f̂(x) = (Σ_g^{-1}(x) + Σ_N^{-1}(x))^{-1} (Σ_g^{-1}(x) g(x) + Σ_N^{-1}(x) NN(x)).

The equations in Section 4 are special cases with M = 2, g(x) = 0, Σ_g^{-1}(x) = (1/var_c) (1, −1) (1, −1)^T, Σ_N(x) = diag(var(NN_1(x)), var(NN_2(x))).

References

Ahmad, S. and Tresp, V. (1993). Some Solutions to the Missing Feature Problem in Vision. In S. J. Hanson, J. D. Cowan and C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann.

Breiman, L. (1992). Stacked Regression. Dept. of Statistics, Berkeley, TR No. 367.

Hampshire, J. and Waibel, A. (1989). The Meta-Pi Network: Building Distributed Knowledge Representations for Robust Pattern Recognition. TR CMU-CS-89-166, CMU, PA.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J. and Hinton, G. E. (1991). Adaptive Mixtures of Local Experts. Neural Computation, Vol. 3, pp. 79-87.

Meir, R. (1994). Bias, Variance and the Combination of Estimators: The Case of Linear Least Squares. TR, Dept. of Electrical Engineering, Technion, Haifa.

Nowlan, S. J. and Sejnowski, T. J. (1994). Filter Selection Model for Motion Segmentation and Velocity Integration. J. Opt. Soc. Am. A, Vol. 11, No. 12, pp. 1-24.

Perrone, M. P. (1993).
Improving Regression Estimates: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Brown University.

Tibshirani, R. (1994). A Comparison of Some Error Estimates for Neural Network Models. TR, Department of Statistics, University of Toronto.

Tresp, V., Ahmad, S. and Neuneier, R. (1994). Training Neural Networks with Deficient Data. In J. D. Cowan, G. Tesauro and J. Alspector (Eds.), Advances in Neural Information Processing Systems 6. San Mateo, CA: Morgan Kaufmann.

Tresp, V., Hollatz, J. and Ahmad, S. (1993). Network Structuring and Training Using Rule-based Knowledge. In S. J. Hanson, J. D. Cowan and C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5. San Mateo, CA: Morgan Kaufmann.
", "award": [], "sourceid": 971, "authors": [{"given_name": "Volker", "family_name": "Tresp", "institution": null}, {"given_name": "Michiaki", "family_name": "Taniguchi", "institution": null}]}