{"title": "Stacked Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 668, "page_last": 674, "abstract": null, "full_text": "Stacked Density Estimation \n\nPadhraic Smyth * \nInformation and Computer Science \nUniversity of California, Irvine \nCA 92697-3425 \nsmyth@ics.uci.edu \n\nDavid Wolpert \nNASA Ames Research Center \nCaelum Research \nMS 269-2, Mountain View, CA 94035 \ndhw@ptolemy.arc.nasa.gov \n\nAbstract \n\nIn this paper, the technique of stacking, previously used only for supervised learning, is applied to unsupervised learning. Specifically, it is used for non-parametric multivariate density estimation, to combine finite mixture model and kernel density estimators. Experimental results on both simulated and real-world data sets clearly demonstrate that stacked density estimation outperforms other strategies, such as choosing the single best model based on cross-validation, combining with uniform weights, and even choosing the single best model by \"cheating\": looking at the data used for independent testing. \n\n1 Introduction \n\nMultivariate probability density estimation is a fundamental problem in exploratory data analysis, statistical pattern recognition, and machine learning. One frequently estimates density functions for which there is little prior knowledge of the shape of the density and for which one wants a flexible and robust estimator (allowing multimodality if it exists). In this context, the methods of choice tend to be finite mixture models and kernel density estimation methods. For mixture modeling, mixtures of Gaussian components are frequently assumed, and model choice reduces to the problem of choosing the number k of Gaussian components in the model (Titterington, Smith and Makov, 1985). 
For kernel density estimation, kernel shapes are typically chosen from a selection of simple unimodal densities such as Gaussian, triangular, or Cauchy densities, and kernel bandwidths are selected in a data-driven manner (Silverman 1986; Scott 1994). \n\n* Also with the Jet Propulsion Laboratory 525-3660, California Institute of Technology, Pasadena, CA 91109. \n\nAs argued by Draper (1995), model uncertainty can contribute significantly to predictive error in estimation. While usually considered in the context of supervised learning, model uncertainty is also important in unsupervised learning applications such as density estimation. Even when the model class under consideration contains the true density, if we are only given a finite data set, then there is always a chance of selecting the wrong model. Moreover, even if the correct model is selected, there will typically be estimation error in the parameters of that model. These difficulties are summarized by writing \n\nP(f | D) = Σ_M ∫ dθ_M P(θ_M | D, M) P(M | D) f_{M,θ_M},   (1) \n\nwhere f is a density, D is the data set, M is a model, and θ_M is a set of values for the parameters of model M. The posterior probability P(M | D) reflects model uncertainty, and the posterior P(θ_M | D, M) reflects uncertainty in setting the parameters even once one knows the model. Note that if one is privy to P(M, θ_M), then Bayes' theorem allows us to write out both of our posteriors explicitly, so that we explicitly have P(f | D) (and therefore the Bayes-optimal density) given by a weighted average of the f_{M,θ_M}. (See also Escobar and West (1995).) However, even when we know P(M, θ_M), calculating the combining weights can be difficult. Thus, various approximations and sampling techniques are often used, a process that necessarily introduces extra error (Chickering and Heckerman 1997). 
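To make the difficulty concrete, one standard approximation (purely illustrative here; the paper does not commit to any particular scheme, and the function below is our own) replaces P(M | D) with BIC-based weights computed from each model's maximized log-likelihood and parameter count:

```python
import math

# Hypothetical sketch: approximate posterior model weights P(M | D) via
# the BIC approximation log P(D | M) ~ max log-likelihood - (k/2) log N.
# This illustrates one of the 'various approximations' mentioned in the
# text; it is not the authors' method.
def bic_weights(log_liks, n_params, n_data):
    bics = [ll - 0.5 * k * math.log(n_data)
            for ll, k in zip(log_liks, n_params)]
    top = max(bics)                           # subtract the max for stability
    unnorm = [math.exp(b - top) for b in bics]
    z = sum(unnorm)
    return [u / z for u in unnorm]            # normalized weights over models
```

Such weights are computed from the very data used to fit each model, so they inherit the approximation error discussed above; cross-validated combining weights sidestep this by scoring each model on held-out points.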
More generally, consider the case of mis-specified models, where the model class does not include the true model, so that our presumption for P(M, θ_M) is erroneous. In this case one should often again average. \n\nThus, a natural approach to improving density estimators is to consider empirically-driven combinations of multiple density models. There are several ways to do this, especially if one exploits previous combining work in supervised learning. For example, Ormoneit and Tresp (1996) have shown that \"bagging\" (uniformly weighting different parametrizations of the same model trained on different bootstrap samples), originally introduced for supervised learning (Breiman 1996a), can improve accuracy for mixtures of Gaussians with a fixed number of components. Another supervised learning technique for combining different types of models is \"stacking\" (Wolpert 1992), which has been found to be very effective for both regression and classification (e.g., Breiman (1996b)). This paper applies stacking to density estimation, in particular to combinations involving kernel density estimators together with finite mixture model estimators. \n\n2 Stacked Density Estimation \n\n2.1 Background on Density Estimation with Mixtures and Kernels \n\nConsider a set of d real-valued random variables X = {X_1, ..., X_d}. Upper-case symbols denote variable names (such as X_i) and lower-case symbols a particular value of a variable (such as x_i); x denotes a realization of the vector variable X. f(x) is shorthand for f(X = x) and represents the joint probability density of X. D = {x_1, ..., x_N} is a training data set, where each sample x_i, 1 ≤ i ≤ N, is drawn independently from the underlying density function f(x). \n\nA commonly used model for density estimation is the finite mixture model with k components, defined as \n\nf_k(x) = Σ_{i=1}^{k} α_i g_i(x),   (2) \n\nwhere Σ_{i=1}^{k} α_i = 1. The components g_i are usually relatively simple unimodal densities such as Gaussians. Density estimation with mixtures involves finding the locations, shapes, and weights of the component densities from the data (using, for example, the Expectation-Maximization (EM) procedure). Kernel density estimation can be viewed as a special case of mixture modeling in which a component is centered at each data point, given a weight of 1/N, and a common covariance structure (kernel shape) is estimated from the data. \n\nThe quality of a particular probabilistic model can be evaluated by an appropriate scoring rule on independent out-of-sample data, such as the test set log-likelihood (also referred to as the log-scoring rule in the Bayesian literature). Given a test data set D_test, the test log-likelihood is defined as \n\nlog f(D_test | f_k) = Σ_{x_i ∈ D_test} log f_k(x_i).   (3) \n\nThis quantity can play the role played by classification error in classification or squared error in regression. For example, cross-validated estimates of it can be used to find the best number of clusters to fit to a given data set (Smyth, 1996). \n\n2.2 Background on Stacking \n\nStacking can be used either to combine models or to improve a single model. In the former guise it proceeds as follows. First, subsamples of the training set are formed. Next, the models are all trained on one subsample, and their resultant joint predictive behavior on another subsample is observed, together with information concerning the optimal predictions on the elements of that other subsample. This is repeated for other pairs of subsamples of the training set. Then an additional (\"stacked\") model is trained to learn, from the subsample-based observations, the relationship between the observed joint predictive behavior of the models and the optimal predictions. 
Finally, this learned relationship is used in conjunction with the predictions of the individual models being combined (now trained on the entire data set) to determine the full system's predictions. \n\n2.3 Applying Stacking to Density Estimation \n\nConsider a set of M different density models f_m(x), 1 ≤ m ≤ M. In this paper each of these models will be either a finite mixture with a fixed number of component densities or a kernel density estimate with a fixed kernel and a single fixed global bandwidth in each dimension. (In general, though, no such restrictions are needed.) The procedure for stacking the M density models is as follows: \n\n1. Partition the training data set D v times, exactly as in v-fold cross-validation (we use v = 10 throughout this paper), and for each fold: \n(a) Fit each of the M models to the training portion of the partition of D. \n(b) Evaluate the likelihood of each data point in the test partition of D, for each of the M fitted models. \n\n2. After doing this one has M density estimates for each of the N data points, and therefore a matrix of size N x M, where entry (i, m) is f_m(x_i), the out-of-sample likelihood of the mth model on the ith data point. \n\n3. Use that matrix to estimate the combination coefficients {β_1, ..., β_M} that maximize the log-likelihood at the points x_i of a stacked density model of the form \n\nf_stacked(x) = Σ_{m=1}^{M} β_m f_m(x). \n\nSince this is itself a mixture model, but one in which the f_m(x_i) are fixed, the EM algorithm can be used to (easily) estimate the β_m. \n\n4. Finally, re-estimate the parameters of each of the M component density models using all of the training data D. The stacked density model is then the linear combination of those density models, with combining coefficients given by the β_m. 
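Step 3 above can be sketched directly: with the out-of-fold likelihoods f_m(x_i) held fixed, each EM iteration just computes responsibilities and averages them. A minimal sketch (function and variable names are our own, and a fixed iteration count stands in for a convergence test):

```python
# EM for the stacking coefficients beta_m of step 3.  lik[i][m] holds
# f_m(x_i), the out-of-fold likelihood of model m at data point i
# (the N x M matrix built in steps 1-2).
def stack_weights(lik, n_iter=500):
    n, m = len(lik), len(lik[0])
    beta = [1.0 / m] * m                              # start from uniform weights
    for _ in range(n_iter):
        counts = [0.0] * m
        for row in lik:
            mix = sum(b * f for b, f in zip(beta, row))   # f_stacked(x_i)
            for j in range(m):
                counts[j] += beta[j] * row[j] / mix       # E-step: responsibility
        beta = [c / n for c in counts]                # M-step: average them
    return beta
```

Each pass is a standard EM step for mixture weights, so the stacked log-likelihood never decreases; and since the objective is concave in the β_m, the iteration converges to the global maximizer on the simplex.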
\n\n3 Experimental Results \n\nIn our stacking experiments M = 6: three triangular kernels with bandwidths of 0.1, 0.4, and 1.5 times the standard deviation (of the full data set) in each dimension, and three Gaussian mixture models with k = 2, 4, and 8 components. This set of models was chosen to provide a reasonably diverse representational basis for stacking. We follow roughly the same experimental procedure as described in Breiman (1996b) for stacked regression: \n\n\u2022 Each data set is randomly split into training and test partitions 50 times, where the test partition is chosen to be large enough to provide reasonable estimates of out-of-sample log-likelihood. \n\n\u2022 The following techniques are run on each training partition: \n1. Stacking: the stacked combination of the six constituent models. \n2. Cross-Validation: the single best model, as indicated by the maximum likelihood score among the M = 6 single models in the N x M cross-validated table of likelihood scores. \n3. Uniform Weighting: a uniform average of the six models. \n4. \"Cheating\": the best single model, i.e., the model having the largest likelihood on the test data partition. \n5. Truth: the true model structure, if the true model is one of the six generating the data (only valid for simulated data). \n\n\u2022 The log-likelihoods of the models resulting from these techniques are calculated on the test data partition. The log-likelihood of a single Gaussian model (parameters determined on the training data) is subtracted from each model's log-likelihood to provide some normalization of scale. \n\n3.1 Results on Real Data Sets \n\nFour real data sets were chosen for experimental evaluation. The diabetes data set consists of 145 data points used in Gaussian clustering studies by Banfield and Raftery (1993) and others. Fisher's iris data set is a classic data set in 4 dimensions with 150 data points. 
Both of these data sets are thought to consist roughly of 3 clusters, which can be reasonably approximated by 3 Gaussians. The Barney and Peterson vowel data (2 dimensions, 639 data points) contains 10 distinct vowel sounds and so is highly multi-modal. The star-galaxy data (7 dimensions, 499 data points) contains non-Gaussian-looking structure in various 2d projections. \n\nTable 1 summarizes the results. In all cases stacking had the highest average log-likelihood, even out-performing \"cheating\" (the single best model chosen from the test data). (Breiman (1996b) also found for regression that stacking outperformed the \"cheating\" method.) \n\nTable 1: Relative performance of stacking multiple mixture models, for various data sets, measured (relative to the performance of a single Gaussian model) by mean log-likelihood on test data partitions. The maximum for each data set is attained by stacking. \n\nData Set      | Gaussian | Cross-Validation | \"Cheating\" | Uniform | Stacking \nDiabetes      |   -352.9 |             27.8 |       30.4 |    29.2 |     31.8 \nFisher's Iris |    -52.6 |             18.3 |       21.2 |    18.3 |     22.5 \nVowel         |    128.9 |             53.5 |       54.6 |    40.2 |     55.8 \nStar-Galaxy   |   -257.0 |            678.9 |      721.6 |   789.1 |    888.9 \n\nTable 2: Average across 20 runs of the stacked weights found for each constituent model. The columns with h = ... are for the triangular kernels and the columns with k = ... are for the Gaussian mixtures. \n\nData Set      | h=0.1 | h=0.4 | h=1.5 | k=2  | k=4  | k=8 \nDiabetes      |  0.01 |  0.09 |  0.03 | 0.13 | 0.41 | 0.32 \nFisher's Iris |  0.02 |  0.16 |  0.00 | 0.26 | 0.40 | 0.16 \nVowel         |  0.00 |  0.25 |  0.00 | 0.02 | 0.20 | 0.53 \nStar-Galaxy   |  0.00 |  0.04 |  0.03 | 0.03 | 0.27 | 0.62 \n\nWe considered two null hypotheses: stacking has the same predictive accuracy as cross-validation, and it has the same accuracy as uniform weighting. 
Each hypothesis can be rejected with a chance of less than 0.01% of being incorrect, according to the Wilcoxon signed-rank test; i.e., the observed differences in performance are extremely strong, even given the fact that this particular test is not strictly applicable in this situation. \n\nOn the vowel data set uniform weighting performs much worse than the other methods; it is closer in performance to stacking on the other 3 data sets. On three of the data sets, using cross-validation to select a single model is the worst method. \"Cheating\" is second-best to stacking except on the star-galaxy data, where it is also worse than uniform weighting; this may be because the star-galaxy data probably induces the greatest degree of mis-specification relative to this 6-model class (based on visual inspection). \n\nTable 2 shows the averages of the stacked weight vectors for each data set. The mixture components generally received higher weight than the triangular kernels. The vowel and star-galaxy data sets have more structure than can be represented by any of the component models, and this is reflected in the fact that for each, most weight is placed on the most complex mixture model, with k = 8. \n\n3.2 Results on Simulated Data with no Model Mis-Specification \n\nWe simulated data from a 2-dimensional 4-Gaussian mixture model with a reasonable degree of overlap (this is the data set used in Ripley (1994) with the class labels removed) and compared the same models and combining/selection schemes as before, except that \"truth\" is also included, i.e., the scheme which always selects the true model structure with k = 4 Gaussians. For each training sample size, 20 different training data sets were simulated, and the mean likelihood on an independent test data set of size 1000 was reported. 
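This protocol can be illustrated with a stripped-down one-dimensional analogue (the actual experiments use Ripley's two-dimensional data set; the mixture parameters below are invented for illustration):

```python
import math
import random
import statistics

# Draw n points from a 1-D Gaussian mixture given as [(weight, mean, sd), ...].
def sample_mixture(n, comps, rng):
    out = []
    for _ in range(n):
        r, acc = rng.random(), 0.0
        for w, mu, sd in comps:
            acc += w
            if r <= acc:
                out.append(rng.gauss(mu, sd))
                break
    return out

# Test-set log-likelihood under a mixture, as in Eq. (3).
def log_lik(data, comps):
    def pdf(x):
        return sum(w * math.exp(-0.5 * ((x - mu) / sd) ** 2)
                   / (sd * math.sqrt(2 * math.pi)) for w, mu, sd in comps)
    return sum(math.log(pdf(x)) for x in data)

# Invented bimodal truth; score it against a single Gaussian fit to the
# same sample, mirroring the normalization used in the experiments.
truth = [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]
rng = random.Random(0)
test = sample_mixture(1000, truth, rng)
single = [(1.0, statistics.mean(test), statistics.stdev(test))]
relative_ll = log_lik(test, truth) - log_lik(test, single)
```

Here relative_ll plays the role of the normalized score in the experiments: positive values mean the candidate density beats the single-Gaussian baseline on the independent test sample.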
\n\nFigure 1: Plot of mean log-likelihood (relative to a single Gaussian model) for various density estimation schemes on data simulated from a 4-component Gaussian mixture. [Plot not reproduced: test-set log-likelihood versus training sample size, with curves for \"cheating,\" stacking, uniform weighting, and the true k.] \n\nNote that here we are assured of having the true model in the set of models being considered, something that is presumably never exactly the case in the real world (and presumably was not the case for the experiments recounted in Table 1). Nonetheless, as indicated in Figure 1, stacking performed about the same as the \"cheating\" method and significantly outperformed the other methods, including \"truth.\" (Results where some of the methods had log-likelihoods lower than the single Gaussian are not shown, for clarity.) \n\nThe fact that \"truth\" performed poorly at the smaller sample sizes is due to the fact that with smaller sample sizes it was often better to fit a simpler model with reliable parameter estimates (which is what \"cheating\" typically would do) than a more complex model which may overfit (even when it is the true model structure). As the sample size increases, both \"truth\" and cross-validation approach the performance of \"cheating\" and stacking; uniform weighting is universally poorer, as one would expect when the true model is within the model class. 
The stacked weights at the different sample sizes (not shown) start out with significant weight on the triangular kernel models and gradually shift to the k = 2 Gaussian mixture model, and finally to the (true) k = 4 Gaussian model, as the sample size grows. Thus, stacking is seen to incur no penalty when the true model is within the model class being fit. In fact the opposite is true: for small sample sizes stacking outperforms other density estimation techniques which place full weight on a single (but poorly parametrized) model. \n\n4 Discussion and Conclusions \n\nSelecting a global bandwidth for kernel density estimation is still a topic of debate among statisticians. Stacking allows the possibility of side-stepping the issue of a single bandwidth by combining kernels with different bandwidths and different kernel shapes. A stacked combination of such kernel estimators is equivalent to using a single composite kernel that is a convex combination of the underlying kernels. For example, kernel estimators based on finite-support kernels can be regularized in a data-driven manner by combining them with infinite-support kernels. The key point is that the shape and width of the resulting \"effective\" kernel is driven by the data. \n\nIt is also worth noting that by combining Gaussian mixture models with different k values one gets a hierarchical \"mixture of mixtures\" model. This hierarchical model can provide a natural multi-scale representation of the data, which is clearly similar in spirit to wavelet density estimators, although the functional forms and estimation methodologies for each technique can be quite different. There is also a representational similarity to Jordan and Jacobs' (1994) \"mixture of experts\" model, where the weights are allowed to depend directly on the inputs. 
Exploiting that similarity, one direction for further work is to investigate adaptive weight parametrizations in the stacked density estimation context. \n\nAcknowledgements \n\nThe work of P.S. was supported in part by NSF Grant IRI-9703120 and in part by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. \n\nReferences \n\nBanfield, J. D., and Raftery, A. E., 'Model-based Gaussian and non-Gaussian clustering,' Biometrics, 49, 803-821, 1993. \nBreiman, L., 'Bagging predictors,' Machine Learning, 24(2), 123-140, 1996a. \nBreiman, L., 'Stacked regressions,' Machine Learning, 24, 49-64, 1996b. \nChickering, D. M., and Heckerman, D., 'Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables,' Machine Learning, in press. \nDraper, D., 'Assessment and propagation of model uncertainty (with discussion),' Journal of the Royal Statistical Society B, 57, 45-97, 1995. \nEscobar, M. D., and West, M., 'Bayesian density estimation and inference with mixtures,' J. Am. Stat. Assoc., 90, 577-588, 1995. \nJordan, M. I., and Jacobs, R. A., 'Hierarchical mixtures of experts and the EM algorithm,' Neural Computation, 6, 181-214, 1994. \nMadigan, D., and Raftery, A. E., 'Model selection and accounting for model uncertainty in graphical models using Occam's window,' J. Am. Stat. Assoc., 89, 1535-1546, 1994. \nOrmoneit, D., and Tresp, V., 'Improved Gaussian mixture density estimates using Bayesian penalty terms and network averaging,' in Advances in Neural Information Processing 8, 542-548, MIT Press, 1996. \nRipley, B. D., 'Neural networks and related methods for classification (with discussion),' J. Roy. Stat. Soc. B, 56, 409-456, 1994. 
\nSmyth, P., 'Clustering using Monte-Carlo cross-validation,' in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI Press, pp. 126-133, 1996. \nTitterington, D. M., Smith, A. F. M., and Makov, U. E., Statistical Analysis of Finite Mixture Distributions, Chichester, UK: John Wiley and Sons, 1985. \nWolpert, D., 'Stacked generalization,' Neural Networks, 5, 241-259, 1992. \n", "award": [], "sourceid": 1353, "authors": [{"given_name": "Padhraic", "family_name": "Smyth", "institution": null}, {"given_name": "David", "family_name": "Wolpert", "institution": null}]}