{"title": "Accurate Uncertainty Estimation and Decomposition in Ensemble Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 8952, "page_last": 8963, "abstract": "Ensemble learning is a standard approach to building machine learning systems that capture complex phenomena in real-world data. An important aspect of these systems is the complete and valid quantification of model uncertainty. We introduce a Bayesian nonparametric ensemble (BNE) approach that augments an existing ensemble model to account for different sources of model uncertainty. BNE augments a model's prediction and distribution functions using Bayesian nonparametric machinery. It has a theoretical guarantee in that it robustly estimates the uncertainty patterns in the data distribution, and can decompose its overall predictive uncertainty into distinct components that are due to different sources of noise and error. We show that our method achieves accurate uncertainty estimates under complex observational noise, and illustrate its real-world utility in terms of uncertainty decomposition and model bias detection for an ensemble predicting air pollution exposures in Eastern Massachusetts, USA.", "full_text": "Accurate Uncertainty Estimation and Decomposition in Ensemble Learning

Jeremiah Zhe Liu* (Google Research & Harvard University, zhl112@mail.harvard.edu), Marianthi-Anna Kioumourtzoglou (Columbia University, mk3961@cumc.columbia.edu), John Paisley (Columbia University, jpaisley@columbia.edu), Brent A. Coull (Harvard University, bcoull@hsph.harvard.edu)

Abstract

Ensemble learning is a standard approach to building machine learning systems that capture complex phenomena in real-world data. An important aspect of these systems is the complete and valid quantification of model uncertainty.
We introduce a Bayesian nonparametric ensemble (BNE) approach that augments an existing ensemble model to account for different sources of model uncertainty. BNE augments a model's prediction and distribution functions using Bayesian nonparametric machinery. It has a theoretical guarantee in that it robustly estimates the uncertainty patterns in the data distribution, and can decompose its overall predictive uncertainty into distinct components that are due to different sources of noise and error. We show that our method achieves accurate uncertainty estimates under complex observational noise, and illustrate its real-world utility in terms of uncertainty decomposition and model bias detection for an ensemble predicting air pollution exposures in Eastern Massachusetts, USA.

1 Introduction

Ensemble learning has a long history in areas such as robust engineering system design [4], financial investment management [20], and weather and climate forecasting [35], where high-risk decisions and critical projections are made in the presence of noise and uncertainty. Failure to accurately quantify the predictive uncertainty in these ensemble systems can lead to severe consequences [1], such as the market crash of 2008.

To properly quantify predictive uncertainty, it is important for an ensemble learning system to recognize the different types of uncertainty that arise in the modeling process. In machine learning modeling, two distinct types of uncertainty exist: aleatoric uncertainty and epistemic uncertainty [22] (see Figure 1). Aleatoric uncertainty arises due to the stochastic variability inherent in the data generating process, for example due to an imperfect sensor, and is described by the cumulative distribution function (CDF) F(y|x, Θ) of the data specified by a given model. On the other hand, epistemic uncertainty arises due to our lack of knowledge about the data generating mechanism.
[Figure 1: A decomposition of different types of uncertainty by the Bayesian Nonparametric Ensemble (BNE). Overall Uncertainty splits into Epistemic, comprising Parametric (ω) and Structural (i.e. Model Misspecification: Prediction Function δ, Distribution Function G), and Aleatoric.]

*Work done at Harvard University.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

A model's epistemic uncertainty can be reduced by collecting more data, whereas aleatoric uncertainty is irreducible since it is inherent to the data generating mechanism. A machine learning model's epistemic uncertainty can arise from two sources [42]: parametric uncertainty, which reflects the uncertainty associated with estimating the model parameters under the current model specification and can be described by a Bayesian model's posterior p(Θ|y, x); and structural uncertainty, which reflects the uncertainty about whether a given model specification is sufficient for describing the data, i.e. whether there exists a systematic discrepancy between the CDF F(y|x, Θ) based on the model and the data-generating distribution F*(y|x).

The goal of uncertainty estimation is to properly characterize both a model's aleatoric and epistemic uncertainties [24, 42]. In regions that are well represented by the training data, a model's aleatoric uncertainty should accurately estimate the data-generating distribution by flexibly capturing the stochastic pattern in the data (i.e., calibration [19]), while in regions unexplored by the training data, the model's epistemic uncertainty should increase to capture the model's lack of confidence in the resulting predictions (i.e. out-of-distribution generalization [24]).
Within the epistemic uncertainty, the structural uncertainty needs to be estimated to identify the sources of structural biases in the ensemble model, and to quantify how these structural biases may impact the model output, something necessary for the continuous model validation and refinement of a running ensemble system [40, 34].

A comprehensive framework for quantifying these three types of uncertainty is currently lacking in the ensemble learning literature. We refer readers to Supplementary Section A for a full review of how our work relates to the existing literature. Briefly, existing methods typically handle aleatoric uncertainty using an assumed distribution family (e.g., Gaussian) [24, 48] that may not capture the stochastic patterns in the data (e.g. asymmetry, heavy-tailedness, multimodality, or their combinations). Work exists on quantifying epistemic uncertainty, although ensemble methods mainly work with collections of base models of the same class, and usually do not explicitly characterize the model's structural uncertainty [6, 9, 10, 50, 24, 51, 27].

In this work, we develop an ensemble model that addresses all three sources of predictive uncertainty. Our specific contributions are: 1) We propose the Bayesian Nonparametric Ensemble (BNE), an augmentation framework that mitigates misspecification in the original ensemble model and flexibly quantifies all three sources of predictive uncertainty (Section 2). 2) We establish BNE's model properties in uncertainty characterization, including its theoretical guarantee with respect to consistent estimation of aleatoric uncertainty, and its ability to decompose different sources of epistemic uncertainty (Section 3).
3) We demonstrate through experiments that the proposed method achieves accurate uncertainty estimation under complex observational noise and improves predictive accuracy (Section 4), and illustrate our method by predicting ambient fine particle pollution in Eastern Massachusetts, USA by ensembling three different existing prediction models developed by multiple research groups (Section 5).

2 Bayesian Nonparametric Ensemble

In this section, we introduce the Bayesian Nonparametric Ensemble (BNE), an augmentation framework for ensemble learning. We focus on the application of BNE to regression tasks. Given an ensemble model, BNE mitigates the original model's misspecification in the prediction function and in the distribution function using Bayesian nonparametric machinery. As a result, BNE enables an ensemble to flexibly quantify the aleatoric uncertainty in the data, and to account for both the parametric and the structural uncertainties.

We build the full BNE model by starting from the classic ensemble model. Denote by F*(y|x) the CDF of the data-generating distribution for a continuous outcome. Given an observation pair {x, y} ∈ R^p × R, where y ~ F*(y|x), and a set of base model predictors {f_k}_{k=1}^K, a classic ensemble model assumes the form

Y = \sum_{k=1}^K f_k(x) ω_k + ε,   (1)

where ω = {ω_k}_{k=1}^K are the ensemble weights assigned to each base model, and ε is a random variable describing the distribution of the outcome. For simplicity of exposition, in the rest of this section we assume ω and ε follow independent Gaussian priors, which corresponds to a classic stacking model assuming a Gaussian outcome [10].

In practice, given a set of predictors {f_k}_{k=1}^K built by domain experts, a practitioner needs to first specify a distribution family for ε (e.g.
Gaussian such that ε ~ N(0, σ_ε)), then estimate ω and ε using collected data. During this process, two types of model bias can arise: bias in the prediction function µ = \sum_{k=1}^K f_k(x) ω_k, caused by systematic bias shared among all the base predictors f_k; and bias in the distribution specification, caused by assuming a distribution family for ε that fails to capture the stochastic pattern in the data, producing inaccurate estimates of aleatoric uncertainty. BNE mitigates these two types of bias in (1) using Bayesian nonparametric machinery.

Mitigate prediction bias using residual process δ. To mitigate the model's structural bias in prediction, BNE first adds to (1) a flexible residual process δ(x), so the ensemble model becomes a semiparametric model [11, 39]:

Y = \sum_{k=1}^K f_k(x) ω_k + δ(x) + ε.   (2)

In this work, we model δ(x) nonparametrically using a Gaussian process (GP) with zero mean function 0(x) = 0 and kernel function k_δ(x, x'). The residual process δ(x) adds additional flexibility to the model's mean function E(Y|x), and domain experts can select a flexible kernel for δ to best approximate the data-generating function of interest (e.g., an RBF kernel to approximate arbitrary continuous functions over a compact support [33]). As a result, in densely-sampled regions that are well captured by the training data, δ(x) will confidently mitigate the prediction bias between the observation y and the prediction function \sum_{k=1}^K f_k(x) ω_k.
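The shrinkage behavior of the residual process δ can be illustrated with a small numpy sketch. This is our toy illustration, not the paper's implementation: an RBF kernel with fixed hyperparameters stands in for k_δ, and the ensemble residuals y − Σ_k f_k(x)ω_k are simulated directly. The GP posterior mean for δ tracks the residuals where data are dense and reverts to the zero prior mean far from the data, while its posterior variance grows back toward the prior variance.

```python
import numpy as np

# Toy sketch of the residual process delta(x) in Eq. (2).
# Assumptions (ours): RBF kernel, length-scale 0.3, noise sd 0.05.
def rbf(a, b, ls=0.3):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

rng = np.random.default_rng(0)
x_tr = np.linspace(0.0, 1.0, 20)                       # densely sampled region
resid = np.sin(4 * x_tr) + 0.05 * rng.standard_normal(20)  # simulated y - sum_k f_k w_k

K = rbf(x_tr, x_tr) + 0.05 ** 2 * np.eye(20)
alpha = np.linalg.solve(K, resid)

def delta_post_mean(x_new):
    # GP posterior mean of delta at new locations
    return rbf(np.atleast_1d(x_new), x_tr) @ alpha

def delta_post_var(x_new):
    # GP posterior variance of delta at new locations
    xs = np.atleast_1d(x_new)
    k_star = rbf(xs, x_tr)
    return rbf(xs, xs) - k_star @ np.linalg.solve(K, k_star.T)

near = float(delta_post_mean(0.5)[0])         # inside the data: tracks sin(4 * 0.5)
far = float(delta_post_mean(5.0)[0])          # far from data: shrinks toward 0
var_near = float(delta_post_var(0.5)[0, 0])   # small: data constrains delta here
var_far = float(delta_post_var(5.0)[0, 0])    # reverts to the prior variance 1
```

The two limits mirror the text: `far` is numerically zero (the original ensemble is left intact out of sample), while `var_far` returns to the prior variance, signaling increased structural uncertainty.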
However, in sparsely-sampled regions, the posterior mean of δ(x) will be shrunk back towards 0(x) = 0, so as to leave the predictions of the original ensemble (1) intact (since these expert-built base models presumably have been specially designed for the problem being considered), and the posterior uncertainty of δ(x) will be larger to reflect the model's increased structural uncertainty in its prediction function at location x.

We recommend selecting k_δ from the shift-invariant kernel family k(x, x') = g(x − x'). Shift-invariant kernels are well suited for characterizing a model's epistemic uncertainty, since the resulting predictive variances are explicitly characterized by the distance from the training data, which yields predictive uncertainty that increases as the prediction location of interest moves farther away from the data [36].

We write the model CDF of (2) as Φ_ε(y|x, µ). In the case ε ~ N(0, σ_ε^2), Φ_ε is a Gaussian CDF with mean µ and variance σ_ε^2. Notice that since δ(x) is a Gaussian process, (2) specifies Y as a hierarchical Gaussian process with mean function \sum_{k=1}^K f_k(x) ω_k and kernel function k_δ(x, x') + σ_ε^2.

Mitigate distribution bias using calibration function G. Although flexible in its mean prediction, the model in (2) can still be restrictive in its distributional assumptions. That is, at a given location x ∈ R^p, because the model corresponds to a Gaussian process specification for Y, the posterior of (2) still follows a Gaussian distribution [36].
Consequently, when the data distribution is multi-modal, non-symmetric, or heavy-tailed, the model in (2) can still fail to capture the underlying data-generating distribution F*(y|x), resulting in a systematic discrepancy between Φ_ε(y|x, µ) and F*(y|x).

To mitigate this bias in the specification of the data distribution, BNE further augments Φ_ε(y|x, µ) using a nonparametric function G to "calibrate" the model's distributional assumption using the observed data z = {y, x}, i.e., BNE models its CDF as F(y|x, µ) = G[Φ_ε(y|x, µ)]. As a result, the full BNE model's CDF is a flexible nonparametric function capable of modeling a wide range of complex distributions. In this work, we model G using a Gaussian process with identity mean function I(x) = x and kernel function k_G, and we impose probit-based likelihood constraints on G so it respects the mathematical properties of a CDF (i.e. monotonic and bounded between [0, 1]; see Section B for detail). As a result, the full BNE model's CDF follows a constrained Gaussian process (CGP) [29, 30, 38]:

F(y|x, µ) ~ CGP(Φ_ε(y|x, µ), k_G(z, z')),   (3)

where z = {y, x}. In this work, we set k_G to the Matérn 3/2 kernel k_{Matérn 3/2}(d) = (1 + \sqrt{3} d / l) exp(−\sqrt{3} d / l), where d = ||x − x'||_2. The sample space of a Matérn 3/2 Gaussian process corresponds to the space of Hölder continuous functions that are at least once differentiable, allowing F to flexibly model the space of (Lipschitz) continuous CDFs F*(y|x) whose probability density functions (PDFs) exist [46]. Consequently, in regions well represented by the training data, the BNE's model CDF will
In regions outside the training data, the\nBNE\u2019s model CDF will fall back to \u03a6\u03b5 (y|x, \u00b5), not interfering with the generalization behavior of\nthe original ensemble model. Additionally, the posterior uncertainty in (3) will re\ufb02ect the model\u2019s\nadditional structural uncertainty with respect to its distribution speci\ufb01cation.\n\nFigure 2: Illustrative example illustrating impact of G on model\u2019s posterior predictive distribution.\nDashed Line: True Distribution F\u2217(y|x), Black Ticks: Observations, Red Shade: predictive density of\n\u2211k fk\u03c9k, Grey Shade: predictive density of \u03a6\u03b5 (y|x, \u00b5) (Gaussian assumption), Blue Shade: predictive\ndensity of G\u25e6 \u03a6\u03b5 (y|x, \u00b5) (nonparametric noise correction).\nTo further illustrate the role G plays in the BNE\u2019s ability to \ufb02exibly characterize an outcome distribu-\ntion, we consider an illustrative example where we run the BNE model both with and without G to\npredict y at a \ufb01xed location of x (i.e. estimating the conditional distribution F\u2217(y|x) at \ufb01xed location\nof x) where y|x \u223c Gamma(1.5,2) (Figure 2). As shown, the posterior distribution of \u03a6\u03b5 (y|x, \u00b5) (grey\nshade) fails to capture the skewness in the data\u2019s empirical distribution, and consequently yields a\nbiased maximum a posterior (MAP) estimate due to its restrictive distributional assumptions. On\n\nthe other hand, the full BNE model F(y|x, \u00b5) = G(cid:2)\u03a6\u03b5 (y|x, \u00b5)(cid:3) is able to calibrate its predictive\n\ndistribution (blue shade) toward the data distribution using G, and consequently produces improved\ncharacterization of F\u2217(y|x) and improved MAP estimate.\n\nModel Summary To recap, given a classic ensemble model (1), BNE nonparametrically augments\nthe model\u2019s prediction function with a residual process \u03b4 , and augments the model\u2019s distribution\nfunction with a calibration function G. 
Specifically, for data y|x that is generated from the distribution F*(y|x), the full BNE assumes the following model:

F*(y|x) = G[Φ_ε(y|x, µ)],   µ = \sum_{k=1}^K f_k(x) ω_k + δ(x).   (4)

The priors are defined to be

G ~ CGP(I, k_G),   δ ~ GP(0, k_δ),   ω ~ N(0, σ_ω^2 I),

where k_G is the Matérn 3/2 kernel, and k_δ is a shift-invariant kernel to be chosen by the domain expert (we set it to Matérn 3/2 in this work). The zero-mean GP ensures the ensemble bias term δ reverts to zero out of sample, while the identity-mean GP allows the noise process to revert to white Gaussian noise ε out of sample. In other words, this prior structure allows BNE to flexibly capture the data distribution where data exist, and revert to the classic ensemble otherwise.

BNE's hyperparameters are the Matérn length-scale parameters l_δ and l_G, and the prior variances σ_ω and σ_ε. Consistent with existing GP approaches, we place inverse-Gamma priors on l_δ and l_G and Half-Normal priors on σ_ω and σ_ε [43]. Posterior sampling is performed using Hamiltonian Monte Carlo (HMC) [2], for which we pre-orthogonalize kernel matrices with respect to their mean functions to avoid parameter non-identifiability [31, 37]. The time complexity for sampling from the BNE posterior is O(N^3) due to the need to invert the N × N kernel matrices. For large datasets, we can consider the parallel MCMC scheme proposed in [26], which partitions the data into K subsets and estimates the predictive intervals with reduced complexity O(N^3 / K^2).
Section C describes posterior inference in further detail.

3 Characterizing Model Uncertainties with BNE

3.1 Mitigating Model Bias under Uncertainty

In this section we study the contribution of BNE's model components to an ensemble's prediction and predictive uncertainty estimation. For a model with predictive CDF F(y|x), we notice that the model's predictive behavior is completely characterized by F(y|x): a model's predictive mean is expressed as E(y|x) = \int_{y ∈ R} [I(y > 0) − F(y|x)] dy, and a model's q × 100% predictive interval is expressed as U_q(y|x) = [F^{-1}((1 − q)/2 | x), F^{-1}((1 + q)/2 | x)] [12]. Consequently, BNE improves upon an ensemble model's prediction and uncertainty estimation by building a flexible model for F that better captures the data-generating F*(y|x).

Bias Correction for Prediction and Uncertainty Estimation. We can express the predictive mean of BNE as:

E(y|x, ω, δ, G) = \sum_{k=1}^K f_k(x) ω_k + \underbrace{δ(x)}_{\text{due to } δ} + \underbrace{\int_{y ∈ Y} \big[ Φ(y|x, µ) − G[Φ(y|x, µ)] \big] dy}_{\text{due to } G}.   (5)

See Supplementary E for the derivation. As shown, the predictive mean of the full BNE is composed of three parts: 1) the predictive mean of the original ensemble \sum_{k=1}^K f_k(x) ω_k; 2) the term δ representing BNE's "direct correction" to the prediction function; and 3) the term \int [Φ(y|x, µ) − G[Φ(y|x, µ)]] dy representing BNE's "indirect correction" to the prediction, obtained upon relaxing the original Gaussian assumption in the model CDF.
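The mean-from-CDF identity used above can be checked numerically. A minimal sketch (our check, not from the paper), using a Gaussian CDF with mean 2 so the integral should recover that mean:

```python
from statistics import NormalDist
import numpy as np

# Numerical check of E[y] = \int [I(y > 0) - F(y)] dy for F = N(2, 1).
nd = NormalDist(mu=2.0, sigma=1.0)
grid = np.linspace(-30.0, 30.0, 120001)   # fine grid; tails beyond +/-30 are negligible
dy = grid[1] - grid[0]
integrand = (grid > 0).astype(float) - np.array([nd.cdf(t) for t in grid])
mean_est = float(integrand.sum() * dy)    # close to 2.0
```

The same identity is what lets the correction terms due to δ and G in (5) be read directly off the CDF.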
We denote these two error-correction terms as D_δ(y|x) and D_G(y|x).

To express BNE's estimated predictive uncertainty, we denote by Φ_{ε,ω} the predictive CDF of the original ensemble (1) (i.e. with mean \sum_k f_k ω_k and variance σ_ε^2). Then BNE's predictive interval is:

U_q(y|x, ω, δ, G) = \Big[ Φ_{ε,ω}^{-1}\big(G^{-1}(\tfrac{1-q}{2}) \,\big|\, x\big) + δ(x),  Φ_{ε,ω}^{-1}\big(G^{-1}(\tfrac{1+q}{2}) \,\big|\, x\big) + δ(x) \Big].   (6)

Comparing (6) to the predictive interval of the original ensemble [Φ_{ε,ω}^{-1}(\tfrac{1-q}{2}), Φ_{ε,ω}^{-1}(\tfrac{1+q}{2})], we see that the locations of the BNE predictive interval endpoints are adjusted by the residual process δ, while the spread of the predictive interval (i.e. the predictive uncertainty) is calibrated by G.

Quantifying Uncertainty in Bias Correction. A salient feature of BNE is that it can quantify its uncertainty in bias correction. This is because the bias-correction terms D_δ and D_G are random quantities that have posterior distributions (since they are functions of δ and G). Specifically, we can quantify the posterior uncertainty in whether D_δ and D_G are different from zero by estimating P(D_δ(y|x) > 0) and P(D_G(y|x) > 0), i.e., the percentiles of 0 in the posterior distributions of D_δ and D_G. Values close to 0 or 1 indicate strong evidence that model bias impacts model prediction. Values close to 0.5 indicate a lack of evidence of this impact, since the posterior distributions of these error-correction terms are roughly centered around zero. This approach can be generalized to describe the impact of the distribution biases on other properties of the predictive distribution (e.g. skewness, multi-modality, etc.
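Equation (6) can be sketched numerically. In this toy version (ours, not the paper's), G^{-1} is an assumed monotone map p^{1.3} standing in for the CGP posterior of the calibration function, and δ(x) = 0.7 is an assumed residual correction; the point is only that δ shifts the interval while G reshapes its spread.

```python
from statistics import NormalDist

# Sketch of the predictive interval in Eq. (6), central 90% (q = 0.9).
nd = NormalDist(mu=0.0, sigma=1.0)   # Phi_{eps, omega} of the original ensemble
q = 0.9
delta_x = 0.7                        # assumed residual correction delta(x)
G_inv = lambda p: p ** 1.3           # assumed toy calibration map, stands in for G^{-1}

# Original ensemble interval: plain Gaussian quantiles.
lo0, hi0 = nd.inv_cdf((1 - q) / 2), nd.inv_cdf((1 + q) / 2)

# BNE interval: quantile levels pass through G^{-1}, endpoints shift by delta.
lo = nd.inv_cdf(G_inv((1 - q) / 2)) + delta_x
hi = nd.inv_cdf(G_inv((1 + q) / 2)) + delta_x
```

Here the calibrated interval is both relocated (by δ) and asymmetrically widened (by G^{-1}), exactly the two effects the comparison after (6) describes.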
see Section E for detail).

3.2 Consistent Estimation of Aleatoric Uncertainty

Recall that a model characterizes the aleatoric uncertainty in the data through its model CDF. As is clear from the expression for the predictive interval U_q(y|x) = [F^{-1}((1 − q)/2 | x), F^{-1}((1 + q)/2 | x)], for a model to reliably estimate its predictive uncertainty, the model CDF F should be estimated to be consistent with the data-generating CDF F*(y|x), such that, for example, the 95% predictive interval U_{0.95}(y|x) indeed contains the observations y ~ F*(y|x) 95% of the time. This consistency property is known in the probabilistic forecasting literature as calibration [19], and defines a mathematically rigorous condition for a model to achieve reliable estimation of its predictive uncertainty. To this end, using the flexible calibration function G, BNE enables its model CDF to consistently capture the data-generating F*(y|x):

Theorem 1 (Posterior Consistency). Let F = G[Φ] be a realization of the CGP prior defined in (3). Suppose that the true data-generating CDF F*(y|x) is contained in the support of F. Given {y_i, x_i}_{i=1}^n, a random sample from F*(y|x), denote the expectation with respect to F* as E* and denote the posterior distribution as Π_n. There exists a sequence ε_n → 0 and a sufficiently large M such that

E^* Π_n\big( ||F^* − F||_2 ≥ M ε_n \,\big|\, \{y_i, x_i\}_{i=1}^n \big) → 0.

We defer the full proof to Section D. This result states that, as the sample size grows, the BNE posterior distribution of F concentrates around the true data-generating CDF F*, and therefore consistently captures the aleatoric uncertainty in the data distribution.
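The calibration property referenced above has a direct empirical check (our sketch, independent of BNE): when the model CDF coincides with the data-generating CDF, the nominal 95% interval covers roughly 95% of fresh observations.

```python
from statistics import NormalDist
import numpy as np

# Empirical coverage check: the model CDF below equals F*(y|x) by construction.
rng = np.random.default_rng(3)
nd = NormalDist(mu=1.0, sigma=2.0)
lo, hi = nd.inv_cdf(0.025), nd.inv_cdf(0.975)     # nominal 95% interval

y = rng.normal(1.0, 2.0, size=200000)             # fresh draws from F*(y|x)
coverage = float(((y >= lo) & (y <= hi)).mean())  # close to 0.95
```

A miscalibrated F (e.g. too-narrow Gaussian tails under skewed data) would fail this check, which is the failure mode G is introduced to repair.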
By setting k_G to the Matérn 3/2 kernel, the prior support of BNE is large and contains the space of compactly supported, Lipschitz continuous F*'s whose PDFs exist [5, 46]. The convergence speed of the posterior F depends both on the distance of F* relative to the prior distribution, and on how closely the smoothness of the Matérn prior matches the smoothness of F* [44, 45]. To this end, BNE improves its speed of convergence by centering F's prior mean at Φ(y|x, ω) and by estimating the kernel hyperparameter l_G adaptively through an inverse-Gamma prior.

3.3 Uncertainty Decomposition

For an ensemble model that is augmented by BNE, the goal of uncertainty decomposition is to understand how different sources of uncertainty combine to impact the ensemble model's predictive distribution, and to distinguish the contribution of each source in driving the overall predictive uncertainty. As shown in Figure 1, the posterior uncertainty in each of BNE's model parameters {ω, δ, G} accounts for an important source of model uncertainty. Consequently, both the aleatoric and epistemic uncertainties are quantified by the BNE's posterior distribution, and can be distinguished through a careful decomposition of the model posterior.

We first show how to separate the aleatoric and epistemic uncertainties in BNE's posterior predictive distribution. Consistent with existing approaches, we use entropy to measure the overall uncertainty in a model's predictive distribution: H(y|x, θ) = −\int_{y ∈ Y} f(y|θ) log f(y|θ) dy [18, 32]. Entropy measures the average amount of information contained in a distribution, and reduces to a function of the variance when the distribution is Gaussian. Given a posterior distribution of the model parameters p(ω, δ, G), we separate the aleatoric and epistemic uncertainties in the ensemble model's predictive distribution p(y|x) = \int f(y|x, G, δ, ω) dp(ω, δ, G) as [15]:

H(y|x) = \underbrace{I\big((ω, δ, G), y \,\big|\, x\big)}_{\text{epistemic}} + \underbrace{E_{G,δ,ω}\big[ H(y \,|\, x, G, δ, ω) \big]}_{\text{aleatoric}},   (7)

where the second term measures the model's aleatoric uncertainty (i.e. it describes the noise patterns inherent to y) by computing the expected entropy of the model distribution f(y|x, G, δ, ω), averaged over the model's posterior belief about {G, δ, ω}. The first term is the mutual information between p(ω, δ, G) and p(y|x), and measures the model's overall epistemic uncertainty (both parametric and structural) encoded in the joint posterior p(ω, δ, G) [14, 18].

We now show how to separate the overall epistemic uncertainty I((ω, δ, G), y | x) into its parametric and structural components. This further decomposition is important in understanding how the ensemble model's predictive uncertainty changes by accounting for the fact that its prediction and distribution functions may be misspecified. Specifically, recall that an ensemble model's parametric uncertainty is the uncertainty about the ensemble weights under the current model specification (i.e. by assuming δ = 0, G = I). Therefore the model's parametric uncertainty is encoded in the conditional posterior p(ω | δ = 0, G = I) and can be measured by the conditional mutual information I(ω, y | x, δ = 0, G = I).
The model's structural uncertainty contains two components: (1) uncertainty about the prediction function (accounted for by δ) and (2) uncertainty about the distribution function (accounted for by G). The first component describes the model's additional uncertainty about ω and δ under the current distributional assumption (i.e. by assuming G = I), which is encoded in the difference between p(ω, δ | G = I) and p(ω | δ = 0, G = I). The second component describes the model's additional uncertainty about ω, δ and G when the distributional assumption is also relaxed, which is encoded in the difference between p(ω, δ, G) and p(ω, δ | G = I). By measuring these additional uncertainties using differences between mutual informations, we decompose the overall epistemic uncertainty as:

I\big((ω, δ, G), y \,\big|\, x\big) = \underbrace{I((ω, δ, G), y | x) − I((ω, δ), y | x, G = I)}_{\text{structural}, G} + \underbrace{I((ω, δ), y | x, G = I) − I(ω, y | x, δ = 0, G = I)}_{\text{structural}, δ} + \underbrace{I(ω, y | x, δ = 0, G = I)}_{\text{parametric}},

where I(θ, y | θ') = \int f(θ, y | θ') \log \frac{f(θ, y | θ')}{f(θ | θ') f(y | θ')} dθ \, dy denotes the conditional mutual information. All three uncertainty terms in the above expression are non-negative (see Section F.1). Computing these uncertainty terms is straightforward since, under BNE, p(ω | δ = 0, G = I) and p(ω, δ | G = I) both have closed forms, which correspond to the posteriors of (1) and (2), respectively.
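The aleatoric/epistemic split in (7) can be simulated in a toy setting (our sketch, not the paper's model): posterior draws of a predictive mean each induce a Gaussian predictive distribution, so aleatoric uncertainty is the average per-draw entropy, and epistemic uncertainty is the mutual information, i.e. the entropy of the posterior-predictive mixture minus that average.

```python
import numpy as np

# Monte Carlo sketch of Eq. (7). Assumed toy posterior: 200 draws of the
# predictive mean, each inducing a Gaussian N(mu_i, sd) over y.
rng = np.random.default_rng(2)
mus = rng.normal(0.0, 1.0, size=200)   # posterior draws of the predictive mean
sd = 0.5                               # shared aleatoric scale

def gauss_entropy(s):
    return 0.5 * np.log(2 * np.pi * np.e * s ** 2)

# Aleatoric term: E_theta[H(y | theta)]; identical across draws here.
aleatoric = float(gauss_entropy(sd))

# Total entropy H(y | x) of the predictive mixture, by Monte Carlo over y.
ys = rng.normal(mus[rng.integers(0, 200, size=20000)], sd)
dens = np.exp(-0.5 * ((ys[:, None] - mus[None, :]) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
total = float(-np.mean(np.log(dens.mean(axis=1))))

# Epistemic term: mutual information I(theta, y | x) = H(y|x) - E[H(y|theta)].
epistemic = total - aleatoric
```

Shrinking the spread of `mus` (collecting more data) drives `epistemic` toward zero while leaving `aleatoric` fixed, matching the reducible/irreducible distinction from Section 1.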
We present an example of such a decomposition in the air pollution application (Section 5).

4 Experiments

This section reports an in-depth validation of the proposed method on a nonlinear function approximation task with complex (heterogeneous and heavy-tailed) observation noise. We illustrate the method's ability in uncertainty decomposition and bias detection by visualizing the decomposition of the model's predictive distribution into its aleatoric, parametric, and structural components, and also visualize the impact of model bias on the model's output distribution using the method described in Section 3.1. We then interrogate the method's operating characteristics in prediction (RMSE to the true E*(y|x)) and uncertainty quantification (L1 distance to the true F*(y|x)), and these metrics' convergence behavior with respect to increasing sample sizes. We consider a time series problem with heteroscedastic noise, with varying degrees of skewness in P(y|x) and with imbalanced sampling probability in x (see Figure 3). The detailed experiment settings are documented in Supplementary G.

Figure 3: First Column: (a) Uncertainty decomposition in the original ensemble; (d) Posterior confidence in 3a's bias in predictive mean due to prediction function misspecification. Second Column: (b) Uncertainty decomposition in the BNE model without G; (e) Posterior confidence in 3b's bias in predictive mean and variance due to distribution misspecification. Third Column: (c) Uncertainty decomposition in the full BNE model; (f) Data generation mechanism. Blue Line: Posterior Mean Prediction.
Shaded Region: 90% posterior credible intervals for the model's parametric, structural and aleatoric uncertainties.

Uncertainty Quantification and Decomposition. Figure 3 visually illustrates the role each model component plays in BNE's ability in prediction (predictive mean) and uncertainty quantification (90% predictive intervals), and furthermore, how the structural uncertainty encoded in δ and G is used to diagnose the impact of model bias on the predictive distribution. We run the original Bayesian ensemble (i.e. BNE without δ and G), the Bayesian Additive Ensemble (BAE) (i.e. BNE without G), and the full BNE model on 100 data points (red dots in Figures 3a-3e). As shown, the original ensemble model (Figure 3a), restricted by its parametric assumption, produces a predictive distribution that fails to capture the observations even in the training set. The BAE (Figure 3b) improves on the original ensemble by mitigating the systematic bias in prediction, and helps the model better account for its epistemic uncertainty by increasing the predictive uncertainty at locations where data are scarce. However, BAE's aleatoric uncertainty is still biased in that it is roughly constant throughout the range of x, failing to account for the heterogeneity in the observation variance, a pattern that is evident in the data's empirical distribution. Finally, the full BNE model (Figure 3c) flexibly transforms its predictive distribution to better capture the empirical distribution of the data. As a result, it is able to properly account for the heterogeneity in the observation noise, and at the same time produces improved predictions.
Figures 3d and 3e quantify the impact of the original ensemble's model bias on the model's predictive mean and variance (see Section G for further description).

Operating Characteristics We benchmark BNE against its ablated version (BAE) and also against classic and recent nonparametric and ensemble methods: the classic Kernel Conditional Distribution Estimator (CondKDE), which fits the conditional distribution nonparametrically using kernel estimators with cross-validated bandwidth [28]; the Bayesian Mixture of Experts (BME), which combines the predictive distributions adaptively using softmax-transformed Gaussian weights as ∑k πk(x)φ(y|fk, σk) [50]; the recent Bayesian Stacking (stack) [51], which uses non-adaptive weights ∑k πk φ(y|fk, σk) but calibrates πk using leave-one-out cross-validation; and finally the Deep Ensemble (DeepEns), which fits a mixture of Gaussians parametrized using neural networks [24]. We vary the sample size between 100 and 1000, and repeat the simulation 50 times in each setting. Figure 4 shows the results. We first observe that the patterns of change in RMSE and L1 distance are similar. This is due to the fact that E(y|x) = ∫[I(y > 0) − F(y|x)] dy, so a model's improvement in estimating F is reflected directly in the improvement in RMSE. As shown, the RMSE and L1 distance for both stack and BAE stabilized at higher values due to their lack of flexibility in capturing the heterogeneity in the data, producing biased model estimates even in large samples. On the other hand, the mixture-of-Gaussian estimators (BME and DeepEns) and nonparametric estimators (CondKDE and BNE) continuously improve due to the flexibility of their distribution assumptions.
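The identity E(y|x) = ∫[I(y > 0) − F(y|x)] dy, which ties RMSE improvement to CDF estimation, is easy to verify numerically, and the same machinery makes the L1 metric concrete. A minimal sketch on a fixed grid (the Gaussian example and grid resolution are illustrative choices, not the paper's setup):

```python
import math

def norm_cdf(y, mu, sigma):
    """CDF of a Normal(mu, sigma) evaluated at y."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

def trapz(vals, grid):
    """Trapezoidal integration of sampled values over a grid."""
    return sum(0.5 * (vals[i] + vals[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))

def mean_from_cdf(cdf, grid):
    """E(y) = integral of [I(y > 0) - F(y)] dy."""
    return trapz([(1.0 if y > 0 else 0.0) - cdf(y) for y in grid], grid)

def l1_cdf_distance(cdf_a, cdf_b, grid):
    """Integral of |F_a(y) - F_b(y)| dy, the per-x term of the L1 metric."""
    return trapz([abs(cdf_a(y) - cdf_b(y)) for y in grid], grid)

grid = [-20.0 + 0.01 * k for k in range(4001)]  # covers [-20, 20]

# The identity recovers the mean of a Normal(1.5, 1) from its CDF alone,
# so any bias in the estimated CDF propagates directly into the mean.
mu = mean_from_cdf(lambda y: norm_cdf(y, 1.5, 1.0), grid)

# L1 distance between two unit-variance Gaussian CDFs shifted by 0.5.
d = l1_cdf_distance(lambda y: norm_cdf(y, 0.0, 1.0),
                    lambda y: norm_cdf(y, 0.5, 1.0), grid)
```

For the shifted-Gaussian pair the L1 distance equals the mean shift (here 0.5), which is exactly why an improvement in estimating F shows up as an improvement in RMSE.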
Comparing the best-performing models (DeepEns, CondKDE, and BNE), we notice that DeepEns has worse generalization performance in small samples, likely due to the instability of neural network estimators in the low-data regime. The performance of BNE and CondKDE is comparable in this time series experiment. However, we note that it is usually difficult to generalize kernel density estimators to higher dimensions [41].

Figure 4: Model's convergence behavior in prediction and uncertainty estimation with respect to F*(y|x). Left: RMSE. Right: L1 distance L(F, F*) = ∑x∈X ∫|F(y|x) − F*(y|x)| dy.

5 Application: Spatial integration of air pollution models in Massachusetts
In this section, we apply BNE to a real-world air pollution prediction ensemble system in Eastern Massachusetts consisting of three state-of-the-art PM2.5 exposure models [23, 16, 47]. We introduce the background of air pollution ensemble systems in Section H. Our goals are to understand the driving factors behind the ensemble system's uncertainty, and to detect the ensemble model's systematic bias in predicting annual air pollution concentrations. We implement our ensemble framework on the base models' out-of-sample predictions at 43 monitors in Eastern Massachusetts in 2011.
Figure 5 visualizes the BNE's posterior predictions and uncertainty decomposition across the study region. Further results are summarized in Section H. As shown, due to the sparsity of monitoring locations (only 43 in this modeling area), the model's overall uncertainty is driven mainly by the two types of epistemic uncertainties.
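The attribution of overall uncertainty to epistemic versus aleatoric components, as visualized in Figure 5, follows the usual law-of-total-variance logic once posterior draws are available. A minimal sketch at a single location (the draws are simulated stand-ins, not the fitted BNE posterior):

```python
import random

random.seed(0)

# Hypothetical posterior draws at one location x: each draw pairs a
# predictive mean with an aleatoric noise s.d. (held fixed at 1.2 here).
draws = [(random.gauss(10.0, 0.8), 1.2) for _ in range(5000)]

means = [m for m, _ in draws]
mbar = sum(means) / len(means)

# Epistemic part: variance of the predictive mean across posterior draws.
epistemic_var = sum((m - mbar) ** 2 for m in means) / len(means)

# Aleatoric part: average noise variance across posterior draws.
aleatoric_var = sum(s * s for _, s in draws) / len(draws)

# Law of total variance: overall predictive variance is the sum of the two.
total_var = epistemic_var + aleatoric_var
```

With only 43 monitors, a map of `epistemic_var` computed this way would dominate the overall uncertainty surface, mirroring the pattern reported for the application.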
More specifically, the model's parametric uncertainty in 5(c) highlights spatial regions where the disagreement among base model predictions has a substantial influence on the overall model uncertainty (e.g., the regions northwest of the City of Boston), suggesting further investigation of the performance of individual model predictions in these regions. Further, BNE's posterior estimate of the base models' systematic bias, i.e., P(Dδ(y|x) > 0), suggests evidence of over-estimated PM2.5 concentrations slightly north of Boston by the coast, and also around a monitor west of Boston, in Worcester, MA (see Supplementary Figure H.3).

Figure 5: Posterior Mean and Uncertainty Decomposition in the BNE model. (a) Posterior Mean; (b) Overall Uncertainty; (c) Parametric Uncertainty; (d) Structural Uncertainty.

6 Discussion and Future Work
We developed a principled Bayesian nonparametric augmentation framework for ensemble learning to: 1) mitigate model bias in the prediction and distribution functions, and 2) account for model uncertainties from different sources (aleatoric, parametric, structural) for a continuous outcome with complex observational noise. The main features of this method are accurate estimation of the aleatoric uncertainty, and principled detection and quantification of model misspecification in terms of its impact on model prediction. Experiments showed that the method produces well-calibrated estimates of aleatoric uncertainty and improved predictions under complex observational noise, as well as a complete quantification of different sources of epistemic uncertainty. Application to a real-world air pollution prediction problem shows how this method can help in understanding the factors driving model uncertainty, and in detecting systematic errors in the ensemble system.
There are three important future directions for this work.
The \ufb01rst direction is to adapt the BNE\nframework developed here to high dimension scenarios. This can be achieved by choosing kernel\nfunctions for \u03b4 and G that are suitable for high-dimensional problems. Example choices include\nthe additive kernel [17] or (deep) neural network kernel [3, 25]. Alternatively, one could also\nbuild variable selection into the model using shrinkage priors such as the Automatic Relevance\nDetermination (ARD), spike-and-slab, or Horseshoe [7, 49]. The second direction is to develop\ninference algorithm for BNE that are scalable to large dataset and at the same time produces rigorous\nuncertainty estimates. This is dif\ufb01cult with traditional variational inference algorithms since they\nusually does not enjoys a guarantee in fully capturing the posterior distribution. The third direction is\nto develop methods to model other important sources of uncertainty (e.g. algorithmic [8, 13, 21] and\ndata uncertainty [14, 32]) and to quantify their impact on model prediction.\nAcknowledgement Authors would like to thank Lorenzo Trippa, Jeff Miller, Boyu Ren at Harvard\nBiostatistics, Yoon Kim at Harvard CS and Ge Liu at MIT EECS for the insightful comments and\nfruitful discussion. This publication was made possible by USEPA grant RD-83587201. Its contents\nare solely the responsibility of the grantee and do not necessarily represent the of\ufb01cial views of the\nUSEPA. Further, USEPA does not endorse the purchase of any commercial products or services\nmentioned in the publication. Funding was also provided by NIH grants ES030616 and ES000002.\n\n9\n\n\fReferences\n[1] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man\u00e9. Concrete Problems\n\nin AI Safety. arXiv:1606.06565 [cs], June 2016. arXiv: 1606.06565.\n\n[2] C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343\u2013\n\n373, Dec. 2008.\n\n[3] F. Bach. 
Breaking the Curse of Dimensionality with Convex Neural Networks. arXiv:1412.8690 [cs, math, stat], Dec. 2014.

[4] H.-G. Beyer and B. Sendhoff. Robust optimization – A comprehensive survey. Computer Methods in Applied Mechanics and Engineering, 196(33):3190–3218, July 2007.

[5] P. Billingsley. Probability and Measure. Wiley, Hoboken, NJ, anniversary edition, Feb. 2012.

[6] C. H. Bishop and K. T. Shanley. Bayesian Model Averaging's Problematic Treatment of Extreme Weather and a Paradigm Shift That Fixes It. Monthly Weather Review, 136(12):4641–4652, Dec. 2008.

[7] J. F. Bobb, L. Valeri, B. Claus Henn, D. C. Christiani, R. O. Wright, M. Mazumdar, J. J. Godleski, and B. A. Coull. Bayesian kernel machine regression for estimating the health effects of multi-pollutant mixtures. Biostatistics (Oxford, England), 16(3):493–508, July 2015.

[8] L. Bottou and O. Bousquet. The Tradeoffs of Large Scale Learning. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 161–168. Curran Associates, Inc., 2008.

[9] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug. 1996.

[10] L. Breiman. Stacked regressions. Machine Learning, 24(1):49–64, July 1996.

[11] A. Buja, T. Hastie, and R. Tibshirani. Linear Smoothers and Additive Models. The Annals of Statistics, 17(2):453–510, June 1989.

[12] G. Casella and R. L. Berger. Statistical Inference. Duxbury Press, Pacific Grove, CA, 2nd edition, June 2001.

[13] Y.-C. Chen. Statistical Inference with Local Optima. arXiv:1807.04431 [math, stat], July 2018.

[14] S. Depeweg, J.-M. Hernandez-Lobato, F. Doshi-Velez, and S. Udluft. Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning.
In International Conference on Machine Learning, pages 1184–1193, July 2018.

[15] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft. Uncertainty Decomposition in Bayesian Neural Networks with Latent Variables. arXiv:1706.08495 [stat], June 2017.

[16] Q. Di, P. Koutrakis, and J. Schwartz. A hybrid prediction model for PM2.5 mass and components using a chemical transport model and land use regression. Atmospheric Environment, 131:390–399, Apr. 2016.

[17] N. Durrande, D. Ginsbourger, O. Roustant, and L. Carraro. Additive Covariance Kernels for High-Dimensional Gaussian Process Modeling. arXiv:1111.6233 [stat], Nov. 2011.

[18] Y. Gal. Uncertainty in Deep Learning. PhD Thesis, University of Cambridge, 2016.

[19] T. Gneiting, F. Balabdaoui, and A. E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268, Apr. 2007.

[20] L. Guiso and G. Parigi. Investment and Demand Uncertainty. The Quarterly Journal of Economics, 114(1):185–227, 1999.

[21] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep Learning with Limited Numerical Precision. In International Conference on Machine Learning, pages 1737–1746, June 2015.

[22] A. D. Kiureghian and O. Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105–112, Mar. 2009.

[23] I. Kloog, A. A. Chudnovsky, A. C. Just, F. Nordio, P. Koutrakis, B. A. Coull, A. Lyapustin, Y. Wang, and J. Schwartz. A new hybrid spatio-temporal model for estimating daily multi-year PM2.5 concentrations across northeastern USA using high resolution aerosol optical depth data. Atmospheric Environment, 95:581–590, Oct. 2014.

[24] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In I. Guyon, U. V.
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.

[25] K. Lee, H. Lee, K. Lee, and J. Shin. Training Confidence-calibrated Classifiers for Detecting Out-of-Distribution Samples. Nov. 2017.

[26] C. Li, S. Srivastava, and D. B. Dunson. Simple, scalable and accurate posterior interval estimation. Biometrika, 104(3):665–680, Sept. 2017.

[27] M. Li and D. B. Dunson. Comparing and weighting imperfect models using D-probabilities. Journal of the American Statistical Association, pages 1–33, Apr. 2019.

[28] Q. Li and J. Racine. Cross-validated local linear nonparametric regression. Statistica Sinica, 14(2):485–512, 2004.

[29] J. Z. Liu. Gaussian Process Regression and Classification under Mathematical Constraints with Learning Guarantees. arXiv:1904.09632 [cs, math, stat], Apr. 2019.

[30] M. Lorenzi and M. Filippone. Constraining the Dynamics of Deep Probabilistic Models. arXiv:1802.05680 [stat], Feb. 2018.

[31] S. N. MacEachern. Comment on article by Jain and Neal. Bayesian Analysis, 2(3):483–494, Sept. 2007.

[32] A. Malinin and M. Gales. Predictive Uncertainty Estimation via Prior Networks. arXiv:1802.10501 [cs, stat], Feb. 2018.

[33] C. A. Micchelli, Y. Xu, and H. Zhang. Universal Kernels. J. Mach. Learn. Res., 7:2651–2667, Dec. 2006.

[34] The Federal Reserve Board of Governors in Washington DC. The Federal Reserve Supervision and Regulation Letters 11-7: Guidance on Model Risk Management, Apr. 2011.

[35] Y. Qian, C. Jackson, F. Giorgi, B. Booth, Q. Duan, C. Forest, D. Higdon, Z. J. Hou, and G. Huerta. Uncertainty Quantification in Climate Modeling and Projection. Bulletin of the American Meteorological Society, 97(5):821–824, Jan. 2016.

[36] C. E. Rasmussen and C. K. I.
Williams. Gaussian Processes for Machine Learning. University Press Group Limited, Jan. 2006.

[37] B. J. Reich, J. S. Hodges, and V. Zadnik. Effects of residual smoothing on the posterior of the fixed effects in disease-mapping models. Biometrics, 62(4):1197–1206, Dec. 2006.

[38] J. Riihimäki and A. Vehtari. Gaussian processes with monotonicity information. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 645–652, Mar. 2010.

[39] D. Ruppert, M. P. Wand, and R. J. Carroll. Semiparametric Regression. Cambridge University Press, Cambridge; New York, 1st edition, July 2003.

[40] R. G. Sargent. Verification and validation of simulation models. In Proceedings of the 2010 Winter Simulation Conference, pages 166–183, Dec. 2010.

[41] D. W. Scott. The Curse of Dimensionality and Dimension Reduction. In Multivariate Density Estimation, pages 217–240. John Wiley & Sons, Ltd, 2015.

[42] T. J. Sullivan. Introduction to Uncertainty Quantification. Texts in Applied Mathematics. Springer International Publishing, 2015.

[43] S. D. Team. Stan User's Guide. 2018.

[44] A. van der Vaart and H. van Zanten. Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics, 36(3):1435–1463, June 2008.

[45] A. W. van der Vaart and J. H. van Zanten. Adaptive Bayesian Estimation Using a Gaussian Random Field with Inverse Gamma Bandwidth. The Annals of Statistics, 37(5B):2655–2675, 2009.

[46] A. W. van der Vaart and J. H. van Zanten. Information rates of nonparametric Gaussian process methods. Journal of Machine Learning Research, 12:2095–2119, 2011.

[47] A. van Donkelaar, R. V. Martin, R. J. D. Spurr, and R. T. Burnett.
High-Resolution Satellite-Derived PM2.5 from Optimal Estimation and Geographically Weighted Regression over North America. Environmental Science & Technology, 49(17):10482–10491, Sept. 2015.

[48] T. Vandal, E. Kodra, J. Dy, S. Ganguly, R. Nemani, and A. R. Ganguly. Quantifying Uncertainty in Discrete-Continuous and Skewed Data with Bayesian Deep Learning. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining - KDD '18, pages 2377–2386, 2018.

[49] G. Vo and D. Pati. Sparse Additive Gaussian Process with Soft Interactions. Open Journal of Statistics, 07(04):567–588, 2017.

[50] S. R. Waterhouse, D. MacKay, and A. J. Robinson. Bayesian Methods for Mixtures of Experts. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 351–357. MIT Press, 1996.

[51] Y. Yao, A. Vehtari, D. Simpson, and A. Gelman. Using Stacking to Average Bayesian Predictive Distributions (with Discussion). Bayesian Analysis, 13(3):917–1003, Sept. 2018.