{"title": "Reliable training and estimation of variance networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6326, "page_last": 6336, "abstract": "We propose and investigate new complementary methodologies for estimating predictive variance networks in regression neural networks. We derive a locally aware mini-batching scheme that results in sparse robust gradients, and we show how to make unbiased weight updates to a variance network. Further, we formulate a heuristic for robustly fitting both the mean and variance networks post hoc. Finally, we take inspiration from posterior Gaussian processes and propose a network architecture with similar extrapolation properties to Gaussian processes. The proposed methodologies are complementary, and improve upon baseline methods individually. Experimentally, we investigate the impact of predictive uncertainty on multiple datasets and tasks ranging from regression, active learning and generative modeling. Experiments consistently show significant improvements in predictive uncertainty estimation over state-of-the-art methods across tasks and datasets.", "full_text": "Reliable training and estimation of variance networks\n\nNicki S. Detlefsen\u2217 \u2020\n\nnsde@dtu.dk\n\nMartin J\u00f8rgensen* \u2020\n\nmarjor@dtu.dk\n\nS\u00f8ren Hauberg \u2020\nsohau@dtu.dk\n\nAbstract\n\nWe propose and investigate new complementary methodologies for estimating\npredictive variance networks in regression neural networks. We derive a locally\naware mini-batching scheme that results in sparse robust gradients, and we show\nhow to make unbiased weight updates to a variance network. Further, we formulate\na heuristic for robustly \ufb01tting both the mean and variance networks post hoc. Finally,\nwe take inspiration from posterior Gaussian processes and propose a network\narchitecture with similar extrapolation properties to Gaussian processes. 
The\nproposed methodologies are complementary, and improve upon baseline methods\nindividually. Experimentally, we investigate the impact of predictive uncertainty on\nmultiple datasets and tasks ranging from regression, active learning and generative\nmodeling. Experiments consistently show significant improvements in predictive\nuncertainty estimation over state-of-the-art methods across tasks and datasets.\n\n1 Introduction\n\nThe quality of mean predictions has dramatically increased in the last decade with the rediscovery of\nneural networks [LeCun et al., 2015]. The predictive variance, however, has turned out to be a more\nelusive target, with established solutions being subpar. The general finding is that neural networks\ntend to make overconfident predictions [Guo et al., 2017] that can be harmful or offensive [Amodei\net al., 2016]. This may be explained by neural networks being general function estimators that do\nnot come with principled uncertainty estimates. Another explanation is that variance estimation is a\nfundamentally different task than mean estimation, and that the tools for mean estimation perhaps do\nnot generalize. We focus on the latter hypothesis within regression.\nTo illustrate the main practical problems in variance estimation, we\nconsider a toy problem where data is generated as y = x \u00b7 sin(x) +\n0.3 \u00b7 \u03b51 + 0.3 \u00b7 x \u00b7 \u03b52, with \u03b51, \u03b52 \u223c N(0, 1) and x is uniform on [0, 10]\n(Fig. 1). As is common, we do maximum likelihood estimation of\nN(\u00b5(x), \u03c3\u00b2(x)), where \u00b5 and \u03c3\u00b2 are neural nets. While \u00b5 provides\nan almost perfect fit to the ground truth, \u03c3\u00b2 shows two problems: \u03c3\u00b2\nis significantly underestimated and \u03c3\u00b2 does not increase outside the\ndata support to capture the poor mean predictions.\nThese findings are general (Sec. 4), and alleviating them is the main\npurpose of the present paper. 
We find that this can be achieved by a\ncombination of methods that 1) change the usual mini-batching to be location aware; 2) only optimize\nvariance conditioned on the mean; 3) introduce a more robust likelihood function for scarce data;\nand 4) enforce well-behaved interpolation and extrapolation of variances. Points 1 and 2 are achieved\nthrough changes to the training algorithm, while 3 and 4 are changes to model specifications. We\nempirically demonstrate that these new tools significantly improve on state-of-the-art across datasets\nin tasks ranging from regression to active learning and generative modeling.\n\nFigure 1: Max. likelihood fit\nof N(\u00b5(x), \u03c3\u00b2(x)) to data.\n\n\u2217Equal contribution\n\u2020Section for Cognitive Systems, Technical University of Denmark\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Related work\n\nGaussian processes (GPs) are well-known function approximators with built-in uncertainty estimators\n[Rasmussen and Williams, 2006]. GPs are robust in settings with a low amount of data, and\ncan model a rich class of functions with few hyperparameters. However, GPs are computationally\nintractable for large amounts of data and limited by the expressiveness of a chosen kernel. Advances\nlike sparse and deep GPs [Snelson and Ghahramani, 2006, Damianou and Lawrence, 2013] partially\nalleviate this, but neural nets still tend to have more accurate mean predictions.\nUncertainty aware neural networks model the predictive mean and variance as two separate neural\nnetworks, often as multi-layer perceptrons. 
This originates with the work of Nix and Weigend [1994]\nand Bishop [1994]; today, the approach is commonly used for making variational approximations\n[Kingma and Welling, 2013, Rezende et al., 2014], and it is this general approach we investigate.\nBayesian neural networks (BNN) [MacKay, 1992] assume a prior distribution over the network\nparameters, and approximate the posterior distribution. This gives direct access to the approximate\npredictive uncertainty. In practice, placing an informative prior over the parameters is non-trivial.\nEven with advances in stochastic variational inference [Kingma and Welling, 2013, Rezende et al.,\n2014, Hoffman et al., 2013] and expectation propagation [Hern\u00e1ndez-Lobato and Adams, 2015], it is\nstill challenging to perform inference in BNNs.\nEnsemble methods represent the current state-of-the-art. Monte Carlo (MC) Dropout [Gal and\nGhahramani, 2016] measures the uncertainty induced by Dropout layers [Hinton et al., 2012], arguing\nthat this is a good proxy for predictive uncertainty. Deep Ensembles [Lakshminarayanan et al.,\n2017] form an ensemble from multiple neural networks trained with different initializations. Both\napproaches obtain ensembles of correlated networks, and the extent to which this biases the predictive\nuncertainty is unclear. Alternatives include estimating confidence intervals instead of variances\n[Pearce et al., 2018], and gradient-based Bayesian model averaging [Maddox et al., 2019].\nApplications of uncertainty include reinforcement learning, active learning, and Bayesian optimization\n[Szepesv\u00e1ri, 2010, Huang et al., 2010, Frazier, 2018]. Here, uncertainty is the crucial element\nthat allows for systematically making a trade-off between exploration and exploitation. 
It has also\nbeen shown that uncertainty is required to learn the topology of data manifolds [Hauberg, 2018].\nThe main categories of uncertainty are epistemic and aleatoric uncertainty [Kiureghian and\nDitlevsen, 2009, Kendall and Gal, 2017]. Aleatoric uncertainty is induced by unknown or unmeasured\nfeatures, and, hence, does not vanish in the limit of infinite data. Epistemic uncertainty\nis often referred to as model uncertainty, as it is the uncertainty due to model limitations. It is this\ntype of uncertainty that Bayesian and ensemble methods generally estimate. We focus on the overall\npredictive uncertainty, which reflects both epistemic and aleatoric uncertainty.\n\n3 Methods\n\nThe opening remarks (Sec. 1) highlighted two common problems that appear when \u00b5 and \u03c3\u00b2 are\nneural networks. In this section we analyze these problems and propose solutions.\nPreliminaries. We assume that datasets D = {xi, yi}_{i=1}^{N} contain i.i.d. observations yi \u2208 R, xi \u2208 R^D.\nThe targets yi are assumed to be conditionally Gaussian, p\u03b8(y|x) = N(y|\u00b5(x), \u03c3\u00b2(x)), where\n\u00b5 and \u03c3\u00b2 are continuous functions parametrized by \u03b8 = {\u03b8\u00b5, \u03b8\u03c3\u00b2}. The maximum likelihood estimate\n(MLE) of the variance of i.i.d. observations {yi}_{i=1}^{N} is 1/(N \u2212 1) \u2211_i (yi \u2212 \u00b5\u0302)\u00b2, where \u00b5\u0302 is the sample\nmean. This MLE does not exist based on a single observation, unless the mean \u00b5 is known, i.e. the\nmean is not a free parameter. When yi is Gaussian, the residuals (yi \u2212 \u00b5)\u00b2 are gamma distributed.\n\n3.1 A local likelihood model analysis\n\nBy assuming that both \u00b5 and \u03c3\u00b2 are continuous functions, we are implicitly saying that \u03c3\u00b2(x) is\ncorrelated with \u03c3\u00b2(x + \u03b4) for sufficiently small \u03b4, and similarly for \u00b5. 
Consider the local likelihood\nestimation problem [Loader, 1999, Tibshirani and Hastie, 1987] at a point xi,\n\nlog p\u0303\u03b8(yi|xi) = \u2211_{j=1}^{N} wj(xi) log p\u03b8(yj|xj),   (1)\n\nwhere wj is a function that declines as \u2016xj \u2212 xi\u2016 increases, implying that the local likelihood at xi\nis dependent on the points nearest to xi. Notice p\u0303\u03b8(yi|xi) = p\u03b8(yi|xi) if wj(xi) = 1_{i=j}. Consider,\nwith this w, a uniformly drawn subsample (i.e. a standard mini-batch) of the data {xk}_{k=1}^{M} and its\ncorresponding stochastic gradient of Eq. 1 with respect to \u03b8\u03c3\u00b2. If for a point, xi, no points near it\nare in the subsample, then no other point will influence the gradient of \u03c3\u00b2(xi), which will point in\nthe direction of the MLE, which is highly uninformative as it does not exist unless \u00b5(xi) is known.\nLocal data scarcity, thus, implies that while we have sufficient data for fitting a mean, locally we\nhave insufficient data for fitting a variance. Essentially, if a point is isolated in a mini-batch, all\ninformation it carries goes to updating \u00b5 and none is present for \u03c3\u00b2.\nIf we do not use mini-batches, we find that the gradients with respect to \u03b8\u00b5 and \u03b8\u03c3\u00b2 will both be scaled with\n1/(2\u03c3\u00b2(x)), meaning that points with small variances effectively have higher learning rates [Nix and\nWeigend, 1994]. This implies a bias towards low-noise regions of data.\n\n3.2 Horvitz-Thompson adjusted stochastic gradients\n\nWe will now consider a solution to this problem within the local likelihood framework, which will\ngive us a reliable, but biased, stochastic gradient for the usual (nonlocal) log-likelihood. We will then\nshow how this can be turned into an unbiased estimator.\nIf we are to add some local information, giving more reliable gradients, we should choose a w in Eq. 1\nthat reflects this. Assume for simplicity that wj(xi) = 1_{\u2016xi \u2212 xj\u2016 < d} for some d > 0. 
The gradient\nof log p\u0303\u03b8(y|xi) will then be informative, as more than one observation will contribute to the local\nvariance if d is chosen appropriately. Accordingly, we suggest a practical mini-batching algorithm\nthat samples a random point xj and we let the mini-batch consist of the k nearest neighbors of xj.\u00b3\nIn order to allow for more variability in a mini-batch, we suggest sampling m points uniformly,\nand then sampling n points among the k nearest neighbors of each of the m initially sampled\npoints. Note that this is a more informative sample, as all observations in the sample are likely to\ninfluence the same subset of parameters in \u03b8, effectively increasing the degrees of freedom\u2074, hence\nthe quality of variance estimation. In other words, if the variance network is sufficiently expressive,\nour Monte Carlo gradients under this sampling scheme are of smaller variation and more sparse. In\nthe supplementary material, we empirically show that this estimator yields significantly more sparse\ngradients, which results in improved convergence. Pseudo-code of this sampling scheme can be\nfound in the supplementary material.\nWhile such a mini-batch would give rise to an informative stochastic gradient, it would not be an\nunbiased stochastic gradient of the (nonlocal) log-likelihood. This can, however, be adjusted by using\nthe Horvitz-Thompson (HT) algorithm [Horvitz and Thompson, 1952], i.e. rescaling the log-likelihood\ncontribution of each sample xj by the inverse of its inclusion probability \u03c0j. With this, an unbiased estimate of the\nlog-likelihood (up to an additive constant) becomes\n\n\u2211_{i=1}^{N} { \u2212(1/2) log(\u03c3\u00b2(xi)) \u2212 (yi \u2212 \u00b5(xi))\u00b2 / (2\u03c3\u00b2(xi)) } \u2248 \u2211_{xj \u2208 O} (1/\u03c0j) { \u2212(1/2) log(\u03c3\u00b2(xj)) \u2212 (yj \u2212 \u00b5(xj))\u00b2 / (2\u03c3\u00b2(xj)) },   (2)\n\nwhere O denotes the mini-batch. With the nearest neighbor mini-batching, the inclusion probabilities\ncan be calculated as follows. The probability that observation j is in the sample is n/k if it is among\nthe k nearest neighbors of one of the initial m points, which are chosen with probability m/N, i.e.\n\n\u03c0j = (m/N) \u2211_{i=1}^{N} (n/k) 1_{j \u2208 Ok(i)},   (3)\n\nwhere Ok(i) denotes the k nearest neighbors of xi.\n\n\u00b3 By convention, we say that the nearest neighbor of a point is the point itself.\n\u2074 Degrees of freedom here refers to the parameters in a Gamma distribution \u2013 the distribution of variance\nestimators under Gaussian likelihood. Degrees of freedom in general is a quite elusive quantity in regression\nproblems.\n\nComputational costs. The proposed sampling scheme requires an upfront computational cost of\nO(N\u00b2D) before any training can begin. We stress that this is pre-training computation and not\nupdated during training. The cost is therefore relatively small, compared to training a neural network\nfor small to medium size datasets. Additionally, we note that the search algorithm does not have to\nbe precise, and we could therefore take advantage of fast approximate nearest neighbor algorithms\n[Fu and Cai, 2016].\n\n3.3 Mean-variance split training\n\nThe most common training strategy is to first optimize \u03b8\u00b5 assuming a constant \u03c3\u00b2, and then proceed\nto optimize \u03b8 = {\u03b8\u00b5, \u03b8\u03c3\u00b2} jointly, i.e. a warm-up of \u00b5. As previously noted, the MLE of \u03c3\u00b2 does\nnot exist when only a single observation is available and \u00b5 is unknown. However, the MLE does\nexist when \u00b5 is known, in which case it is \u03c3\u0302\u00b2(xi) = (yi \u2212 \u00b5(xi))\u00b2, assuming that the continuity of\n\u03c3\u00b2 is not crucial. This observation suggests that the usual training strategy is substandard as \u03c3\u00b2 is\nnever optimized assuming \u00b5 is known. 
This is easily solved: we suggest never updating \u00b5 and \u03c3\u00b2\nsimultaneously, i.e. only optimize \u00b5 conditioned on \u03c3\u00b2, and vice versa. This reads as sequentially\noptimizing p\u03b8(y|\u03b8\u00b5) and p\u03b8(y|\u03b8\u03c3\u00b2), as under these conditional distributions we may think of \u00b5\nand \u03c3\u00b2 as known, respectively. We will refer to this as mean-variance split training (MV).\n\n3.4 Estimating distributions of variance\n\nWhen \u03c3\u00b2(xi) is influenced by few observations, underestimation is still likely due to the left skewness\nof the gamma distribution of \u03c3\u0302\u00b2_i = (yi \u2212 \u00b5(xi))\u00b2. As always, when in a low data regime, it is sensible\nto be Bayesian about it; hence instead of point estimating \u03c3\u0302\u00b2_i we seek to find a distribution. Note\nthat we are not imposing a prior; we are training the parameters of a Bayesian model. We choose\nthe inverse-Gamma distribution, as this is the conjugate prior of \u03c3\u00b2 when data is Gaussian. This\nmeans \u03b8\u03c3\u00b2 = {\u03b8\u03b1, \u03b8\u03b2} where \u03b1, \u03b2 > 0 are the shape and scale parameters of the inverse-Gamma\nrespectively. So the log-likelihood is now calculated by integrating out \u03c3\u00b2\n\nlog p\u03b8(yi) = log \u222b N(yi|\u00b5i, \u03c3\u00b2_i) d\u03c3\u00b2_i = log t_{\u00b5i,\u03b1i,\u03b2i}(yi),   (4)\n\nwhere \u03c3\u00b2_i \u223c INV-GAMMA(\u03b1i, \u03b2i) and \u03b1i = \u03b1(xi), \u03b2i = \u03b2(xi) are modeled as neural networks.\nHaving an inverse-Gamma prior changes the predictive distribution to a location-scale\u2075 Student-t\ndistribution, parametrized with \u00b5, \u03b1 and \u03b2. Further, the t-distribution is often used as a replacement\nof the Gaussian when data is scarce and the true variance is unknown and yields a robust regression\n[Gelman et al., 2014, Lange et al., 1989]. We let \u03b1 and \u03b2 be neural networks that implicitly determine\nthe degrees of freedom and the scaling of the distribution. Recall that the higher the degrees of freedom,\nthe better the Gaussian approximation of the t-distribution.\n\n3.5 Extrapolation architecture\n\nIf we evaluate the local log-likelihood (Eq. 1) at a point x0 far away from all data points, then\nthe weights wi(x0) will all be near (or exactly) zero. Consequently, the local log-likelihood is\napproximately 0 regardless of the observed value y(x0), which should be interpreted as a large\nentropy of y(x0). Since we are working with Gaussian and t-distributed variables, we can recreate\nthis behavior by exploiting the fact that entropy is only an increasing function of the variance. We can\nre-enact this behavior by letting the variance tend towards an a priori determined value \u03b7 if x0 tends\naway from the training data. Let {ci}_{i=1}^{L} be points in R^D that represent the training data, akin to\ninducing points in sparse GPs [Snelson and Ghahramani, 2006]. Then define \u03b4(x0) = min_i \u2016ci \u2212 x0\u2016\nand\n\n\u03c3\u0302\u00b2(x0) = (1 \u2212 \u03bd(\u03b4(x0))) \u03c3\u0302\u00b2_\u03b8 + \u03b7 \u03bd(\u03b4(x0)),   (5)\n\nwhere \u03bd : [0,\u221e) \u2192 [0, 1] is a surjective, increasing function. Then the variance estimate will go to\n\u03b7 as \u03b4 \u2192 \u221e at a rate determined by \u03bd. In practice, we choose \u03bd to be a scaled-and-translated sigmoid\nfunction: \u03bd(x) = sigmoid((x + a)/\u03b3), where \u03b3 is a free parameter we optimize during training and\na \u2248 \u22126.9077\u03b3 to ensure that \u03bd(0) \u2248 0. The inducing points ci are initialized with k-means and\noptimized during training. This choice of architecture is similar to that attained by posterior Gaussian\nprocesses when the associated covariance function is stationary. It is indeed the behavior of these\nestablished models that we aim to mimic with Eq. 5.\n\n\u2075 This means y \u223c F, where F = \u00b5 + \u03c3t(\u03bd). 
The explicit density can be found in the supplementary material.\n\n4 Experiments\n\n4.1 Regression\n\nTo test our methodologies we conduct multiple experiments in various settings. We compare our\nmethod to state-of-the-art methods for quantifying uncertainty: Bayesian neural network (BNN)\n[Hern\u00e1ndez-Lobato and Adams, 2015], Monte Carlo Dropout (MC-Dropout) [Gal and Ghahramani,\n2016] and Deep Ensembles (Ens-NN) [Lakshminarayanan et al., 2017]. Additionally we compare to\ntwo baseline methods: standard mean-variance neural network (NN) [Nix and Weigend, 1994] and\nGPs (sparse GPs (SGP) when standard GPs are not applicable) [Rasmussen and Williams, 2006]. We\nrefer to our own method(s) as Combined, since we apply all the methodologies described in Sec. 3.\nImplementation details and code can be found in the supplementary material. Strict comparisons\nof the models should be carefully considered; having two separate networks to model mean and\nvariance separately (as NN, Ens-NN and Combined) means that all the predictive uncertainty, i.e. both\naleatoric and epistemic, is modeled by the variance networks alone. BNN and MC-Dropout have a\nhigher emphasis on modeling epistemic uncertainty, while GPs have the cleanest separation of noise\nand model uncertainty estimation. Despite the methods quantifying different types of uncertainty,\ntheir results can still be ranked by test set log-likelihood, which is a proper scoring function.\n\nToy regression. We first return to the toy problem of Sec. 1, where we consider 500 points from\ny = x \u00b7 sin(x) + 0.3 \u00b7 \u03b51 + 0.3 \u00b7 x \u00b7 \u03b52, with \u03b51, \u03b52 \u223c N(0, 1). In this example, the variance is\nheteroscedastic, and models should estimate larger variance for larger values of x. The results\u2076 can\nbe seen in Figs. 2 and 3. 
Our approach is the only one to satisfy all of the following: capture the\nheteroscedasticity, extrapolate high variance outside the data region, and not underestimate within.\n\nFigure 2: From top left to bottom right: GP, NN,\nBNN, MC-Dropout, Ens-NN, Combined.\n\nFigure 3: Standard deviation estimates\nas a function of x.\n\nVariance calibration. To our knowledge, no benchmark for quantifying variance estimation exists.\nWe propose a simple dataset with known uncertainty information. More precisely, we consider\nweather data from over 130 years.\u2077 Each day the maximum temperature is measured, and the\nuncertainty is then given as the variance in temperature over the 130 years. The fitted models can\nbe seen in Fig. 4. Here we measure performance by calculating the mean error in uncertainty:\nErr = (1/N) \u2211_{i=1}^{N} |\u03c3\u00b2_true(xi) \u2212 \u03c3\u00b2_est(xi)|. The numbers are reported above each fit. We observe that our\nCombined model achieves the lowest error of all the models, closely followed by Ens-NN and GP.\nNN, BNN and MC-Dropout all severely underestimate the uncertainty.\n\nAblation study. To determine the influence of each methodology from Sec. 3, we experimented\nwith four UCI regression datasets (Fig. 5). We split our contributions into four: the locality sampler\n(LS), the mean-variance split (MV), the inverse-gamma prior (IG) and the extrapolating architecture\n(EX). The combined model includes all four tricks. The results clearly show that the LS and IG\nmethodologies have the most impact on test set log likelihood, but none of the methodologies perform\nworse than the baseline model. 
Combined, they further improve the results, indicating that the\nproposed methodologies are complementary.\n\n\u2076 The standard deviation plotted for Combined is the root mean of the inverse-Gamma.\n\u2077 https://mrcc.illinois.edu/CLIMATE/Station/Daily/StnDyBTD2.jsp\n\n(a) GP (b) NN (c) BNN (d) MC-Dropout (e) Ens-NN (f) Combined\nFigure 4: Weather data with uncertainties. Dots are datapoints, green lines are the true uncertainty,\nblue curves are mean predictions and the blue shaded areas are the estimated uncertainties.\n\nFigure 5: The complementary methodologies from Sec. 3 evaluated on UCI benchmark datasets.\n\nUCI benchmark. We now follow the experimental setup from Hern\u00e1ndez-Lobato and Adams\n[2015], by evaluating models on a number of regression datasets from the UCI machine learning\ndatabase. In addition to the standard benchmark, we have added 4 datasets. Test set log-likelihood can\nbe seen in Table 1, and the corresponding RMSE scores can be found in the supplementary material.\nOur Combined model performs best on 10 of the 13 datasets. For the small Boston and Yacht datasets,\nthe standard GP performs the best, which is in line with the experience that GPs perform well when\ndata is scarce. On these datasets our model is the best-performing neural network. On the Energy\nand Protein datasets Ens-NN performs the best, closely followed by our Combined model. One clear\nadvantage of our model compared to Ens-NN is that we only need to train one model, whereas\nEns-NN needs to train 5+ (see the supplementary material for training times for each model). 
The\nworst performing model in all cases is the baseline NN model, which clearly indicates that the usual\ntools for mean estimation do not carry over to variance estimation.\n\nDataset | N | D | GP | SGP | NN | BNN | MC-Dropout | Ens-NN | Combined\nBoston | 506 | 13 | \u22121.76 \u00b1 0.3 | \u22121.85 \u00b1 0.25 | \u22123.64 \u00b1 0.09 | \u22122.59 \u00b1 0.11 | \u22122.51 \u00b1 0.31 | \u22122.45 \u00b1 0.25 | \u22122.09 \u00b1 0.09\nCarbon | 10721 | 7 | - | 3.74 \u00b1 0.53 | \u22122.03 \u00b1 0.14 | \u22121.1 \u00b1 1.76 | \u22121.08 \u00b1 0.05 | \u22120.44 \u00b1 7.28 | 4.35 \u00b1 0.16\nConcrete | 1030 | 8 | \u22122.13 \u00b1 0.14 | \u22122.29 \u00b1 0.12 | \u22124.23 \u00b1 0.07 | \u22123.31 \u00b1 0.05 | \u22123.11 \u00b1 0.12 | \u22123.06 \u00b1 0.32 | \u22121.78 \u00b1 0.04\nEnergy | 768 | 8 | \u22121.85 \u00b1 0.34 | \u22122.22 \u00b1 0.15 | \u22123.78 \u00b1 0.04 | \u22122.07 \u00b1 0.08 | \u22122.01 \u00b1 0.11 | \u22121.48 \u00b1 0.31 | \u22121.68 \u00b1 0.13\nKin8nm | 8192 | 8 | - | 2.01 \u00b1 0.02 | \u22120.08 \u00b1 0.02 | 0.95 \u00b1 0.08 | 0.95 \u00b1 0.15 | 1.18 \u00b1 0.03 | 2.49 \u00b1 0.07\nNaval | 11934 | 16 | - | - | 3.47 \u00b1 0.21 | 3.71 \u00b1 0.05 | 3.80 \u00b1 0.09 | 5.55 \u00b1 0.05 | 7.27 \u00b1 0.13\nPower plant | 9568 | 4 | - | \u22121.9 \u00b1 0.03 | \u22124.26 \u00b1 0.14 | \u22122.89 \u00b1 0.01 | \u22122.89 \u00b1 0.14 | \u22122.77 \u00b1 0.04 | \u22121.19 \u00b1 0.03\nProtein | 45730 | 9 | - | - | \u22122.95 \u00b1 0.09 | \u22122.91 \u00b1 0.00 | \u22122.93 \u00b1 0.14 | \u22122.80 \u00b1 0.02 | \u22122.83 \u00b1 0.05\nSuperconduct | 21263 | 81 | - | \u22124.07 \u00b1 0.01 | \u22124.92 \u00b1 0.10 | \u22123.06 \u00b1 0.14 | \u22122.91 \u00b1 0.19 | \u22123.01 \u00b1 0.05 | \u22122.43 \u00b1 0.05\nWine (red) | 1599 | 11 | 0.96 \u00b1 0.18 | \u22120.08 \u00b1 0.01 | \u22121.19 \u00b1 0.11 | \u22120.98 \u00b1 0.01 | \u22120.94 \u00b1 0.01 | \u22120.93 \u00b1 0.09 | 1.21 \u00b1 0.23\nWine (white) | 4898 | 11 | - | \u22120.14 \u00b1 0.05 | \u22121.29 \u00b1 0.09 | \u22121.41 \u00b1 0.17 | \u22121.26 \u00b1 0.01 | \u22120.99 \u00b1 0.06 | 0.40 \u00b1 0.42\nYacht | 308 | 7 | 0.16 \u00b1 1.22 | \u22120.38 \u00b1 0.32 | \u22124.12 \u00b1 0.17 | \u22121.65 \u00b1 0.05 | \u22121.55 \u00b1 0.12 | \u22121.18 \u00b1 0.21 | \u22120.07 \u00b1 0.05\nYear | 515345 | 90 | - | - | \u22125.21 \u00b1 0.87 | \u22123.97 \u00b1 0.34 | \u22123.78 \u00b1 0.01 | \u22123.42 \u00b1 0.02 | \u22123.01 \u00b1 0.14\nTable 1: Dataset characteristics and test set log-likelihoods for the different methods. A - indicates\nthe model was infeasible to train. Bold highlights the best results.\n\n[Figure 5 panels: Boston, Concrete, Wine (red), Wine (white); y-axis: log p(x); models: NN, NN+LS, NN+MV, NN+EX, NN+IG, Combined]\n\nActive learning. The performance of active learning depends on predictive uncertainty [Settles,\n2009], so we use this to demonstrate the improvements induced by our method. We use the same\nnetwork architectures and datasets as in the UCI benchmark. Each dataset is split into: 20% train, 60%\npool and 20% test. For each active learning iteration, we first train a model, evaluate the performance\non the test set and then estimate uncertainty for all datapoints in the pool. We then select the n points\nwith highest variance (corresponding to highest entropy [Houlsby et al., 2012]) and add these to the\ntraining set. We set n = 1% of the initial pool size. This is repeated 10 times, such that the last model\nis trained on 30%. We repeat this on 10 random training-test splits to compute standard errors.\nFig. 6 shows the evolution of average RMSE for each method during the data collection process for\nthe Boston, Superconduct and Wine (white) datasets (all remaining UCI datasets are visualized in the\nsupplementary material). In general, we observe two trends. For some datasets we observe that our\nCombined model outperforms all other models, achieving significantly faster learning. This indicates\nthat our model is better at predicting the uncertainty of the data in the pool set. 
On datasets where the\nsampling process does not increase performance, we are on par with other models.\n\nFigure 6: Average test set RMSE and standard errors in active learning. The remaining datasets are\nshown in the supplementary material.\n\n4.2 Generative models\n\nTo show a broader application of our approach, we also explore it in the context of generative\nmodeling. We focus on variational autoencoders (VAEs) [Kingma and Welling, 2013, Rezende et al.,\n2014] that are popular deep generative models. A VAE models the generative process:\n\np(x) = \u222b p\u03b8(x|z) p(z) dz,   p\u03b8(x|z) = N(x|\u00b5\u03b8(z), \u03c3\u00b2_\u03b8(z))   or   p\u03b8(x|z) = B(x|\u00b5\u03b8(z)),   (6)\n\nwhere p(z) = N(0, Id).\nThis is trained by introducing a variational approximation\nq\u03c6(z|x) = N(z|\u00b5\u03c6(x), \u03c3\u00b2_\u03c6(x)) and then jointly training p\u03b8 and q\u03c6. For our purposes, it is\nsufficient to note that a VAE estimates both a mean and a variance function. Thus using standard\ntraining methods, the same problems arise as in the regression setting. Mattei and Frellsen [2018]\nhave recently shown that estimating a VAE is ill-posed unless the variance is bounded from below.\nIn the literature, we often find that\n1. Variance networks are avoided by using a Bernoulli distribution, even if data is not binary.\n2. Optimizing VAEs with a Gaussian posterior is considerably harder than the Bernoulli case. To\novercome this, the variance is often set to a constant e.g. \u03c3\u00b2(z) = 1. The consequence is that the\nlog-likelihood reconstruction term in the ELBO collapses into an L2 reconstruction term.\n3. Even though the generative process is given by Eq. 6, samples shown in the literature are often\nreduced to x\u0303 = \u00b5(z), z \u223c N(0, I). This is probably due to the wrong/meaningless variance term.\nWe aim to fix this by training the posterior variance \u03c3\u00b2_\u03b8(z) with our Combined method. 
We do not\nchange the encoder variance \u03c3\u00b2_\u03c6(x) and leave this to future study.\n\nArtificial data. We first evaluate the benefits of more reliable variance networks in VAEs on\nartificial data. We generate data inspired by the two moon dataset\u2078, which we map into four\ndimensions. The mapping is thoroughly described in the supplementary material, and we emphasize\nthat we have deliberately used mappings that MLPs struggle to learn, thus with a low capacity\nnetwork the only way to compensate is to learn a meaningful variance function.\nIn Fig. 7 we plot pairs of output dimensions using 5000 generated samples. For all pairwise\ncombinations we refer to the supplementary material. We observe that samples from our Comb-VAE\ncapture the data distribution in more detail than a standard VAE. For VAE the variance seems to be\nunderestimated, which is similar to the results from regression. The poor sample quality of a standard\nVAE can partially be explained by the arbitrariness of decoder variance function \u03c3\u00b2(z) away from\ndata. In Fig. 8, we calculated the accumulated variance \u2211_{j=1}^{D} \u03c3\u00b2_j(z) over a grid of latent points.\nWe clearly see that for the standard VAE, the variance is low where we have data and arbitrary away\nfrom data. However, our method produces a low-variance region where the two half moons are and\na high variance region away from data. We note that Arvanitidis et al. [2018] also dealt with the\nproblem of arbitrariness of the decoder variance. However their method relies on post-fitting of the\nvariance, whereas ours is fitted during training. Additionally, we note that Takahashi et al. [2018]\nalso successfully modeled the posterior of a VAE as a Student t-distribution similar to our proposed\nmethod, but without the extrapolation and different training procedure.\n\n\u2078 https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html\n\nGround truth | VAE | Comb-VAE\nFigure 7: The ground truth and generated distributions.\nTop: x1 vs. x2. Bottom: x2 vs x3.\n\nFigure 8: Variance estimates in latent space for\nstandard VAE (top) and\nour Comb-VAE (bottom).\nBlue points are the encoded\ntraining data.\n\nMetric, model | MNIST | FashionMNIST | CIFAR10 | SVHN\nELBO, VAE | 2053.01 \u00b1 1.60 | 1506.31 \u00b1 2.71 | 1980.84 \u00b1 3.32 | 3696.35 \u00b1 2.94\nELBO, Comb-VAE | 2152.31 \u00b1 3.32 | 1621.29 \u00b1 7.23 | 2057.32 \u00b1 8.13 | 3701.41 \u00b1 5.84\nlog p(x), VAE | 1914.77 \u00b1 2.15 | 1481.38 \u00b1 3.68 | 1809.43 \u00b1 10.32 | 3606.28 \u00b1 2.75\nlog p(x), Comb-VAE | 2018.37 \u00b1 4.35 | 1567.23 \u00b1 4.82 | 1891.39 \u00b1 20.21 | 3614.39 \u00b1 7.91\nTable 2: Generative modeling of 4 datasets. For each dataset we report training ELBO and test set\nlog-likelihood. The standard errors are calculated over 3 trained models with random initialization.\n\nImage data. For our last set of experiments we fitted a standard VAE and our Comb-VAE to\nfour datasets: MNIST, FashionMNIST, CIFAR10, SVHN. We want to measure whether there is an\nimprovement to generative modeling by getting better variance estimation. The details about network\narchitecture and training can be found in the supplementary material. Training set ELBO and test\nset log-likelihoods can be viewed in Table 2. We observe that, on average, tighter\nbounds and higher log-likelihoods are achieved on all datasets, indicating that we better fit the data distribution. We\nqualitatively observe (see Fig. 9) that variance has a more local structure for Comb-VAE and that\nthe variance reflects the underlying latent structure.\n\n5 Discussion & Conclusion\n\nWhile variance networks are commonly used for modeling the predictive uncertainty in regression\nand in generative modeling, there have been no systematic studies of how to fit these to data. 
We have demonstrated that tools developed for fitting mean networks to data are subpar when applied to variance estimation. The key underlying issue appears to be that it is not feasible to estimate both a mean and a variance at the same time when data is scarce.\n\nFigure 9: Generated MNIST images on a grid in latent space using the standard variance network (left) and proposed variance network (right).\n\nWhile it is beneficial to have separate estimates of both epistemic and aleatoric uncertainty, we have focused on predictive uncertainty, which combines the two. This is a lesser but more feasible goal.\nWe have proposed a new mini-batching scheme that samples locally to ensure that variances are better defined during model training. We have further argued that variance estimation is more meaningful when conditioned on the mean, which implies a change to the usual training procedure of joint mean-variance estimation. To cope with data scarcity we have proposed a more robust likelihood that models a distribution over the variance. Finally, we have highlighted that variance networks need to extrapolate differently from mean networks, which implies architectural differences between such networks. We specifically propose a new architecture for variance networks that ensures similar variance extrapolations to posterior Gaussian processes from stationary priors.\nOur methodologies depend on algorithms that compute Euclidean distances. Since these often break down in high dimensions, this indicates that our proposed methods may not be suitable for high-dimensional data. 
However, since we mostly rely on nearest neighbor computations, which empirically are known to perform better in high dimensions, our methodologies may still work in this case. Interestingly, the very definition of variance is dependent on Euclidean distance, and this may indicate that variance is inherently difficult to estimate for high-dimensional data. This could possibly be circumvented through a learned metric.\nExperimentally, we have demonstrated that the proposed methods are complementary and provide significant improvements over state-of-the-art. In particular, on benchmark data we have shown that our method improves upon the test set log-likelihood without improving the RMSE, which demonstrates that the uncertainty estimates are a significant improvement over current methods. Another indicator of improved uncertainty estimation is that our method speeds up active learning tasks compared to state-of-the-art. Due to the similarities between active learning, Bayesian optimization, and reinforcement learning, we expect that our approach carries significant value to these fields as well. Furthermore, we have demonstrated that variational autoencoders can be improved through better generative variance estimation. Finally, we note that our approach is directly applicable alongside ensemble methods, which may further improve results.\n\nAcknowledgements. This project has received funding from the European Research Council (ERC) under the European Union\u2019s Horizon 2020 research and innovation programme (grant agreement no 757360). NSD, MJ and SH were supported in part by a research grant (15334) from VILLUM FONDEN. We gratefully acknowledge the support of NVIDIA Corporation with the donation of GPU hardware used for this research.\n\nReferences\n\nD. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man\u00e9. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.\n\nG. Arvanitidis, L. K. Hansen, and S. 
Hauberg. Latent space oddity: on the curvature of deep generative models. In International Conference on Learning Representations, 2018.\n\nC. M. Bishop. Mixture density networks. Technical report, Citeseer, 1994.\n\nA. Damianou and N. D. Lawrence. Deep Gaussian processes. Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.\n\nP. I. Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.\n\nC. Fu and D. Cai. Efanna: An extremely fast approximate nearest neighbor search algorithm based on kNN graph. 09 2016.\n\nY. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050\u20131059, 2016.\n\nA. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. Bayesian Data Analysis. CRC Press, 2014.\n\nC. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1321\u20131330. JMLR.org, 2017.\n\nS. Hauberg. Only Bayes should learn a manifold (on the estimation of differential geometric structure from data). arXiv preprint arXiv:1806.04994, 2018.\n\nJ. M. Hern\u00e1ndez-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861\u20131869, 2015.\n\nG. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint, arXiv, 07 2012.\n\nM. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303\u20131347, 2013.\n\nD. G. Horvitz and D. J. Thompson. 
A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663\u2013685, 1952.\n\nN. Houlsby, F. Huszar, Z. Ghahramani, and J. M. Hern\u00e1ndez-Lobato. Collaborative Gaussian processes for preference learning. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2096\u20132104. Curran Associates, Inc., 2012.\n\nS.-J. Huang, R. Jin, and Z.-H. Zhou. Active learning by querying informative and representative examples. In Advances in Neural Information Processing Systems, pages 892\u2013900, 2010.\n\nA. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574\u20135584, 2017.\n\nD. P. Kingma and M. Welling. Auto-encoding variational Bayes. ICLR, 12 2013.\n\nA. D. Kiureghian and O. Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105\u2013112, 2009. Risk Acceptance and Risk Communication.\n\nB. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402\u20136413, 2017.\n\nK. L. Lange, R. J. A. Little, and J. M. G. Taylor. Robust statistical modeling using the t distribution. Journal of the American Statistical Association, 84(408):881\u2013896, 1989.\n\nY. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 521(7553):436\u2013444, 2015.\n\nC. Loader. Local Regression and Likelihood. Springer, New York, 1999.\n\nD. J. C. MacKay. A Practical Bayesian Framework for Backpropagation Networks. Neural Comput., 4(3):448\u2013472, May 1992.\n\nW. Maddox, T. Garipov, P. Izmailov, D. Vetrov, and A. G. Wilson. A Simple Baseline for Bayesian Uncertainty in Deep Learning. CoRR, Feb 2019.\n\nP.-A. Mattei and J. Frellsen. 
Leveraging the exact likelihood of deep latent variable models. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS\u201918, pages 3859\u20133870, USA, 2018. Curran Associates Inc.\n\nD. Nix and A. Weigend. Estimating the mean and variance of the target probability distribution. In Proc. 1994 IEEE Int. Conf. Neural Networks, pages 55\u201360 vol.1. IEEE, 1994.\n\nT. Pearce, M. Zaki, A. Brintrup, and A. Neely. High-Quality Prediction Intervals for Deep Learning: A Distribution-Free, Ensembled Approach. In Proceedings of the 35th International Conference on Machine Learning, Feb 2018.\n\nC. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. University Press Group Limited, 2006.\n\nD. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In E. P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278\u20131286, Beijing, China, 22\u201324 Jun 2014. PMLR.\n\nB. Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.\n\nE. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257\u20131264, 2006.\n\nC. Szepesv\u00e1ri. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1\u2013103, 2010.\n\nH. Takahashi, T. Iwata, Y. Yamanaka, M. Yamada, and S. Yagi. Student-t variational autoencoder for robust density estimation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 2696\u20132702. International Joint Conferences on Artificial Intelligence Organization, 7 2018.\n\nR. Tibshirani and T. Hastie. Local likelihood estimation. 
Journal of the American Statistical Association, 82(398):559\u2013567, 1987.", "award": [], "sourceid": 3425, "authors": [{"given_name": "Nicki", "family_name": "Skafte", "institution": "Technical University of Denmark"}, {"given_name": "Martin", "family_name": "J\u00f8rgensen", "institution": "Technical University of Denmark"}, {"given_name": "S\u00f8ren", "family_name": "Hauberg", "institution": "Technical University of Denmark"}]}