{"title": "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles", "book": "Advances in Neural Information Processing Systems", "page_first": 6402, "page_last": 6413, "abstract": "Deep neural networks (NNs) are powerful black box predictors that have recently achieved impressive performance on a wide spectrum of tasks. Quantifying predictive uncertainty in NNs is a challenging and yet unsolved problem. Bayesian NNs, which learn a distribution over weights, are currently the state-of-the-art for estimating predictive uncertainty; however these require significant modifications to the training procedure and are computationally expensive compared to standard (non-Bayesian) NNs. We propose an alternative to Bayesian NNs that is simple to implement, readily parallelizable, requires very little hyperparameter tuning, and yields high quality predictive uncertainty estimates. Through a series of experiments on classification and regression benchmarks, we demonstrate that our method produces well-calibrated uncertainty estimates which are as good or better than approximate Bayesian NNs. To assess robustness to dataset shift, we evaluate the predictive uncertainty on test examples from known and unknown distributions, and show that our method is able to express higher uncertainty on out-of-distribution examples. We demonstrate the scalability of our method by evaluating predictive uncertainty estimates on ImageNet.", "full_text": "Simple and Scalable Predictive Uncertainty\n\nEstimation using Deep Ensembles\n\nBalaji Lakshminarayanan Alexander Pritzel Charles Blundell\n\nDeepMind\n\n{balajiln,apritzel,cblundell}@google.com\n\nAbstract\n\nDeep neural networks (NNs) are powerful black box predictors that have recently\nachieved impressive performance on a wide spectrum of tasks. Quantifying pre-\ndictive uncertainty in NNs is a challenging and yet unsolved problem. 
Bayesian\nNNs, which learn a distribution over weights, are currently the state-of-the-art\nfor estimating predictive uncertainty; however these require significant modifications\nto the training procedure and are computationally expensive compared to\nstandard (non-Bayesian) NNs. We propose an alternative to Bayesian NNs that\nis simple to implement, readily parallelizable, requires very little hyperparameter\ntuning, and yields high quality predictive uncertainty estimates. Through a series\nof experiments on classification and regression benchmarks, we demonstrate that\nour method produces well-calibrated uncertainty estimates which are as good or\nbetter than approximate Bayesian NNs. To assess robustness to dataset shift, we\nevaluate the predictive uncertainty on test examples from known and unknown\ndistributions, and show that our method is able to express higher uncertainty on\nout-of-distribution examples. We demonstrate the scalability of our method by\nevaluating predictive uncertainty estimates on ImageNet.\n\n1 Introduction\nDeep neural networks (NNs) have achieved state-of-the-art performance on a wide variety of machine\nlearning tasks [35] and are becoming increasingly popular in domains such as computer vision\n[32], speech recognition [25], natural language processing [42], and bioinformatics [2, 61]. Despite\nimpressive accuracies in supervised learning benchmarks, NNs are poor at quantifying predictive\nuncertainty, and tend to produce overconfident predictions. Overconfident incorrect predictions can be\nharmful or offensive [3], hence proper uncertainty quantification is crucial for practical applications.\nEvaluating the quality of predictive uncertainties is challenging as the \u2018ground truth\u2019 uncertainty\nestimates are usually not available. In this work, we shall focus upon two evaluation measures that\nare motivated by practical applications of NNs. 
Firstly, we shall examine calibration [12, 13], a\nfrequentist notion of uncertainty which measures the discrepancy between subjective forecasts and\n(empirical) long-run frequencies. The quality of calibration can be measured by proper scoring rules\n[17] such as log predictive probabilities and the Brier score [9]. Note that calibration is an orthogonal\nconcern to accuracy: a network\u2019s predictions may be accurate and yet miscalibrated, and vice versa.\nThe second notion of quality of predictive uncertainty we consider concerns generalization of the\npredictive uncertainty to domain shift (also referred to as out-of-distribution examples [23]), that is,\nmeasuring if the network knows what it knows. For example, if a network trained on one dataset is\nevaluated on a completely different dataset, then the network should output high predictive uncertainty\nas inputs from a different dataset would be far away from the training data. Well-calibrated predictions\nthat are robust to model misspecification and dataset shift have a number of important practical uses\n(e.g., weather forecasting, medical diagnosis).\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThere has been a lot of recent interest in adapting NNs to encompass uncertainty and probabilistic\nmethods. 
The majority of this work revolves around a Bayesian formalism [4], whereby a prior\ndistribution is specified upon the parameters of a NN and then, given the training data, the posterior\ndistribution over the parameters is computed, which is used to quantify predictive uncertainty.\nSince exact Bayesian inference is computationally intractable for NNs, a variety of approximations\nhave been developed including Laplace approximation [40], Markov chain Monte Carlo (MCMC)\nmethods [46], as well as recent work on variational Bayesian methods [6, 19, 39], assumed density\nfiltering [24], expectation propagation [21, 38] and stochastic gradient MCMC variants such as\nLangevin diffusion methods [30, 59] and Hamiltonian methods [53]. The quality of predictive\nuncertainty obtained using Bayesian NNs crucially depends on (i) the degree of approximation due\nto computational constraints and (ii) if the prior distribution is \u2018correct\u2019, as priors of convenience\ncan lead to unreasonable predictive uncertainties [50]. In practice, Bayesian NNs are often harder\nto implement and computationally slower to train compared to non-Bayesian NNs, which raises\nthe need for a \u2018general purpose solution\u2019 that can deliver high-quality uncertainty estimates and yet\nrequires only minor modifications to the standard training pipeline.\nRecently, Gal and Ghahramani [15] proposed using Monte Carlo dropout (MC-dropout) to estimate\npredictive uncertainty by using Dropout [54] at test time. There has been work on approximate\nBayesian interpretation [15, 29, 41] of dropout. MC-dropout is relatively simple to implement,\nleading to its popularity in practice. Interestingly, dropout may also be interpreted as ensemble model\ncombination [54] where the predictions are averaged over an ensemble of NNs (with parameter\nsharing). 
The ensemble interpretation seems more plausible particularly in the scenario where the\ndropout rates are not tuned based on the training data, since any sensible approximation to the true\nBayesian posterior distribution has to depend on the training data. This interpretation motivates the\ninvestigation of ensembles as an alternative solution for estimating predictive uncertainty.\nIt has long been observed that ensembles of models improve predictive performance (see [14] for a\nreview). However it is not obvious when and why an ensemble of NNs can be expected to produce\ngood uncertainty estimates. Bayesian model averaging (BMA) assumes that the true model lies within\nthe hypothesis class of the prior, and performs soft model selection to find the single best model within\nthe hypothesis class [43]. In contrast, ensembles perform model combination, i.e. they combine the\nmodels to obtain a more powerful model; ensembles can be expected to be better when the true model\ndoes not lie within the hypothesis class. We refer to [11, 43] and [34, \u00a72.5] for related discussions.\nIt is important to note that even exact BMA is not guaranteed to be robust to mis-specification with\nrespect to domain shift.\nSummary of contributions: Our contribution in this paper is twofold. First, we describe a simple and\nscalable method for estimating predictive uncertainty from NNs. We argue for training\nprobabilistic NNs (that model predictive distributions) using a proper scoring rule as the training\ncriterion. We additionally investigate the effect of two modifications to the training pipeline, namely\n(i) ensembles and (ii) adversarial training [18] and describe how they can produce smooth predictive\nestimates. Secondly, we propose a series of tasks for evaluating the quality of the predictive uncertainty\nestimates, in terms of calibration and generalization to unknown classes in supervised learning\nproblems. 
We show that our method significantly outperforms (or matches) MC-dropout. These tasks,\nalong with our simple yet strong baseline, serve as a useful benchmark for comparing predictive\nuncertainty estimates obtained using different Bayesian/non-Bayesian/hybrid methods.\nNovelty and Significance: Ensembles of NNs, or deep ensembles for short, have been successfully\nused to boost predictive performance (e.g. classification accuracy in ImageNet or Kaggle contests)\nand adversarial training has been used to improve robustness to adversarial examples. However, to\nthe best of our knowledge, ours is the first work to investigate their usefulness for predictive uncertainty\nestimation and compare their performance to current state-of-the-art approximate Bayesian\nmethods on a series of classification and regression benchmark datasets. Compared to Bayesian\nNNs (e.g. variational inference or MCMC methods), our method is much simpler to implement,\nrequires surprisingly few modifications to standard NNs, and is well suited for distributed computation,\nthereby making it attractive for large-scale deep learning applications. To demonstrate scalability of\nour method, we evaluate predictive uncertainty on ImageNet (and are the first to do so, to the best of\nour knowledge). Most work on uncertainty in deep learning focuses on Bayesian deep learning; we\nhope that the simplicity and strong empirical performance of our approach will spark more interest in\nnon-Bayesian approaches for predictive uncertainty estimation.\n\n2 Deep Ensembles: A Simple Recipe For Predictive Uncertainty Estimation\n\n2.1 Problem setup and High-level summary\nWe assume that the training dataset $\\mathcal{D}$ consists of $N$ i.i.d. data points $\\mathcal{D} = \\{x_n, y_n\\}_{n=1}^N$, where\n$x \\in \\mathbb{R}^D$ represents the $D$-dimensional features. For classification problems, the label is assumed\nto be one of $K$ classes, that is $y \\in \\{1, \\ldots, K\\}$. 
For regression problems, the label is assumed to\nbe real-valued, that is $y \\in \\mathbb{R}$. Given the input features $x$, we use a neural network to model the\nprobabilistic predictive distribution $p_\\theta(y|x)$ over the labels, where $\\theta$ are the parameters of the NN.\nWe suggest a simple recipe: (1) use a proper scoring rule as the training criterion, (2) use adversarial\ntraining [18] to smooth the predictive distributions, and (3) train an ensemble. Let $M$ denote the\nnumber of NNs in the ensemble and $\\{\\theta_m\\}_{m=1}^M$ denote the parameters of the ensemble. We first\ndescribe how to train a single neural net and then explain how to train an ensemble of NNs.\n\n2.2 Proper scoring rules\nScoring rules measure the quality of predictive uncertainty (see [17] for a review). A scoring rule\nassigns a numerical score to a predictive distribution $p_\\theta(y|x)$, rewarding better calibrated predictions\nover worse. We shall consider scoring rules where a higher numerical score is better. Let a scoring\nrule be a function $S(p_\\theta, (y, x))$ that evaluates the quality of the predictive distribution $p_\\theta(y|x)$ relative\nto an event $y|x \\sim q(y|x)$ where $q(y, x)$ denotes the true distribution on $(y, x)$-tuples. The expected\nscoring rule is then $S(p_\\theta, q) = \\int q(y, x) S(p_\\theta, (y, x))\\, dy\\, dx$. A proper scoring rule is one where\n$S(p_\\theta, q) \\leq S(q, q)$ with equality if and only if $p_\\theta(y|x) = q(y|x)$, for all $p_\\theta$ and $q$. NNs can then be\ntrained according to a measure that encourages calibration of predictive uncertainty by minimizing the\nloss $\\mathcal{L}(\\theta) = -S(p_\\theta, q)$.\nIt turns out many common NN loss functions are proper scoring rules. For example, when maximizing\nlikelihood, the score function is $S(p_\\theta, (y, x)) = \\log p_\\theta(y|x)$, and this is a proper scoring rule due\nto Gibbs inequality: $S(p_\\theta, q) = \\mathbb{E}_{q(x)} q(y|x) \\log p_\\theta(y|x) \\leq \\mathbb{E}_{q(x)} q(y|x) \\log q(y|x)$. 
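As a small numerical illustration of the Gibbs inequality above (a sketch under our own assumptions, not part of the original experiments): for a hypothetical three-class distribution q, the expected log score of a candidate p should be largest when p = q. The probability values and the helper name expected_log_score are chosen purely for illustration.

```python
import numpy as np

def expected_log_score(p, q):
    # Expected log score E_{y ~ q}[log p(y)] for discrete distributions p and q.
    return float(np.sum(q * np.log(p)))

# Hypothetical true distribution over 3 classes (illustrative values only).
q = np.array([0.7, 0.2, 0.1])
candidates = {
    'truth': q,
    'overconfident': np.array([0.9, 0.05, 0.05]),
    'underconfident': np.array([0.5, 0.3, 0.2]),
    'uniform': np.ones(3) / 3,
}
scores = {name: expected_log_score(p, q) for name, p in candidates.items()}
# Name of the candidate with the highest expected log score.
best = max(scores, key=scores.get)
```

Under these assumed values, the candidate equal to q attains the highest expected score, which is the behaviour a proper scoring rule guarantees.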
In the case of\nmulti-class $K$-way classification, the popular softmax cross entropy loss is equivalent to the log\nlikelihood and is a proper scoring rule. Interestingly, $\\mathcal{L}(\\theta) = -S(p_\\theta, (y, x)) = K^{-1} \\sum_{k=1}^K \\big(\\delta_{k=y} - p_\\theta(y = k|x)\\big)^2$, i.e., minimizing the squared error between the predictive probability of a label and the\none-hot encoding of the correct label, is also a proper scoring rule known as the Brier score [9].\nThis provides justification for this common trick for training NNs by minimizing the squared error\nbetween a binary label and its associated probability and shows it is, in fact, a well-defined loss with\ndesirable properties.1\n2.2.1 Training criterion for regression\nFor regression problems, NNs usually output a single value, say $\\mu(x)$, and the parameters are optimized\nto minimize the mean squared error (MSE) on the training set, given by $\\sum_{n=1}^N (y_n - \\mu(x_n))^2$.\nHowever, the MSE does not capture predictive uncertainty. Following [47], we use a network\nthat outputs two values in the final layer, corresponding to the predicted mean $\\mu(x)$ and variance\n$\\sigma^2(x) > 0$.2 By treating the observed value as a sample from a (heteroscedastic) Gaussian distribution\nwith the predicted mean and variance, we minimize the negative log-likelihood criterion:\n\n$$-\\log p_\\theta(y_n|x_n) = \\frac{\\log \\sigma_\\theta^2(x)}{2} + \\frac{(y - \\mu_\\theta(x))^2}{2\\sigma_\\theta^2(x)} + \\text{constant}. \\qquad (1)$$\n\nWe found the above to perform satisfactorily in our experiments. However, two simple extensions are\nworth further investigation: (i) Maximum likelihood estimation over $\\mu_\\theta(x)$ and $\\sigma_\\theta^2(x)$ might overfit;\none could impose a prior and perform maximum-a-posteriori (MAP) estimation. (ii) In cases where\nthe Gaussian is too restrictive, one could use a more complex distribution, e.g. a 
mixture density network [5]\nor a heavy-tailed distribution.\n\n1 Indeed as noted in Gneiting and Raftery [17], it can be shown that asymptotically maximizing any proper\nscoring rule recovers true parameter values.\n\n2 We enforce the positivity constraint on the variance by passing the second output through the softplus\nfunction $\\log(1 + \\exp(\\cdot))$, and add a minimum variance (e.g. $10^{-6}$) for numerical stability.\n\n2.3 Adversarial training to smooth predictive distributions\nAdversarial examples, proposed by Szegedy et al. [55] and extended by Goodfellow et al. [18], are\nthose which are \u2018close\u2019 to the original training examples (e.g. an image that is visually indistinguishable\nfrom the original image to humans), but are misclassified by the NN. Goodfellow et al.\n[18] proposed the fast gradient sign method as a fast solution to generate adversarial examples.\nGiven an input $x$ with target $y$, and loss $\\ell(\\theta, x, y)$ (e.g. $-\\log p_\\theta(y|x)$), the fast gradient sign method\ngenerates an adversarial example as $x' = x + \\epsilon\\, \\mathrm{sign}\\big(\\nabla_x \\ell(\\theta, x, y)\\big)$, where $\\epsilon$ is a small value such\nthat the max-norm of the perturbation is bounded. Intuitively, the adversarial perturbation creates\na new training example by adding a perturbation along a direction in which the network is likely to\nincrease the loss. Assuming $\\epsilon$ is small enough, these adversarial examples can be used to augment\nthe original training set by treating $(x', y)$ as additional training examples. This procedure, referred\nto as adversarial training,3 was found to improve the classifier\u2019s robustness [18].\nInterestingly, adversarial training can be interpreted as a computationally efficient solution to smooth\nthe predictive distributions by increasing the likelihood of the target around an $\\epsilon$-neighborhood of\nthe observed training examples. 
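The fast gradient sign step described above can be sketched as follows. This is a minimal illustration under assumptions of our own: a hypothetical logistic-regression model stands in for the NN so that the input gradient has a closed form, and the weights, the input and the function names nll and fgsm are illustrative rather than taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, b, x, y):
    # Negative log-likelihood of a binary label y in {0, 1} under a logistic model.
    p = sigmoid(x @ w + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm(w, b, x, y, eps):
    # For the logistic model, the gradient of the NLL w.r.t. the input is (p - y) * w.
    p = sigmoid(x @ w + b)
    grad_x = (p - y) * w
    # Step of size eps along the sign of the input gradient (max-norm bounded).
    return x + eps * np.sign(grad_x)

# Hypothetical parameters and a single training example (assumed values).
w = np.array([1.5, -2.0, 0.5])
b = 0.1
x = np.array([0.2, -0.4, 1.0])
y = 1

x_adv = fgsm(w, b, x, y, eps=0.01)
loss_clean = nll(w, b, x, y)
loss_adv = nll(w, b, x_adv, y)
```

In this toy setting the perturbed input stays within an eps-ball (in max-norm) of the original and has a higher loss; adversarial training would then add (x_adv, y) to the training objective.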
Ideally one would want to smooth the predictive distributions along\nall $2^D$ directions in $\\{1, -1\\}^D$; however this is computationally expensive. A random direction\nmight not necessarily increase the loss; however, adversarial training by definition computes the\ndirection where the loss is high and hence is better than a random direction for smoothing predictive\ndistributions. Miyato et al. [44] proposed a related idea called virtual adversarial training (VAT),\nwhere they picked $\\Delta x = \\arg\\max_{\\Delta x} \\mathrm{KL}\\big(p(y|x) \\,\\|\\, p(y|x + \\Delta x)\\big)$; the advantage of VAT is that\nit does not require knowledge of the true target $y$ and hence can be applied to semi-supervised\nlearning. Miyato et al. [44] showed that distributional smoothing using VAT is beneficial for efficient\nsemi-supervised learning; in contrast, we investigate the use of adversarial training for predictive\nuncertainty estimation. Hence, our contributions are complementary; one could use VAT or other\nforms of adversarial training, cf. [33], for improving predictive uncertainty in the semi-supervised\nsetting as well.\n\n2.4 Ensembles: training and prediction\nThe most popular ensembles use decision trees as the base learners and a wide variety of methods\nhave been explored in the literature on ensembles. Broadly, there are two classes of ensembles:\nrandomization-based approaches such as random forests [8], where the ensemble members can\nbe trained in parallel without any interaction, and boosting-based approaches where the ensemble\nmembers are fit sequentially. We focus only on the randomization based approach as it is better suited\nfor distributed, parallel computation. Breiman [8] showed that the generalization error of random\nforests can be upper bounded by a function of the strength and correlation between individual trees;\nhence it is desirable to use a randomization scheme that de-correlates the predictions of the individual\nmodels as well as ensures that the individual models are strong (e.g. 
high accuracy). One of the\npopular strategies is bagging (a.k.a. bootstrapping), where ensemble members are trained on different\nbootstrap samples of the original training set. If the base learner lacks intrinsic randomization (e.g. it\ncan be trained efficiently by solving a convex optimization problem), bagging is a good mechanism\nfor inducing diversity. However, if the underlying base learner has multiple local optima, as is the\ncase typically with NNs, the bootstrap can sometimes hurt performance since a base learner trained\non a bootstrap sample sees only 63% unique data points.4 In the literature on decision tree ensembles,\nBreiman [8] proposed to use a combination of bagging [7] and random subset selection of features at\neach node. Geurts et al. [16] later showed that bagging is unnecessary if additional randomness can\nbe injected into the random subset selection procedure. Intuitively, using more data for training the\nbase learners helps reduce their bias and ensembling helps reduce the variance.\nWe used the entire training dataset to train each network since deep NNs typically perform better\nwith more data, although it is straightforward to use a random subsample if need be. We found that\nrandom initialization of the NN parameters, along with random shuffling of the data points, was\nsufficient to obtain good performance in practice. We observed that bagging deteriorated performance\nin our experiments. Lee et al. [36] independently observed that training on the entire dataset with\nrandom initialization was better than bagging for deep ensembles; however, their goal was to improve\npredictive accuracy and not predictive uncertainty. The overall training procedure is summarized in\nAlgorithm 1.\n\n3 Not to be confused with Generative Adversarial Networks (GANs).\n4 The bootstrap draws $N$ times uniformly with replacement from a dataset with $N$ items. The probability\nan item is picked at least once is $1 - (1 - 1/N)^N$, which for large $N$ becomes $1 - e^{-1} \\approx 0.632$. Hence, the\nnumber of unique data points in a bootstrap sample is $0.632 \\times N$ on average.\n\nAlgorithm 1 Pseudocode of the training procedure for our method\n1: \u25b7 Let each neural network parametrize a distribution over the outputs, i.e. $p_\\theta(y|x)$. Use a proper\n   scoring rule as the training criterion $\\ell(\\theta, x, y)$. Recommended default values are $M = 5$ and\n   $\\epsilon$ = 1% of the input range of the corresponding dimension (e.g. 2.55 if input range is [0,255]).\n2: Initialize $\\theta_1, \\theta_2, \\ldots, \\theta_M$ randomly\n3: for $m = 1 : M$ do   \u25b7 train networks independently in parallel\n4:   Sample data point $n_m$ randomly for each net   \u25b7 single $n_m$ for clarity, minibatch in practice\n5:   Generate adversarial example using $x'_{n_m} = x_{n_m} + \\epsilon\\, \\mathrm{sign}\\big(\\nabla_{x_{n_m}} \\ell(\\theta_m, x_{n_m}, y_{n_m})\\big)$\n6:   Minimize $\\ell(\\theta_m, x_{n_m}, y_{n_m}) + \\ell(\\theta_m, x'_{n_m}, y_{n_m})$ w.r.t. $\\theta_m$   \u25b7 adversarial training (optional)\n\nWe treat the ensemble as a uniformly-weighted mixture model and combine the predictions as\n$p(y|x) = M^{-1} \\sum_{m=1}^M p_{\\theta_m}(y|x, \\theta_m)$. For classification, this corresponds to averaging the predicted\nprobabilities. For regression, the prediction is a mixture of Gaussian distributions. For ease of\ncomputing quantiles and predictive probabilities, we further approximate the ensemble prediction as a\nGaussian whose mean and variance are respectively the mean and variance of the mixture. The mean\nand variance of a mixture $M^{-1} \\sum \\mathcal{N}\\big(\\mu_{\\theta_m}(x), \\sigma^2_{\\theta_m}(x)\\big)$ are given by $\\mu_*(x) = M^{-1} \\sum_m \\mu_{\\theta_m}(x)$\nand $\\sigma_*^2(x) = M^{-1} \\sum_m \\big(\\sigma^2_{\\theta_m}(x) + \\mu^2_{\\theta_m}(x)\\big) - \\mu_*^2(x)$ respectively.\n\n3 Experimental results\n\n3.1 Evaluation metrics and experimental setup\nFor both classification and regression, we evaluate the negative log likelihood (NLL) which depends\non the predictive uncertainty. 
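To make the regression NLL concrete, the following sketch combines the Gaussian predictions of M networks using the mixture mean and variance from Section 2.4 and scores an observation under the resulting Gaussian. It is an illustration under our own assumptions: the per-network means and variances are made-up numbers, and the function names ensemble_gaussian and gaussian_nll are hypothetical.

```python
import numpy as np

def ensemble_gaussian(mus, variances):
    # mus, variances: arrays whose first axis indexes the M ensemble members.
    mu_star = np.mean(mus, axis=0)
    # Mixture variance: mean of (variance + mean^2) minus the squared mixture mean.
    var_star = np.mean(variances + mus ** 2, axis=0) - mu_star ** 2
    return mu_star, var_star

def gaussian_nll(y, mu, var):
    # Negative log-likelihood of y under a Gaussian with mean mu and variance var.
    return 0.5 * np.log(2.0 * np.pi * var) + 0.5 * (y - mu) ** 2 / var

# Hypothetical predictions of M = 5 networks for a single input (assumed values).
mus = np.array([1.0, 1.2, 0.8, 1.1, 0.9])
variances = np.array([0.20, 0.25, 0.15, 0.30, 0.10])

mu_star, var_star = ensemble_gaussian(mus, variances)
nll_at_mean = gaussian_nll(mu_star, mu_star, var_star)
nll_far = gaussian_nll(mu_star + 2.0, mu_star, var_star)
```

The combined variance is the average predicted variance plus the spread of the predicted means, so disagreement between networks widens the predictive distribution.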
NLL is a proper scoring rule and a popular metric for evaluating\npredictive uncertainty [49]. For classification we additionally measure classification accuracy and\nthe Brier score, defined as $BS = K^{-1} \\sum_{k=1}^K \\big(t^*_k - p(y = k|x^*)\\big)^2$ where $t^*_k = 1$ if $k = y^*$, and 0\notherwise. For regression problems, we additionally measured the root mean squared error (RMSE).\nUnless otherwise specified, we used batch size of 100 and Adam optimizer with fixed learning rate of\n0.1 in our experiments. We use the same technique for generating adversarial training examples for\nregression problems. Goodfellow et al. [18] used a fixed $\\epsilon$ for all dimensions; this is unsatisfying\nif the input dimensions have different ranges. Hence, in all of our experiments, we set $\\epsilon$ to 0.01\ntimes the range of the training data along that particular dimension. We used the default weight\ninitialization in Torch.\n\n3.2 Regression on toy datasets\nFirst, we qualitatively evaluate the performance of the proposed method on a one-dimensional toy\nregression dataset. This dataset was used by Hern\u00e1ndez-Lobato and Adams [24], and consists of 20\ntraining examples drawn as $y = x^3 + \\epsilon$ where $\\epsilon \\sim \\mathcal{N}(0, 3^2)$. We used the same architecture as [24].\nA commonly used heuristic in practice is to use an ensemble of NNs (trained to minimize MSE),\nobtain multiple point predictions and use the empirical variance of the networks\u2019 predictions as an\napproximate measure of uncertainty. 
We demonstrate that this is inferior to learning the variance by\ntraining using NLL.5 The results are shown in Figure 1.\nThe results clearly demonstrate that (i) learning variance and training using a scoring rule (NLL) leads\nto improved predictive uncertainty and (ii) ensemble combination improves performance, especially\nas we move farther from the observed training data.\n\n5 See also Appendix A.2 for calibration results on a real world dataset.\n\nFigure 1: Results on a toy regression task: x-axis denotes x. On the y-axis, the blue line is the ground\ntruth curve, the red dots are observed noisy training data points and the gray lines correspond to\nthe predicted mean along with three standard deviations. Left most plot corresponds to empirical\nvariance of 5 networks trained using MSE, second plot shows the effect of training using NLL using\na single net, third plot shows the additional effect of adversarial training, and final plot shows the\neffect of using an ensemble of 5 networks respectively.\n\n3.3 Regression on real world datasets\nIn our next experiment, we compare our method to state-of-the-art methods for predictive uncertainty\nestimation using NNs on regression tasks. We use the experimental setup proposed by Hern\u00e1ndez-Lobato\nand Adams [24] for evaluating probabilistic backpropagation (PBP), which was also used\nby Gal and Ghahramani [15] to evaluate MC-dropout.6 Each dataset is split into 20 train-test folds,\nexcept for the protein dataset which uses 5 folds and the Year Prediction MSD dataset which uses\na single train-test split. We use the identical network architecture: 1-hidden layer NN with ReLU\nnonlinearity [45], containing 50 hidden units for smaller datasets and 100 hidden units for the larger\nprotein and Year Prediction MSD datasets. We trained for 40 epochs; we refer to [24] for further\ndetails about the datasets and the experimental protocol. We used 5 networks in our ensemble. 
Our\nresults are shown in Table 1, along with the PBP and MC-dropout results reported in their respective\npapers.\n\nDatasets | RMSE: PBP | RMSE: MC-dropout | RMSE: Deep Ensembles | NLL: PBP | NLL: MC-dropout | NLL: Deep Ensembles\nBoston housing | 3.01 \u00b1 0.18 | 2.97 \u00b1 0.85 | 3.28 \u00b1 1.00 | 2.57 \u00b1 0.09 | 2.46 \u00b1 0.25 | 2.41 \u00b1 0.25\nConcrete | 5.67 \u00b1 0.09 | 5.23 \u00b1 0.53 | 6.03 \u00b1 0.58 | 3.16 \u00b1 0.02 | 3.04 \u00b1 0.09 | 3.06 \u00b1 0.18\nEnergy | 1.80 \u00b1 0.05 | 1.66 \u00b1 0.19 | 2.09 \u00b1 0.29 | 2.04 \u00b1 0.02 | 1.99 \u00b1 0.09 | 1.38 \u00b1 0.22\nKin8nm | 0.10 \u00b1 0.00 | 0.10 \u00b1 0.00 | 0.09 \u00b1 0.00 | -0.90 \u00b1 0.01 | -0.95 \u00b1 0.03 | -1.20 \u00b1 0.02\nNaval propulsion plant | 0.01 \u00b1 0.00 | 0.01 \u00b1 0.00 | 0.00 \u00b1 0.00 | -3.73 \u00b1 0.01 | -3.80 \u00b1 0.05 | -5.63 \u00b1 0.05\nPower plant | 4.12 \u00b1 0.03 | 4.02 \u00b1 0.18 | 4.11 \u00b1 0.17 | 2.84 \u00b1 0.01 | 2.80 \u00b1 0.05 | 2.79 \u00b1 0.04\nProtein | 4.73 \u00b1 0.01 | 4.36 \u00b1 0.04 | 4.71 \u00b1 0.06 | 2.97 \u00b1 0.00 | 2.89 \u00b1 0.01 | 2.83 \u00b1 0.02\nWine | 0.64 \u00b1 0.01 | 0.62 \u00b1 0.04 | 0.64 \u00b1 0.04 | 0.97 \u00b1 0.01 | 0.93 \u00b1 0.06 | 0.94 \u00b1 0.12\nYacht | 1.02 \u00b1 0.05 | 1.11 \u00b1 0.38 | 1.58 \u00b1 0.48 | 1.63 \u00b1 0.02 | 1.55 \u00b1 0.12 | 1.18 \u00b1 0.21\nYear Prediction MSD | 8.88 \u00b1 NA | 8.85 \u00b1 NA | 8.89 \u00b1 NA | 3.60 \u00b1 NA | 3.59 \u00b1 NA | 3.35 \u00b1 NA\n\nTable 1: Results on regression benchmark datasets comparing RMSE and NLL. See Table 2 for\nresults on variants of our method.\n\nWe observe that our method outperforms (or is competitive with) existing methods in terms of NLL.\nOn some datasets, we observe that our method is slightly worse in terms of RMSE. We believe that\nthis is due to the fact that our method optimizes for NLL (which captures predictive uncertainty)\ninstead of MSE. 
Table 2 in Appendix A.1 reports additional results on variants of our method,\ndemonstrating the advantage of using an ensemble as well as learning variance.\n\n3.4 Classification on MNIST, SVHN and ImageNet\nNext we evaluate the performance on classification tasks using MNIST and SVHN datasets. Our goal\nis not to achieve the state-of-the-art performance on these problems, but rather to evaluate the effect\nof adversarial training as well as the number of networks in the ensemble. To verify if adversarial\ntraining helps, we also include a baseline which picks a random signed vector. For MNIST, we used\nan MLP with 3 hidden layers with 200 hidden units per layer and ReLU non-linearities with batch\nnormalization. For MC-dropout, we added dropout after each non-linearity with 0.1 as the dropout\nrate.7 Results are shown in Figure 2(a). We observe that adversarial training and increasing the\nnumber of networks in the ensemble significantly improve performance in terms of both classification\naccuracy as well as NLL and Brier score, illustrating that our method produces well-calibrated\nuncertainty estimates. Adversarial training leads to better performance than augmenting with a random\ndirection. Our method also performs much better than MC-dropout in terms of all the performance\nmeasures. 
Note that augmenting the training dataset with invariances (such as random crop and\nhorizontal flips) is complementary to adversarial training and can potentially improve performance.\n\n6 We do not compare to VI [19] as PBP and MC-dropout outperform VI on these benchmarks.\n7 We also tried dropout rate of 0.5, but that performed worse.\n\n(a) MNIST dataset using 3-layer MLP\n\n(b) SVHN using VGG-style convnet\n\nFigure 2: Evaluating predictive uncertainty as a function of ensemble size M (number of networks\nin the ensemble or the number of MC-dropout samples): Ensemble variants significantly outperform\nMC-dropout performance with the corresponding M in terms of all 3 metrics. Adversarial training\nimproves results for MNIST for all M and SVHN when M = 1, but the effect drops as M increases.\n\nTo measure the sensitivity of the results to the choice of network architecture, we experimented\nwith a two-layer MLP as well as a convolutional NN; we observed qualitatively similar results; see\nAppendix B.1 in the supplementary material for details.\nWe also report results on the SVHN dataset using a VGG-style convolutional NN.8 The results are\nin Figure 2(b). Ensembles outperform MC dropout. Adversarial training helps slightly for M = 1,\nhowever the effect drops as the number of networks in the ensemble increases. If the classes are\nwell-separated, adversarial training might not change the classification boundary significantly. It is\nnot clear if this is the case here; further investigation is required.\nFinally, we evaluate on the ImageNet (ILSVRC-2012) dataset [51] using the inception network [56].\nDue to computational constraints, we only evaluate the effect of ensembles on this dataset. The\nresults on ImageNet (single-crop evaluation) are shown in Table 4. 
We observe that as M increases,\nboth the accuracy and the quality of predictive uncertainty improve significantly.\nAnother advantage of using an ensemble is that it enables us to easily identify training examples\nwhere the individual networks disagree or agree the most. This disagreement9 provides another\nuseful qualitative way to evaluate predictive uncertainty. Figures 10 and 11 in Appendix B.2 report\nqualitative evaluation of predictive uncertainty on the MNIST dataset.\n\n3.5 Uncertainty evaluation: test examples from known vs unknown classes\nIn the final experiment, we evaluate uncertainty on out-of-distribution examples from unseen classes.\nOverconfident predictions on unseen classes pose a challenge for reliable deployment of deep learning\nmodels in real world applications. We would like the predictions to exhibit higher uncertainty when\nthe test data is very different from the training data. To test if the proposed method possesses this\ndesirable property, we train an MLP on the standard MNIST train/test split using the same architecture\nas before. However, in addition to the regular test set with known classes, we also evaluate it on a\ntest set containing unknown classes. We used the test split of the NotMNIST10 dataset. The images\nin this dataset have the same size as MNIST, however the labels are letters instead of digits. We\ndo not have access to the true conditional probabilities, but we expect the predictions to be closer\nto uniform on unseen classes compared to the known classes where the predictive probabilities\nshould concentrate on the true targets. We evaluate the entropy of the predictive distribution and\nuse this to evaluate the quality of the uncertainty estimates. The results are shown in Figure 3(a).\nFor known classes (top row), both our method and MC-dropout have low entropy as expected. 
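The entropy measure used here can be sketched as follows, assuming the ensemble prediction is the average of the per-network class probabilities as in Section 2.4. The probability vectors and the function name predictive_entropy are hypothetical and serve only to show that agreeing, confident networks yield low entropy while disagreeing networks yield an entropy close to that of the uniform distribution.

```python
import numpy as np

def predictive_entropy(probs):
    # probs: array of shape (M, K) holding class probabilities from M networks.
    p = np.mean(probs, axis=0)  # ensemble prediction: average of probabilities
    # Small constant guards against log(0); it is an implementation choice of this sketch.
    return float(-np.sum(p * np.log(p + 1e-12)))

# Hypothetical case 1: three networks agree confidently on class 0.
agree = np.array([[0.98, 0.01, 0.01]] * 3)
# Hypothetical case 2: each network is confident in a different class.
disagree = np.array([
    [0.98, 0.01, 0.01],
    [0.01, 0.98, 0.01],
    [0.01, 0.01, 0.98],
])

h_agree = predictive_entropy(agree)
h_disagree = predictive_entropy(disagree)
```

In the second case the averaged prediction is uniform over the three classes, so its entropy is approximately log 3, the maximum for K = 3.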
For\nunknown classes (bottom row), as M increases, the entropy of deep ensembles increases much faster\nthan MC-dropout indicating that our method is better suited for handling unseen test examples. In\nparticular, MC-dropout seems to give high confidence predictions for some of the test examples, as\nevidenced by the mode around 0 even for unseen classes. Such overconfident wrong predictions can\nbe problematic in practice when tested on a mixture of known and unknown classes, as we will see in\nSection 3.6. Comparing different variants of our method, the mode for adversarial training increases\nslightly faster than the mode for vanilla ensembles indicating that adversarial training is beneficial\nfor quantifying uncertainty on unseen classes.\n\nFootnote 8: The architecture is similar to the one described in http://torch.ch/blog/2015/07/30/cifar.html.\n\nFootnote 9: More precisely, we define disagreement as $\\sum_{m=1}^{M} \\mathrm{KL}(p_{\\theta_m}(y|x) \\,\\|\\, p_E(y|x))$, where KL denotes the\nKullback-Leibler divergence and $p_E(y|x) = M^{-1} \\sum_m p_{\\theta_m}(y|x)$ is the prediction of the ensemble.\n\nFootnote 10: Available at http://yaroslavvb.blogspot.co.uk/2011/09/notmnist-dataset.html\n\n[Figure 2 plots: classification error, NLL and Brier score versus number of nets, for Ensemble, Ensemble + R, Ensemble + AT and MC dropout.]\n\nWe qualitatively evaluate results in Figures 12(a)\nand 12(b) in Appendix B.2. 
Figure 12(a) shows that the ensemble agreement is highest for letter \u2018I\u2019\nwhich resembles 1 in the MNIST training dataset, and that the ensemble disagreement is higher for\nexamples visually different from the MNIST training dataset.\n\n(a) MNIST-NotMNIST\n\n(b) SVHN-CIFAR10\n\nFigure 3: Histogram of the predictive entropy on test examples from known classes (top row) and\nunknown classes (bottom row), as we vary ensemble size M.\n\nWe ran a similar experiment, training on SVHN and testing on the CIFAR-10 [31] test set. Both datasets\ncontain 32 \u00d7 32 \u00d7 3 images; however, SVHN contains images of digits whereas CIFAR-10 contains\nimages of object categories. The results are shown in Figure 3(b). As in the MNIST-NotMNIST\nexperiment, we observe that MC-dropout produces overconfident predictions on unseen examples,\nwhereas our method produces higher uncertainty on unseen classes.\nFinally, we test on ImageNet by splitting the training set by categories. We split the dataset into\nimages of dogs (known classes) and non-dogs (unknown classes), following Vinyals et al. [58] who\nproposed this setup for a different task. Figure 5 shows the histogram of the predictive entropy as\nwell as the maximum predicted probability (i.e., the confidence in the predicted class). We observe that\nthe predictive uncertainty improves on unseen classes as the ensemble size increases.\n\n3.6 Accuracy as a function of confidence\nIn practical applications, it is highly desirable for a system to avoid overconfident, incorrect predictions\nand fail gracefully. To evaluate the usefulness of predictive uncertainty for decision making, we\nconsider a task where the model is evaluated only on cases where the model\u2019s confidence is above a\nuser-specified threshold. If the confidence estimates are well-calibrated, one can trust the model\u2019s\npredictions when the reported confidence is high and resort to a different solution (e.g. 
use a human in\nthe loop, or use the prediction from a simpler model) when the model is not confident.\nWe re-use the results from the experiment in the previous section where we trained a network on\nMNIST and test it on a mix of test examples from MNIST (known classes) and NotMNIST (unknown\nclasses). The network will produce incorrect predictions on out-of-distribution examples; however, we\nwould like these predictions to have low confidence. Given the prediction p(y = k|x), we define the\npredicted label as \u0177 = arg max_k p(y = k|x), and the confidence as p(y = \u0177|x) = max_k p(y = k|x).\nWe filter the test examples by a confidence threshold 0 \u2264 \u03c4 \u2264 1, keeping those with p(y = \u0177|x) \u2265 \u03c4, and plot the\naccuracy on the retained examples. The confidence vs accuracy results are shown in Figure 6. If we look at\ncases only where the confidence is \u2265 90%, we expect higher accuracy than cases where the confidence is\n\u2265 80%, hence the curve should be monotonically increasing. If the application demands an accuracy of\nx%, we can trust the model only in cases where the confidence is greater than the corresponding\nthreshold. Hence, we can compare accuracy of the models for a desired confidence threshold of the\napplication. MC-dropout can produce overconfident wrong predictions as evidenced by low accuracy\neven for high values of \u03c4, whereas deep ensembles are significantly more robust.\n\nFigure 4: Results on ImageNet: Deep Ensembles lead to lower classification error as well as better predictive uncertainty as evidenced by lower NLL and Brier score.\n\nM    Top-1 error (%)    Top-5 error (%)    NLL    Brier score (\u00d710^-3)\n1    22.166    6.129    0.959    0.317\n2    20.462    5.274    0.867    0.294\n3    19.709    4.955    0.836    0.286\n4    19.334    4.723    0.818    0.282\n5    19.104    4.637    0.809    0.280\n6    18.986    4.532    0.803    0.278\n7    18.860    4.485    0.797    0.277\n8    18.771    4.430    0.794    0.276\n9    18.728    4.373    0.791    0.276\n10    18.675    4.364    0.789    0.275\n\nFigure 5: ImageNet trained only on dogs: Histogram of the predictive entropy (left) and maximum predicted probability (right) on test examples from known classes (dogs) and unknown classes (non-dogs), as we vary the ensemble size.\n\n[Figure 3 panels: histograms over entropy values for Ensemble, Ensemble + R, Ensemble + AT and MC dropout, with curves for M = 1, 5, 10.]\n\nFigure 6: Accuracy vs Confidence curves: Networks trained on MNIST and tested on both the MNIST\ntest set containing known classes and the NotMNIST dataset containing unseen classes. MC-dropout can\nproduce overconfident wrong predictions, whereas deep ensembles are significantly more robust.\n\n4 Discussion\nWe have proposed a simple and scalable non-Bayesian solution that provides a very strong baseline\non evaluation metrics for predictive uncertainty quantification. Intuitively, our method captures two\nsources of uncertainty. Training a probabilistic NN p_\u03b8(y|x) using proper scoring rules as training\nobjectives captures ambiguity in targets y for a given x. 
In addition, our method uses a combination\nof ensembles (which capture \u201cmodel uncertainty\u201d by averaging predictions over multiple models\nconsistent with the training data), and adversarial training (which encourages local smoothness),\nfor robustness to model misspecification and out-of-distribution examples. Ensembles, even for\nM = 5, significantly improve uncertainty quality in all the cases. Adversarial training helps on\nsome datasets for some metrics and is not strictly necessary in all cases. Our method requires very\nlittle hyperparameter tuning, is well suited for large scale distributed computation, and can be\nreadily implemented for a wide variety of architectures such as MLPs, CNNs, etc., including those\nwhich do not use dropout, e.g. residual networks [22]. It is perhaps surprising to the Bayesian deep\nlearning community that a non-Bayesian (yet probabilistic) approach can perform as well as Bayesian\nNNs. We hope that our work will encourage the community to consider non-Bayesian approaches\n(such as ensembles) and other interesting evaluation metrics for predictive uncertainty. Concurrent\nwith our work, Hendrycks and Gimpel [23] and Guo et al. [20] have also independently shown that\nnon-Bayesian solutions can produce good predictive uncertainty estimates on some tasks. Abbasi\nand Gagn\u00e9 [1] and Tram\u00e8r et al. [57] have also explored ensemble-based solutions to tackle adversarial\nexamples, a particularly hard case of out-of-distribution examples.\nThere are several avenues for future work. We focused on training independent networks as training\ncan be trivially parallelized. Explicitly de-correlating networks\u2019 predictions, e.g. as in [37], might\npromote ensemble diversity and improve performance even further. Optimizing the ensemble weights,\nas in stacking [60] or adaptive mixture of experts [28], can further improve the performance. 
The\nensemble has M times more parameters than a single network; for memory-constrained applications,\nthe ensemble can be distilled into a simpler model [10, 26]. It would also be interesting to investigate\nso-called implicit ensembles where the ensemble members share parameters, e.g. using multiple\nheads [36, 48], snapshot ensembles [27] or swapout [52].\n\n[Figure 6 plot: accuracy on examples with p(y|x) \u2265 \u03c4 versus confidence threshold \u03c4, for Ensemble, Ensemble + R, Ensemble + AT and MC dropout.]\n\nAcknowledgments\nWe would like to thank Samuel Ritter and Oriol Vinyals for help with ImageNet experiments, and\nDaan Wierstra, David Silver, David Barrett, Ian Osband, Martin Szummer, Peter Dayan, Shakir\nMohamed, Theophane Weber, Ulrich Paquet and the anonymous reviewers for helpful feedback.\n\nReferences\n[1] M. Abbasi and C. Gagn\u00e9. Robustness to adversarial examples through an ensemble of specialists. arXiv preprint arXiv:1702.06856, 2017.\n[2] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature biotechnology, 33(8):831\u2013838, 2015.\n[3] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Man\u00e9. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.\n[4] J. M. Bernardo and A. F. Smith. Bayesian Theory, volume 405. John Wiley & Sons, 2009.\n[5] C. M. Bishop. Mixture density networks. 1994.\n[6] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In ICML, 2015.\n[7] L. Breiman. Bagging predictors. Machine learning, 24(2):123\u2013140, 1996.\n[8] L. Breiman. Random forests. Machine learning, 45(1):5\u201332, 2001.\n[9] G. W. Brier. Verification of forecasts expressed in terms of probability. Monthly weather review, 1950.\n[10] C. Bucila, R. Caruana, and A. Niculescu-Mizil. Model compression. In KDD. ACM, 2006.\n[11] B. Clarke. 
Comparing Bayes model averaging and stacking when model approximation error cannot be ignored. J. Mach. Learn. Res. (JMLR), 4:683\u2013712, 2003.\n[12] A. P. Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association, 1982.\n[13] M. H. DeGroot and S. E. Fienberg. The comparison and evaluation of forecasters. The statistician, 1983.\n[14] T. G. Dietterich. Ensemble methods in machine learning. In Multiple classifier systems. 2000.\n[15] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.\n[16] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine learning, 63(1):3\u201342, 2006.\n[17] T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359\u2013378, 2007.\n[18] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.\n[19] A. Graves. Practical variational inference for neural networks. In NIPS, 2011.\n[20] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.\n[21] L. Hasenclever, S. Webb, T. Lienart, S. Vollmer, B. Lakshminarayanan, C. Blundell, and Y. W. Teh. Distributed Bayesian learning with stochastic natural-gradient expectation propagation and the posterior server. arXiv preprint arXiv:1512.09327, 2015.\n[22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770\u2013778, 2016.\n[23] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.\n[24] J. M. Hern\u00e1ndez-Lobato and R. P. Adams. 
Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML, 2015.\n[25] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82\u201397, 2012.\n[26] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.\n[27] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger. Snapshot ensembles: Train 1, get M for free. ICLR submission, 2017.\n[28] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79\u201387, 1991.\n[29] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In NIPS, 2015.\n[30] A. Korattikara, V. Rathod, K. Murphy, and M. Welling. Bayesian dark knowledge. In NIPS, 2015.\n[31] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.\n[32] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.\n[33] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.\n[34] B. Lakshminarayanan. Decision trees and forests: a probabilistic perspective. PhD thesis, UCL (University College London), 2016.\n[35] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436\u2013444, 2015.\n[36] S. Lee, S. Purushwalkam, M. Cogswell, D. Crandall, and D. Batra. Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314, 2015.\n[37] S. Lee, S. P. S. Prakash, M. Cogswell, V. Ranjan, D. Crandall, and D. Batra. 
Stochastic multiple choice learning for training diverse deep ensembles. In NIPS, 2016.\n[38] Y. Li, J. M. Hern\u00e1ndez-Lobato, and R. E. Turner. Stochastic expectation propagation. In NIPS, 2015.\n[39] C. Louizos and M. Welling. Structured and efficient variational deep learning with matrix Gaussian posteriors. arXiv preprint arXiv:1603.04733, 2016.\n[40] D. J. MacKay. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992.\n[41] S.-i. Maeda. A Bayesian encourages dropout. arXiv preprint arXiv:1412.7003, 2014.\n[42] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.\n[43] T. P. Minka. Bayesian model averaging is not model combination. 2000.\n[44] T. Miyato, S.-i. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional smoothing by virtual adversarial examples. In ICLR, 2016.\n[45] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.\n[46] R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, Inc., 1996.\n[47] D. A. Nix and A. S. Weigend. Estimating the mean and variance of the target probability distribution. In IEEE International Conference on Neural Networks, 1994.\n[48] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped DQN. In NIPS, 2016.\n[49] J. Quinonero-Candela, C. E. Rasmussen, F. Sinz, O. Bousquet, and B. Sch\u00f6lkopf. Evaluating predictive uncertainty challenge. In Machine Learning Challenges. Springer, 2006.\n[50] C. E. Rasmussen and J. Quinonero-Candela. Healing the relevance vector machine through augmentation. In ICML, 2005.\n[51] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. 
International Journal of Computer Vision (IJCV), 115(3):211\u2013252, 2015.\n[52] S. Singh, D. Hoiem, and D. Forsyth. Swapout: Learning an ensemble of deep architectures. In NIPS, 2016.\n[53] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems, pages 4134\u20134142, 2016.\n[54] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.\n[55] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In ICLR, 2014.\n[56] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818\u20132826, 2016.\n[57] F. Tram\u00e8r, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.\n[58] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.\n[59] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.\n[60] D. H. Wolpert. Stacked generalization. Neural networks, 5(2):241\u2013259, 1992.\n[61] J. Zhou and O. G. Troyanskaya. Predicting effects of noncoding variants with deep learning-based sequence model. Nature methods, 12(10):931\u2013934, 2015.\n", "award": [], "sourceid": 3205, "authors": [{"given_name": "Balaji", "family_name": "Lakshminarayanan", "institution": "Google Deepmind"}, {"given_name": "Alexander", "family_name": "Pritzel", "institution": "Google Deepmind"}, {"given_name": "Charles", "family_name": "Blundell", "institution": "DeepMind"}]}