{"title": "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?", "book": "Advances in Neural Information Processing Systems", "page_first": 5574, "page_last": 5584, "abstract": "There are two major types of uncertainty one can model. Aleatoric uncertainty captures noise inherent in the observations. On the other hand, epistemic uncertainty accounts for uncertainty in the model - uncertainty which can be explained away given enough data. Traditionally it has been difficult to model epistemic uncertainty in computer vision, but with new Bayesian deep learning tools this is now possible. We study the benefits of modeling epistemic vs. aleatoric uncertainty in Bayesian deep learning models for vision tasks. For this we present a Bayesian deep learning framework combining input-dependent aleatoric uncertainty together with epistemic uncertainty. We study models under the framework with per-pixel semantic segmentation and depth regression tasks. Further, our explicit uncertainty formulation leads to new loss functions for these tasks, which can be interpreted as learned attenuation. This makes the loss more robust to noisy data, also giving new state-of-the-art results on segmentation and depth regression benchmarks.", "full_text": "What Uncertainties Do We Need in Bayesian Deep\n\nLearning for Computer Vision?\n\nAlex Kendall\n\nUniversity of Cambridge\nagk34@cam.ac.uk\n\nYarin Gal\n\nUniversity of Cambridge\nyg279@cam.ac.uk\n\nAbstract\n\nThere are two major types of uncertainty one can model. Aleatoric uncertainty\ncaptures noise inherent in the observations. On the other hand, epistemic uncer-\ntainty accounts for uncertainty in the model \u2013 uncertainty which can be explained\naway given enough data. Traditionally it has been dif\ufb01cult to model epistemic\nuncertainty in computer vision, but with new Bayesian deep learning tools this\nis now possible. We study the bene\ufb01ts of modeling epistemic vs. 
aleatoric uncertainty in Bayesian deep learning models for vision tasks. For this we present a Bayesian deep learning framework combining input-dependent aleatoric uncertainty together with epistemic uncertainty. We study models under the framework with per-pixel semantic segmentation and depth regression tasks. Further, our explicit uncertainty formulation leads to new loss functions for these tasks, which can be interpreted as learned attenuation. This makes the loss more robust to noisy data, also giving new state-of-the-art results on segmentation and depth regression benchmarks.\n\n1 Introduction\n\nUnderstanding what a model does not know is a critical part of many machine learning systems. Today, deep learning algorithms are able to learn powerful representations which can map high dimensional data to an array of outputs. However these mappings are often taken blindly and assumed to be accurate, which is not always the case. In two recent examples this has had disastrous consequences. In May 2016 there was the first fatality from an assisted driving system, caused by the perception system confusing the white side of a trailer for bright sky [1]. In a second recent example, an image classification system erroneously identified two African Americans as gorillas [2], raising concerns of racial discrimination. If both these algorithms were able to assign a high level of uncertainty to their erroneous predictions, then the system may have been able to make better decisions and likely avoid disaster.\n\nQuantifying uncertainty in computer vision applications can be largely divided into regression settings such as depth regression, and classification settings such as semantic segmentation. Existing approaches to model uncertainty in such settings in computer vision include particle filtering and conditional random fields [3, 4]. 
However many modern applications mandate the use of deep learning to achieve state-of-the-art performance [5], with most deep learning models not able to represent uncertainty. Deep learning does not allow for uncertainty representation in regression settings for example, and deep learning classification models often give normalised score vectors, which do not necessarily capture model uncertainty. For both settings uncertainty can be captured with Bayesian deep learning approaches – which offer a practical framework for understanding uncertainty with deep learning models [6].\n\nIn Bayesian modeling, there are two main types of uncertainty one can model [7]. Aleatoric uncertainty captures noise inherent in the observations. This could be for example sensor noise or motion noise, resulting in uncertainty which cannot be reduced even if more data were to be collected. On the other hand, epistemic uncertainty accounts for uncertainty in the model parameters – uncertainty\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n(a) Input Image (b) Ground Truth (c) Semantic Segmentation (d) Aleatoric Uncertainty (e) Epistemic Uncertainty\n\nFigure 1: Illustrating the difference between aleatoric and epistemic uncertainty for semantic segmentation on the CamVid dataset [8]. Aleatoric uncertainty captures noise inherent in the observations. In (d) our model exhibits increased aleatoric uncertainty on object boundaries and for objects far from the camera. Epistemic uncertainty accounts for our ignorance about which model generated our collected data. This is a notably different measure of uncertainty and in (e) our model exhibits increased epistemic uncertainty for semantically and visually challenging pixels. 
The bottom row shows a failure case of the segmentation model when the model fails to segment the footpath due to increased epistemic uncertainty, but not aleatoric uncertainty.\n\nwhich captures our ignorance about which model generated our collected data. This uncertainty can be explained away given enough data, and is often referred to as model uncertainty. Aleatoric uncertainty can further be categorized into homoscedastic uncertainty, uncertainty which stays constant for different inputs, and heteroscedastic uncertainty. Heteroscedastic uncertainty depends on the inputs to the model, with some inputs potentially having more noisy outputs than others. Heteroscedastic uncertainty is especially important for computer vision applications. For example, for depth regression, highly textured input images with strong vanishing lines are expected to result in confident predictions, whereas an input image of a featureless wall is expected to have very high uncertainty.\n\nIn this paper we make the observation that in many big data regimes (such as the ones common to deep learning with image data), it is most effective to model aleatoric uncertainty, uncertainty which cannot be explained away. This is in comparison to epistemic uncertainty which is mostly explained away with the large amounts of data often available in machine vision. We further show that modeling aleatoric uncertainty alone comes at a cost. Out-of-data examples, which can be identified with epistemic uncertainty, cannot be identified with aleatoric uncertainty alone.\n\nFor this we present a unified Bayesian deep learning framework which allows us to learn mappings from input data to aleatoric uncertainty and compose these together with epistemic uncertainty approximations. 
We derive our framework for both regression and classification applications and present results for per-pixel depth regression and semantic segmentation tasks (see Figure 1 and the supplementary video for examples). We show how modeling aleatoric uncertainty in regression can be used to learn loss attenuation, and develop a complementary approach for the classification case. This demonstrates the efficacy of our approach on difficult and large scale tasks.\n\nThe main contributions of this work are:\n\n1. We capture an accurate understanding of aleatoric and epistemic uncertainties, in particular with a novel approach for classification,\n\n2. We improve model performance by 1–3% over non-Bayesian baselines by reducing the effect of noisy data with the implied attenuation obtained from explicitly representing aleatoric uncertainty,\n\n3. We study the trade-offs between modeling aleatoric or epistemic uncertainty by characterizing the properties of each uncertainty and comparing model performance and inference time.\n\n2 Related Work\n\nExisting approaches to Bayesian deep learning capture either epistemic uncertainty alone, or aleatoric uncertainty alone [6]. These uncertainties are formalised as probability distributions over either the model parameters, or model outputs, respectively. Epistemic uncertainty is modeled by placing a prior distribution over a model's weights, and then trying to capture how much these weights vary given some data. Aleatoric uncertainty on the other hand is modeled by placing a distribution over the output of the model. For example, in regression our outputs might be modeled as corrupted with Gaussian random noise. In this case we are interested in learning the noise's variance as a function of different inputs (such noise can also be modeled with a constant value for all data points, but this is of less practical interest). 
These uncertainties, in the context of Bayesian deep learning, are explained in more detail in this section.\n\n2.1 Epistemic Uncertainty in Bayesian Deep Learning\n\nTo capture epistemic uncertainty in a neural network (NN) we put a prior distribution over its weights, for example a Gaussian prior distribution: W ∼ N(0, I).\n\nSuch a model is referred to as a Bayesian neural network (BNN) [9–11]. Bayesian neural networks replace the deterministic network's weight parameters with distributions over these parameters, and instead of optimising the network weights directly we average over all possible weights (referred to as marginalisation). Denoting the random output of the BNN as f^{W}(x), we define the model likelihood p(y | f^{W}(x)). Given a dataset X = {x_1, ..., x_N}, Y = {y_1, ..., y_N}, Bayesian inference is used to compute the posterior over the weights p(W | X, Y). This posterior captures the set of plausible model parameters, given the data.\n\nFor regression tasks we often define our likelihood as a Gaussian with mean given by the model output: p(y | f^{W}(x)) = N(f^{W}(x), σ²), with an observation noise scalar σ. For classification, on the other hand, we often squash the model output through a softmax function, and sample from the resulting probability vector: p(y | f^{W}(x)) = Softmax(f^{W}(x)).\n\nBNNs are easy to formulate, but difficult to perform inference in. This is because the marginal probability p(Y | X), required to evaluate the posterior p(W | X, Y) = p(Y | X, W) p(W) / p(Y | X), cannot be evaluated analytically. Different approximations exist [12–15]. In these approximate inference techniques, the posterior p(W | X, Y) is fitted with a simple distribution q*_θ(W), parameterised by θ. 
This replaces the intractable problem of averaging over all weights in the BNN with an optimisation task, where we seek to optimise over the parameters of the simple distribution instead of optimising the original neural network's parameters.\n\nDropout variational inference is a practical approach for approximate inference in large and complex models [15]. This inference is done by training a model with dropout before every weight layer, and by also performing dropout at test time to sample from the approximate posterior (stochastic forward passes, referred to as Monte Carlo dropout). More formally, this approach is equivalent to performing approximate variational inference where we find a simple distribution q*_θ(W) in a tractable family which minimises the Kullback-Leibler (KL) divergence to the true model posterior p(W | X, Y). Dropout can be interpreted as a variational Bayesian approximation, where the approximating distribution is a mixture of two Gaussians with small variances and the mean of one of the Gaussians is fixed at zero. The minimisation objective is given by [16]:\n\nL(θ, p) = −(1/N) Σ_{i=1}^{N} log p(y_i | f^{Ŵ_i}(x_i)) + ((1 − p)/(2N)) ||θ||²    (1)\n\nwith N data points, dropout probability p, samples Ŵ_i ∼ q*_θ(W), and θ the set of the simple distribution's parameters to be optimised (weight matrices in dropout's case). In regression, for example, the negative log likelihood can be further simplified as\n\n−log p(y_i | f^{Ŵ_i}(x_i)) ∝ (1/(2σ²)) ||y_i − f^{Ŵ_i}(x_i)||² + (1/2) log σ²    (2)\n\nfor a Gaussian likelihood, with σ the model's observation noise parameter – capturing how much noise we have in the outputs.\n\nEpistemic uncertainty in the weights can be reduced by observing more data. 
This uncertainty induces prediction uncertainty by marginalising over the (approximate) weights posterior distribution. For classification this can be approximated using Monte Carlo integration as follows:\n\np(y = c | x, X, Y) ≈ (1/T) Σ_{t=1}^{T} Softmax(f^{Ŵ_t}(x))    (3)\n\nwith T sampled masked model weights Ŵ_t ∼ q*_θ(W), where q_θ(W) is the Dropout distribution [6]. The uncertainty of this probability vector p can then be summarised using the entropy of the probability vector: H(p) = −Σ_{c=1}^{C} p_c log p_c. For regression this epistemic uncertainty is captured by the predictive variance, which can be approximated as:\n\nVar(y) ≈ σ² + (1/T) Σ_{t=1}^{T} f^{Ŵ_t}(x)ᵀ f^{Ŵ_t}(x) − E(y)ᵀ E(y)    (4)\n\nwith predictions in this epistemic model done by approximating the predictive mean: E(y) ≈ (1/T) Σ_{t=1}^{T} f^{Ŵ_t}(x). The first term in the predictive variance, σ², corresponds to the amount of noise inherent in the data (which will be explained in more detail soon). The second part of the predictive variance measures how much the model is uncertain about its predictions – this term will vanish when we have zero parameter uncertainty (i.e. when all draws Ŵ_t take the same constant value).\n\n2.2 Heteroscedastic Aleatoric Uncertainty\n\nIn the above we captured model uncertainty – uncertainty over the model parameters – by approximating the distribution p(W | X, Y). To capture aleatoric uncertainty in regression, we would have to tune the observation noise parameter σ.\n\nHomoscedastic regression assumes constant observation noise σ for every input point x. Heteroscedastic regression, on the other hand, assumes that observation noise can vary with input x [17, 18]. 
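The Monte Carlo estimators in equations (3) and (4) of §2.1 are straightforward to implement. Below is a minimal NumPy sketch (not the paper's code): the toy stochastic functions stand in for a dropout-masked network, and the seed, architectures, and number of passes T are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_classify(forward, x, T=50):
    # Eq. (3): average the softmax of T stochastic forward passes, then
    # summarise the uncertainty with the predictive entropy H(p).
    p = np.mean([softmax(forward(x)) for _ in range(T)], axis=0)
    entropy = -np.sum(p * np.log(p + 1e-12), axis=-1)
    return p, entropy

def mc_dropout_regress(forward, x, sigma2, T=50):
    # Eq. (4): predictive mean E(y) and variance; sigma2 is the fixed
    # observation-noise term, the rest is the epistemic spread of the samples.
    ys = np.stack([forward(x) for _ in range(T)])
    mean = ys.mean(axis=0)
    var = sigma2 + (ys ** 2).mean(axis=0) - mean ** 2
    return mean, var

# Toy stand-ins for a dropout-masked network (assumptions, not the paper's model):
def noisy_logits(x):
    return x + rng.normal(0.0, 0.5, size=x.shape)

def noisy_regressor(x):
    return 2.0 * x + rng.normal(0.0, 0.1)

p, H = mc_dropout_classify(noisy_logits, np.array([2.0, 0.0, -1.0]), T=200)
mu, var = mc_dropout_regress(noisy_regressor, 1.5, sigma2=0.04, T=200)
```

With more passes T the estimates tighten; note the regression variance can never fall below the observation-noise floor sigma2, mirroring the decomposition discussed after equation (4).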
Heteroscedastic models are useful in cases where parts of the observation space might have higher noise levels than others. In non-Bayesian neural networks, this observation noise parameter is often fixed as part of the model's weight decay, and ignored. However, when made data-dependent, it can be learned as a function of the data:\n\nL_NN(θ) = (1/N) Σ_{i=1}^{N} [ (1/(2σ(x_i)²)) ||y_i − f(x_i)||² + (1/2) log σ(x_i)² ]    (5)\n\nwith added weight decay parameterised by λ (and similarly for L1 loss). Note that here, unlike the above, variational inference is not performed over the weights, but instead we perform MAP inference – finding a single value for the model parameters θ. This approach does not capture epistemic model uncertainty, as epistemic uncertainty is a property of the model and not of the data.\n\nIn the next section we will combine these two types of uncertainties together in a single model. We will see how heteroscedastic noise can be interpreted as model attenuation, and develop a complementary approach for the classification case.\n\n3 Combining Aleatoric and Epistemic Uncertainty in One Model\n\nIn the previous section we described existing Bayesian deep learning techniques. In this section we present novel contributions which extend this existing literature. We develop models that will allow us to study the effects of modeling either aleatoric uncertainty alone, epistemic uncertainty alone, or modeling both uncertainties together in a single model. This is followed by an observation that aleatoric uncertainty in regression tasks can be interpreted as learned loss attenuation – making the loss more robust to noisy data. We follow that by extending the ideas of heteroscedastic regression to classification tasks. 
This allows us to learn loss attenuation for classification tasks as well.\n\n3.1 Combining Heteroscedastic Aleatoric Uncertainty and Epistemic Uncertainty\n\nWe wish to capture both epistemic and aleatoric uncertainty in a vision model. For this we turn the heteroscedastic NN in §2.2 into a Bayesian NN by placing a distribution over its weights, with our construction in this section developed specifically for the case of vision models¹.\n\n¹ Although this construction can be generalised for any heteroscedastic NN architecture.\n\nWe need to infer the posterior distribution for a BNN model f mapping an input image, x, to a unary output, ŷ ∈ R, and a measure of aleatoric uncertainty given by variance, σ². We approximate the posterior over the BNN with a dropout variational distribution using the tools of §2.1. As before, we draw model weights from the approximate posterior Ŵ ∼ q(W) to obtain a model output, this time composed of both predictive mean as well as predictive variance:\n\n[ŷ, σ̂²] = f^{Ŵ}(x)    (6)\n\nwhere f is a Bayesian convolutional neural network parametrised by model weights Ŵ. We can use a single network to transform the input x, with its head split to predict both ŷ as well as σ̂². We fix a Gaussian likelihood to model our aleatoric uncertainty. This induces a minimisation objective given labeled output points x:\n\nL_BNN(θ) = (1/D) Σ_i [ (1/(2σ̂_i²)) ||y_i − ŷ_i||² + (1/2) log σ̂_i² ]    (7)\n\nwhere D is the number of output pixels y_i corresponding to input image x, indexed by i (additionally, the loss includes weight decay which is omitted for brevity). For example, we may set D = 1 for image-level regression tasks, or D equal to the number of pixels for dense prediction tasks (predicting a unary corresponding to each input image pixel). 
σ̂_i² is the BNN output for the predicted variance for pixel i.\n\nThis loss consists of two components: the residual regression obtained with a stochastic sample through the model – making use of the uncertainty over the parameters – and an uncertainty regularization term. We do not need 'uncertainty labels' to learn uncertainty. Rather, we only need to supervise the learning of the regression task. We learn the variance, σ², implicitly from the loss function. The second regularization term prevents the network from predicting infinite uncertainty (and therefore zero loss) for all data points.\n\nIn practice, we train the network to predict the log variance, s_i := log σ̂_i²:\n\nL_BNN(θ) = (1/D) Σ_i [ (1/2) exp(−s_i) ||y_i − ŷ_i||² + (1/2) s_i ]    (8)\n\nThis is because it is more numerically stable than regressing the variance, σ², as the loss avoids a potential division by zero. The exponential mapping also allows us to regress unconstrained scalar values, where exp(−s_i) is resolved to the positive domain giving valid values for variance.\n\nTo summarize, the predictive uncertainty for pixel y in this combined model can be approximated using:\n\nVar(y) ≈ (1/T) Σ_{t=1}^{T} ŷ_t² − ((1/T) Σ_{t=1}^{T} ŷ_t)² + (1/T) Σ_{t=1}^{T} σ̂_t²    (9)\n\nwith {ŷ_t, σ̂_t²}_{t=1}^{T} a set of T sampled outputs: ŷ_t, σ̂_t² = f^{Ŵ_t}(x) for randomly masked weights Ŵ_t ∼ q(W).\n\n3.2 Heteroscedastic Uncertainty as Learned Loss Attenuation\n\nWe observe that allowing the network to predict uncertainty, allows it effectively to temper the residual loss by exp(−s_i), which depends on the data. This acts similarly to an intelligent robust regression function. 
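To make the attenuation in equation (8) concrete, here is a small NumPy sketch (our own illustration with made-up pixel values): predicting a larger log variance s on a noisy pixel reduces that pixel's contribution to the loss, and equation (9) combines the two uncertainty types at test time.

```python
import numpy as np

def attenuated_l2_loss(y, y_hat, s):
    # Eq. (8): per-pixel 0.5 * exp(-s) * ||y - y_hat||^2 + 0.5 * s,
    # with s = log(sigma^2) the predicted log variance.
    return np.mean(0.5 * np.exp(-s) * (y - y_hat) ** 2 + 0.5 * s)

def combined_predictive_variance(y_hats, sigma2_hats):
    # Eq. (9): spread of the T sampled means (epistemic part) plus the
    # average of the T sampled aleatoric variances.
    y_hats = np.asarray(y_hats)
    epistemic = (y_hats ** 2).mean(axis=0) - y_hats.mean(axis=0) ** 2
    aleatoric = np.asarray(sigma2_hats).mean(axis=0)
    return epistemic + aleatoric

y = np.array([1.0, 1.0])
y_hat = np.array([1.0, 3.0])               # second pixel has a large residual
loss_uniform = attenuated_l2_loss(y, y_hat, np.array([0.0, 0.0]))
loss_flagged = attenuated_l2_loss(y, y_hat, np.array([0.0, 2.0]))  # flag noisy pixel
var = combined_predictive_variance([[1.0, 2.0], [3.0, 2.0]],
                                   [[0.1, 0.1], [0.3, 0.1]])
```

Flagging the noisy pixel with high predicted variance lowers the loss it incurs, even though the (1/2)s term charges the model for doing so; this is the trade-off described in §3.2.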
It allows the network to adapt the residual's weighting, and even allows the network to learn to attenuate the effect from erroneous labels. This makes the model more robust to noisy data: inputs for which the model learned to predict high uncertainty will have a smaller effect on the loss.\n\nThe model is discouraged from predicting high uncertainty for all points – in effect ignoring the data – through the log σ² term. Large uncertainty increases the contribution of this term, and in turn penalizes the model: the model can learn to ignore the data – but is penalised for that. The model is also discouraged from predicting very low uncertainty for points with high residual error, as low σ² will exaggerate the contribution of the residual and will penalize the model. It is important to stress that this learned attenuation is not an ad-hoc construction, but a consequence of the probabilistic interpretation of the model.\n\n3.3 Heteroscedastic Uncertainty in Classification Tasks\n\nThis learned loss attenuation property of heteroscedastic NNs in regression is a desirable effect for classification models as well. However, heteroscedastic NNs in classification are peculiar models because technically any classification task has input-dependent uncertainty. Nevertheless, the ideas above can be extended from regression heteroscedastic NNs to classification heteroscedastic NNs.\n\nx̂_{i,t} = f_i^{W} + σ_i^{W} ε_t,   ε_t ∼ N(0, I)\nL_x = Σ_i log (1/T) Σ_{t=1}^{T} exp(x̂_{i,t,c} − log Σ_{c′} exp x̂_{i,t,c′})    (12)\n\nFor this we adapt the standard classification model to marginalise over intermediate heteroscedastic regression uncertainty placed over the logit space. 
We therefore explicitly refer to our proposed model adaptation as a heteroscedastic classification NN.\n\nFor classification tasks our NN predicts a vector of unaries f_i for each pixel i, which when passed through a softmax operation, forms a probability vector p_i. We change the model by placing a Gaussian distribution over the unaries vector:\n\nx̂_i | W ∼ N(f_i^{W}, (σ_i^{W})²)\np̂_i = Softmax(x̂_i)    (10)\n\nHere f_i^{W}, σ_i^{W} are the network outputs with parameters W. This vector f_i^{W} is corrupted with Gaussian noise with variance (σ_i^{W})² (a diagonal matrix with one element for each logit value), and the corrupted vector is then squashed with the softmax function to obtain p̂_i, the probability vector for pixel i.\n\nOur expected log likelihood for this model is given by:\n\nlog E_{N(x̂_i; f_i^{W}, (σ_i^{W})²)}[p̂_{i,c}]    (11)\n\nwith c the observed class for input i, which gives us our loss function. Ideally, we would want to analytically integrate out this Gaussian distribution, but no analytic solution is known. We therefore approximate the objective through Monte Carlo integration, and sample unaries through the softmax function. We note that this operation is extremely fast because we perform the computation once (passing inputs through the model to get logits). We only need to sample from the logits, which is a fraction of the network's compute, and therefore does not significantly increase the model's test time. Rewriting the above gives the numerically-stable stochastic loss in equation (12), with x̂_{i,t,c′} the c′-th element in the logit vector x̂_{i,t}.\n\nThis objective can be interpreted as learning loss attenuation, similarly to the regression case. 
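A sketch of the Monte Carlo classification loss in equation (12), again in NumPy with hypothetical logit values: the log-sum-exp identity keeps the sampled softmax likelihood numerically stable, and the second pair of calls illustrates the attenuation effect when the observed label disagrees with the logits.

```python
import numpy as np

rng = np.random.default_rng(1)

def logsumexp(z, axis=-1):
    # Numerically stable log of the sum of exponentials.
    m = z.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(z - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def stochastic_ce_loss(f, sigma, c, T=500):
    # Eq. (12) for one pixel: sample corrupted logits x_t = f + sigma * eps_t,
    # eps_t ~ N(0, I), and return the negative log of the Monte Carlo average
    # of the softmax likelihood of the observed class c.
    eps = rng.normal(size=(T, f.shape[0]))
    x = f + sigma * eps                           # (T, C) sampled logits
    log_p_c = x[:, c] - logsumexp(x, axis=-1)     # per-sample log softmax of c
    return -(logsumexp(log_p_c, axis=-1) - np.log(T))

f = np.array([2.0, 0.0, -1.0])       # hypothetical logits; class 0 is favoured
lo = np.full(3, 0.1)                 # small predicted logit noise
hi = np.full(3, 3.0)                 # large predicted logit noise

loss_agree_sharp = stochastic_ce_loss(f, lo, c=0)
loss_agree_soft = stochastic_ce_loss(f, hi, c=0)
loss_outlier_sharp = stochastic_ce_loss(f, lo, c=2)   # label disagrees with logits
loss_outlier_soft = stochastic_ce_loss(f, hi, c=2)
```

When the label agrees with the logits, adding logit noise only hurts; when the label is an outlier, predicting large logit noise attenuates the loss, the classification analogue of §3.2.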
We next assess the ideas above empirically.\n\n4 Experiments\n\nIn this section we evaluate our methods with pixel-wise depth regression and semantic segmentation. An analysis of these results is given in the following section. To show the robustness of our learned loss attenuation – a side-effect of modeling uncertainty – we present results on an array of popular datasets, CamVid, Make3D, and NYUv2 Depth, where we set new state-of-the-art benchmarks.\n\nFor the following experiments we use the DenseNet architecture [19] which has been adapted for dense prediction tasks by [20]. We use our own independent implementation of the architecture using TensorFlow [21] (which slightly outperforms the original authors' implementation on CamVid by 0.2%, see Table 1a). For all experiments we train with 224 × 224 crops of batch size 4, and then fine-tune on full-size images with a batch size of 1. We train with RMS-Prop with a constant learning rate of 0.001 and weight decay 10⁻⁴.\n\nWe compare the results of the Bayesian neural network models outlined in §3. We model epistemic uncertainty using Monte Carlo dropout (§2.1). The DenseNet architecture places dropout with p = 0.2 after each convolutional layer. Following [22], we use 50 Monte Carlo dropout samples. We model aleatoric uncertainty with MAP inference using loss functions (8) and (12 in the appendix), for regression and classification respectively (§2.2). However, we derive the loss function using a Laplacian prior, as opposed to the Gaussian prior used for the derivations in §3. This is because it results in a loss function which applies an L1 distance on the residuals. Typically, we find this to outperform L2 loss for regression tasks in vision. 
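For the Laplacian likelihood mentioned above, the attenuated loss takes an L1 form. The exact parameterisation is not spelled out in the text, so the sketch below is an assumption: we regress s as the log scale of a Laplace distribution, analogously to equation (8).

```python
import numpy as np

def attenuated_l1_loss(y, y_hat, s):
    # Laplace-likelihood analogue of Eq. (8): an L1 residual attenuated by
    # exp(-s), with s the predicted log scale. (Our assumed parameterisation;
    # the paper only states that a Laplacian prior yields an L1 residual loss.)
    return np.mean(np.exp(-s) * np.abs(y - y_hat) + s)

y = np.zeros(2)
y_hat = np.array([0.0, 4.0])                           # second pixel is an outlier
loss_uniform = attenuated_l1_loss(y, y_hat, np.zeros(2))
loss_flagged = attenuated_l1_loss(y, y_hat, np.array([0.0, 2.0]))
```

As in the Gaussian case, flagging the outlier pixel with a large predicted scale lowers its contribution to the loss; the L1 residual simply makes the un-attenuated part less sensitive to large errors in the first place.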
We model the benefit of combining both epistemic uncertainty as well as aleatoric uncertainty using our developments presented in §3.\n\n4.1 Semantic Segmentation\n\nTo demonstrate our method for semantic segmentation, we use two datasets, CamVid [8] and NYU v2 [23]. CamVid is a road scene understanding dataset with 367 training images and 233 test images, of day and dusk scenes, with 11 classes. We resize images to 360 × 480 pixels for training and evaluation. In Table 1a we present results for our architecture. Our method sets a new state-of-the-art on this dataset with mean intersection over union (IoU) score of 67.5%. We observe that modeling both aleatoric and epistemic uncertainty improves over the baseline result. The implicit attenuation obtained from the aleatoric loss provides a larger improvement than the epistemic uncertainty model. However, the combination of both uncertainties improves performance even further. This shows that for this application it is more important to model aleatoric uncertainty, suggesting that epistemic uncertainty can be mostly explained away in this large data setting.\n\nTable 1: Semantic segmentation performance. Modeling both aleatoric and epistemic uncertainty gives a notable improvement in segmentation accuracy over state of the art baselines.\n\n(a) CamVid dataset for road scene segmentation.\n\nCamVid | IoU\nSegNet [28] | 46.4\nFCN-8 [29] | 57.0\nDeepLab-LFOV [24] | 61.6\nBayesian SegNet [22] | 63.1\nDilation8 [30] | 65.3\nDilation8 + FSO [31] | 66.1\nDenseNet [20] | 66.9\nThis work:\nDenseNet (Our Implementation) | 67.1\n+ Aleatoric Uncertainty | 67.4\n+ Epistemic Uncertainty | 67.2\n+ Aleatoric & Epistemic | 67.5\n\n(b) NYUv2 40-class dataset for indoor scenes.\n\nNYUv2 40-class | Accuracy | IoU\nSegNet [28] | 66.1 | 23.6\nFCN-8 [29] | 61.8 | 31.6\nBayesian SegNet [22] | 68.0 | 32.4\nEigen and Fergus [32] | 65.6 | 34.1\nThis work:\nDeepLabLargeFOV | 70.1 | 36.5\n+ Aleatoric Uncertainty | 70.4 | 37.1\n+ Epistemic Uncertainty | 70.2 | 36.7\n+ Aleatoric & Epistemic | 70.6 | 37.3\n\nTable 2: Monocular depth regression performance. Comparison to previous approaches on depth regression dataset NYUv2 Depth. Modeling the combination of uncertainties improves accuracy.\n\n(a) Make3D depth dataset [25].\n\nMake3D | rel | rms | log10\nKarsch et al. [33] | 0.355 | 9.20 | 0.127\nLiu et al. [34] | 0.335 | 9.49 | 0.137\nLi et al. [35] | 0.278 | 7.19 | 0.092\nLaina et al. [26] | 0.176 | 4.46 | 0.072\nThis work:\nDenseNet Baseline | 0.167 | 3.92 | 0.064\n+ Aleatoric Uncertainty | 0.149 | 3.93 | 0.061\n+ Epistemic Uncertainty | 0.162 | 3.87 | 0.064\n+ Aleatoric & Epistemic | 0.149 | 4.08 | 0.063\n\n(b) NYUv2 depth dataset [23].\n\nNYU v2 Depth | rel | rms | log10 | δ1 | δ2 | δ3\nKarsch et al. [33] | 0.374 | 1.12 | 0.134 | - | - | -\nLadicky et al. [36] | - | - | - | 54.2% | 82.9% | 91.4%\nLiu et al. [34] | 0.335 | 1.06 | 0.127 | - | - | -\nLi et al. [35] | 0.232 | 0.821 | 0.094 | 62.1% | 88.6% | 96.8%\nEigen et al. [27] | 0.215 | 0.907 | - | 61.1% | 88.7% | 97.1%\nEigen and Fergus [32] | 0.158 | 0.641 | - | 76.9% | 95.0% | 98.8%\nLaina et al. [26] | 0.127 | 0.573 | 0.055 | 81.1% | 95.3% | 98.8%\nThis work:\nDenseNet Baseline | 0.117 | 0.517 | 0.051 | 80.2% | 95.1% | 98.8%\n+ Aleatoric Uncertainty | 0.112 | 0.508 | 0.046 | 81.6% | 95.8% | 98.8%\n+ Epistemic Uncertainty | 0.114 | 0.512 | 0.049 | 81.1% | 95.4% | 98.8%\n+ Aleatoric & Epistemic | 0.110 | 0.506 | 0.045 | 81.7% | 95.9% | 98.9%\n\nSecondly, NYUv2 [23] is a challenging indoor segmentation dataset with 40 different semantic classes. It has 1449 images with resolution 640 × 480 from 464 different indoor scenes. Table 1b shows our results. This dataset is much harder than CamVid because there is significantly less structure in indoor scenes compared to street scenes, and because of the increased number of semantic classes. We use DeepLabLargeFOV [24] as our baseline model. 
We observe a similar result (qualitative results given in Figure 4); we improve baseline performance by giving the model flexibility to estimate uncertainty and attenuate the loss. The effect is more pronounced, perhaps because the dataset is more difficult.\n\n4.2 Pixel-wise Depth Regression\n\nWe demonstrate the efficacy of our method for regression using two popular monocular depth regression datasets, Make3D [25] and NYUv2 Depth [23]. The Make3D dataset consists of 400 training and 134 testing images, gathered using a 3-D laser scanner. We evaluate our method using the same standard as [26], resizing images to 345 × 460 pixels and evaluating on pixels with depth less than 70m. NYUv2 Depth is taken from the same dataset used for classification above. It contains RGB-D imagery from 464 different indoor scenes. We compare to previous approaches for Make3D in Table 2a and NYUv2 Depth in Table 2b, using standard metrics (for a description of these metrics please see [27]).\n\nThese results show that aleatoric uncertainty is able to capture many aspects of this task which are inherently difficult. For example, in the qualitative results in Figures 5 and 6 we observe that aleatoric uncertainty is greater for large depths, reflective surfaces and occlusion boundaries in the image. These are common failure modes of monocular depth algorithms [26]. On the other hand, these qualitative results show that epistemic uncertainty captures difficulties due to lack of data. For\n\n(a) Classification (CamVid)   (b) Regression (Make3D)\n\nFigure 2: Precision Recall plots demonstrating both measures of uncertainty can effectively capture accuracy, as precision decreases with increasing uncertainty.\n\n(a) Regression (Make3D)   (b) Classification (CamVid)\n\nFigure 3: Uncertainty calibration plots. This plot shows how well uncertainty is calibrated, where perfect calibration corresponds to the line y = x, shown in black. 
We observe an improvement in calibration mean\nsquared error with aleatoric, epistemic and the combination of uncertainties.\n\nexample, we observe larger uncertainty for objects which are rare in the training set such as humans\nin the third example of Figure 5.\nIn summary, we have demonstrated that our model can improve performance over non-Bayesian\nbaselines by implicitly learning attenuation of systematic noise and dif\ufb01cult concepts. For example\nwe observe high aleatoric uncertainty for distant objects and on object and occlusion boundaries.\n\n5 Analysis: What Do Aleatoric and Epistemic Uncertainties Capture?\n\nIn \u00a74 we showed that modeling aleatoric and epistemic uncertainties improves prediction perfor-\nmance, with the combination performing even better. In this section we wish to study the effec-\ntiveness of modeling aleatoric and epistemic uncertainty.\nIn particular, we wish to quantify the\nperformance of these uncertainty measurements and analyze what they capture.\n5.1 Quality of Uncertainty Metric\nFirstly, in Figure 2 we show precision-recall curves for regression and classi\ufb01cation models. They\nshow how our model performance improves by removing pixels with uncertainty larger than various\npercentile thresholds. This illustrates two behaviors of aleatoric and epistemic uncertainty measures.\nFirstly, it shows that the uncertainty measurements are able to correlate well with accuracy, because\nall curves are strictly decreasing functions. We observe that precision is lower when we have more\npoints that the model is not certain about. Secondly, the curves for epistemic and aleatoric uncer-\ntainty models are very similar. This shows that each uncertainty ranks pixel con\ufb01dence similarly to\nthe other uncertainty, in the absence of the other uncertainty. 
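The precision-recall curves in Figure 2 are formed by progressively discarding the most uncertain pixels and recomputing accuracy on the remainder. A minimal sketch of this procedure, assuming per-pixel predictions, labels and uncertainty scores as flat NumPy arrays (all names here are illustrative, not from the paper's code):

```python
import numpy as np

def precision_recall_from_uncertainty(pred, label, uncertainty,
                                      percentiles=np.arange(100, 0, -10)):
    """For each percentile threshold, keep only pixels whose uncertainty falls
    below that percentile and recompute precision (pixel accuracy) on them."""
    curve = []
    for p in percentiles:
        thresh = np.percentile(uncertainty, p)
        keep = uncertainty <= thresh                      # discard most uncertain pixels
        recall = keep.mean()                              # fraction of pixels retained
        precision = (pred[keep] == label[keep]).mean()    # accuracy on retained pixels
        curve.append((recall, precision))
    return curve

# Toy data: errors concentrated where uncertainty is high, so precision
# should rise as the most uncertain pixels are removed (recall decreases).
rng = np.random.default_rng(0)
label = rng.integers(0, 2, 10000)
uncertainty = rng.random(10000)
is_error = rng.random(10000) < 0.4 * uncertainty
pred = np.where(is_error, 1 - label, label)
curve = precision_recall_from_uncertainty(pred, label, uncertainty)
```

A strictly decreasing precision-recall curve, as in Figure 2, indicates the uncertainty measure ranks pixels consistently with their error rate.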
This suggests that when only one uncertainty is explicitly modeled, it attempts to compensate for the lack of the alternative uncertainty when possible.

Secondly, in Figure 3 we analyze the quality of our uncertainty measurement using calibration plots from our model on the test set. To form calibration plots for classification models, we discretize our model's predicted probabilities into a number of bins, for all classes and all pixels in the test set. We then plot the frequency of correctly predicted labels for each bin of probability values. Better performing uncertainty estimates should correlate more accurately with the line y = x in the calibration plots. For regression models, we can form calibration plots by comparing the frequency of residuals lying within varying thresholds of the predicted distribution. Figure 3 shows the calibration of our classification and regression uncertainties.

[Figure 3 legend, calibration MSE: Regression (Make3D): Aleatoric 0.031, Epistemic 0.00364; Classification (CamVid): Non-Bayesian 0.00501, Aleatoric 0.00272, Epistemic 0.007, Epistemic+Aleatoric 0.00214.]

Train dataset   Test dataset   RMS    Aleatoric variance   Epistemic variance
Make3D / 4      Make3D         5.76   0.506                7.73
Make3D / 2      Make3D         4.62   0.521                4.38
Make3D          Make3D         3.87   0.485                2.78
Make3D / 4      NYUv2          -      0.388                15.0
Make3D          NYUv2          -      0.461                4.87

(a) Regression

Train dataset   Test dataset   IoU    Aleatoric entropy   Epistemic logit variance (×10⁻³)
CamVid / 4      CamVid         57.2   0.106               1.96
CamVid / 2      CamVid         62.9   0.156               1.66
CamVid          CamVid         67.5   0.111               1.36
CamVid / 4      NYUv2          -      0.247               10.9
CamVid          NYUv2          -      0.264               11.8

(b) Classification

Table 3: Accuracy and aleatoric and epistemic uncertainties for a range of different train and test dataset combinations. We show aleatoric and epistemic uncertainty as the mean value over all pixels in the test dataset. We compare reduced training set sizes (1, 1/2, 1/4) and unrelated test datasets. This shows that aleatoric uncertainty remains approximately constant, while epistemic uncertainty decreases the closer the test data is to the training distribution, demonstrating that epistemic uncertainty can be explained away with sufficient training data (but not for out-of-distribution data).

5.2 Uncertainty with Distance from Training Data

In this section we show two results:

1. Aleatoric uncertainty cannot be explained away with more data,

2. Aleatoric uncertainty does not increase for out-of-data examples (situations different from the training set), whereas epistemic uncertainty does.

In Table 3 we give accuracy and uncertainty for models trained on increasingly large subsets of the datasets. This shows that epistemic uncertainty decreases as the training dataset gets larger. It also shows that aleatoric uncertainty remains relatively constant and cannot be explained away with more data. Testing the models with a different test set (bottom two lines) shows that epistemic uncertainty increases considerably on those test points which lie far from the training sets.

These results reinforce the case that epistemic uncertainty can be explained away with enough data, but is required to capture situations not encountered in the training set.
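The summary numbers in Table 3 are mean per-pixel uncertainties over a test set, obtained from stochastic forward passes with dropout left active at test time. A minimal sketch of how such summaries might be computed (the stand-in `fake_forward`, the noise levels and T=50 are illustrative assumptions, not the paper's implementation; epistemic uncertainty is the variance of the sampled outputs, aleatoric the mean of the predicted variance head):

```python
import numpy as np

def mc_dropout_summaries(stochastic_forward, x, T=50):
    """Run T stochastic forward passes and summarize uncertainty.

    `stochastic_forward` returns (output, log_var) per call, both per-pixel.
    Epistemic uncertainty: variance of outputs across the T dropout samples.
    Aleatoric uncertainty: mean of the predicted variance exp(log_var).
    """
    outputs, log_vars = zip(*(stochastic_forward(x) for _ in range(T)))
    outputs = np.stack(outputs)        # shape (T, num_pixels)
    log_vars = np.stack(log_vars)
    epistemic = outputs.var(axis=0)    # spread across dropout samples
    aleatoric = np.exp(log_vars).mean(axis=0)
    # Table-3-style scalars: mean over all pixels in the test set
    return epistemic.mean(), aleatoric.mean()

# Stand-in for a dropout network: additive noise with std 0.1 mimics weight
# uncertainty; a fixed predicted log-variance mimics the aleatoric head.
rng = np.random.default_rng(1)
def fake_forward(x):
    return x + rng.normal(0.0, 0.1, size=x.shape), np.full_like(x, np.log(0.25))

x = np.zeros(5000)
epistemic_mean, aleatoric_mean = mc_dropout_summaries(fake_forward, x, T=50)
```

Under this setup the epistemic summary recovers roughly the injected sampling variance (0.1² = 0.01), while the aleatoric summary recovers the predicted variance (0.25) regardless of T, mirroring Table 3's observation that only epistemic uncertainty responds to the model's confidence.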
This is particularly important for safety-critical systems, where epistemic uncertainty is required to detect situations which have never been seen by the model before.

5.3 Real-Time Application

Our model based on DenseNet [20] can process a 640×480 resolution image in 150ms on an NVIDIA Titan X GPU. The aleatoric uncertainty models add negligible compute. However, epistemic models require expensive Monte Carlo dropout sampling. For models such as ResNet [5], this is possible to achieve economically because only the last few layers contain dropout. Other models, like DenseNet, require the entire architecture to be sampled. This is difficult to parallelize due to GPU memory constraints, and often results in a 50× slow-down for 50 Monte Carlo samples.

6 Conclusions

We presented a novel Bayesian deep learning framework to learn a mapping to aleatoric uncertainty from the input data, which is composed on top of epistemic uncertainty models. We derived our framework for both regression and classification applications. We showed that it is important to model aleatoric uncertainty for:

• Large data situations, where epistemic uncertainty is explained away,
• Real-time applications, because we can form aleatoric models without expensive Monte Carlo samples.

And epistemic uncertainty is important for:

• Safety-critical applications, because epistemic uncertainty is required to understand examples which are different from the training data,
• Small datasets, where the training data is sparse.

However, aleatoric and epistemic uncertainty models are not mutually exclusive. We showed that the combination is able to achieve new state-of-the-art results on depth regression and semantic segmentation benchmarks.

The first paragraph in this paper posed two recent disasters which could have been averted by real-time Bayesian deep learning tools.
Therefore, we leave finding a method for real-time epistemic uncertainty in deep learning as an important direction for future research.

References

[1] NHTSA. PE 16-007. Technical report, U.S. Department of Transportation, National Highway Traffic Safety Administration, Jan 2017. Tesla Crash Preliminary Evaluation Report.

[2] Jessica Guynn. Google photos labeled black people 'gorillas'. USA Today, 2015.

[3] Andrew Blake, Rupert Curwen, and Andrew Zisserman. A framework for spatiotemporal control in the tracking of visual contours. International Journal of Computer Vision, 11(2):127–145, 1993.

[4] Xuming He, Richard S Zemel, and Miguel Á Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2004.

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[6] Y. Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

[7] Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epistemic? Does it matter? Structural Safety, 31(2):105–112, 2009.

[8] Gabriel J Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.

[9] John Denker and Yann LeCun. Transforming neural-net output levels to probability distributions. In Advances in Neural Information Processing Systems 3. Citeseer, 1991.

[10] David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

[11] Radford M Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.

[12] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356, 2011.

[13] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In ICML, 2015.

[14] José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, Thang Bui, and Richard E Turner. Black-box alpha divergence minimization. In Proceedings of The 33rd International Conference on Machine Learning, pages 1511–1520, 2016.

[15] Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with Bernoulli approximate variational inference. ICLR workshop track, 2016.

[16] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

[17] David A Nix and Andreas S Weigend. Estimating the mean and variance of the target probability distribution. In Neural Networks, 1994. IEEE World Congress on Computational Intelligence., 1994 IEEE International Conference On, volume 1, pages 55–60. IEEE, 1994.

[18] Quoc V Le, Alex J Smola, and Stéphane Canu. Heteroscedastic Gaussian process regression. In Proceedings of the 22nd International Conference on Machine Learning, pages 489–496. ACM, 2005.

[19] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.

[20] Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, and Yoshua Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. arXiv preprint arXiv:1611.09326, 2016.

[21] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI). Savannah, Georgia, USA, 2016.

[22] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.

[23] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.

[24] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv preprint arXiv:1412.7062, 2014.

[25] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, 2009.

[26] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 239–248. IEEE, 2016.

[27] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.

[28] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for scene segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[29] Evan Shelhamer, Jonathon Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

[30] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

[31] Abhijit Kundu, Vibhav Vineet, and Vladlen Koltun. Feature space optimization for semantic video segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3168–3175, 2016.

[32] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.

[33] Kevin Karsch, Ce Liu, and Sing Bing Kang. Depth extraction from video using non-parametric sampling. In European Conference on Computer Vision, pages 775–788. Springer, 2012.

[34] Miaomiao Liu, Mathieu Salzmann, and Xuming He. Discrete-continuous depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 716–723, 2014.

[35] Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel, and Mingyi He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1119–1127, 2015.

[36] Lubor Ladicky, Jianbo Shi, and Marc Pollefeys. Pulling things out of perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 89–96, 2014.