{"title": "Concrete Dropout", "book": "Advances in Neural Information Processing Systems", "page_first": 3581, "page_last": 3590, "abstract": "Dropout is used as a practical tool to obtain uncertainty estimates in large vision models and reinforcement learning (RL) tasks. But to obtain well-calibrated uncertainty estimates, a grid-search over the dropout probabilities is necessary\u2014a prohibitive operation with large models, and an impossible one with RL. We propose a new dropout variant which gives improved performance and better calibrated uncertainties. Relying on recent developments in Bayesian deep learning, we use a continuous relaxation of dropout\u2019s discrete masks. Together with a principled optimisation objective, this allows for automatic tuning of the dropout probability in large models, and as a result faster experimentation cycles. In RL this allows the agent to adapt its uncertainty dynamically as more data is observed. We analyse the proposed variant extensively on a range of tasks, and give insights into common practice in the field where larger dropout probabilities are often used in deeper model layers.", "full_text": "Concrete Dropout\n\nYarin Gal\n\nyarin.gal@eng.cam.ac.uk\n\nUniversity of Cambridge\n\nand Alan Turing Institute, London\n\nJiri Hron\n\njh2084@cam.ac.uk\n\nUniversity of Cambridge\n\nAlex Kendall\n\nagk34@cam.ac.uk\n\nUniversity of Cambridge\n\nAbstract\n\nDropout is used as a practical tool to obtain uncertainty estimates in large vision\nmodels and reinforcement learning (RL) tasks. But to obtain well-calibrated\nuncertainty estimates, a grid-search over the dropout probabilities is necessary\u2014\na prohibitive operation with large models, and an impossible one with RL. We\npropose a new dropout variant which gives improved performance and better\ncalibrated uncertainties. Relying on recent developments in Bayesian deep learning,\nwe use a continuous relaxation of dropout\u2019s discrete masks. 
Together with a principled optimisation objective, this allows for automatic tuning of the dropout probability in large models, and as a result faster experimentation cycles. In RL this allows the agent to adapt its uncertainty dynamically as more data is observed. We analyse the proposed variant extensively on a range of tasks, and give insights into common practice in the field where larger dropout probabilities are often used in deeper model layers.

1 Introduction

Well-calibrated uncertainty is crucial for many tasks in deep learning, from the detection of adversarial examples [25], through an agent exploring its environment safely [10, 18], to analysing failure cases in autonomous driving vision systems [20]. Tasks such as these depend on good uncertainty estimates to perform well, with miscalibrated uncertainties in reinforcement learning (RL) having the potential to lead to over-exploration of the environment. Or, much worse, miscalibrated uncertainty in an autonomous driving vision system could lead to a failure to detect the system's own ignorance about the world, resulting in the loss of human life [29].
A principled technique for obtaining uncertainty in models such as the above is Bayesian inference, with dropout [9, 14] being a practical inference approximation. In dropout inference the neural network is trained with dropout at training time, and at test time the output is evaluated by dropping units randomly to generate samples from the predictive distribution [9]. But to get well-calibrated uncertainty estimates it is necessary to adapt the dropout probability as a variational parameter to the data at hand [7]. In previous works this was done through a grid-search over the dropout probabilities [9]. Grid-search can, however, pose difficulties in certain tasks.
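The dropout inference procedure just described — training with dropout, then keeping it active at test time and averaging several stochastic forward passes — can be sketched in a few lines. This is a minimal NumPy illustration, not the implementation of [9]; the two-layer network and its shapes are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, W1, b1, W2, b2, p=0.1, T=100):
    """Monte Carlo dropout: keep dropout active at test time and
    average T stochastic forward passes to approximate the
    predictive mean; their spread estimates the uncertainty."""
    outputs = []
    for _ in range(T):
        h = np.maximum(x @ W1 + b1, 0.0)
        # sample a dropout mask over hidden units (keep prob 1 - p)
        mask = rng.binomial(1, 1.0 - p, size=h.shape) / (1.0 - p)
        outputs.append((h * mask) @ W2 + b2)
    outputs = np.stack(outputs)
    return outputs.mean(axis=0), outputs.std(axis=0)

# toy usage: 1-d input, 32 hidden units, 1-d output
W1 = rng.normal(size=(1, 32)); b1 = np.zeros(32)
W2 = rng.normal(size=(32, 1)); b2 = np.zeros(1)
mean, std = mc_dropout_predict(np.array([[0.5]]), W1, b1, W2, b2)
```

The dropout probability p here is held fixed, which is precisely the limitation the paper addresses: p itself should be adapted to the data.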
Grid-search is a prohibitive operation with large models such as the ones used in Computer Vision [19, 20], where multiple GPUs would be used to train a single model. Grid-searching over the dropout probability in such models would require either an immense waste of computational resources, or extremely prolonged experimentation cycles. Moreover, the number of possible per-layer dropout configurations grows exponentially as the number of model layers increases. Researchers have therefore restricted the grid-search to a small number of possible dropout values to make such a search feasible [8], which in turn might hurt uncertainty calibration in vision models for autonomous systems.
In other tasks a grid-search over the dropout probabilities is impossible altogether. In tasks where the amount of data changes over time, for example, the dropout probability should be decreased as the amount of data increases [7]. This is because the dropout probability has to diminish to zero in the limit of data—with the model explaining away its uncertainty completely (this is explained in more detail in §2). RL is an example setting where the dropout probability has to be adapted dynamically.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The amount of data collected by the agent increases steadily with each episode, and in order to reduce the agent's uncertainty, the dropout probability must be decreased. Grid-searching over the dropout probability is impossible in this setting, as the agent would have to be reset and re-trained on the entire dataset with each newly acquired episode. A method to tune the dropout probability which results in good accuracy and uncertainty estimates is therefore needed.
Existing literature on tuning the dropout probability is sparse.
Current methods include the optimisation of α in Gaussian dropout following its variational interpretation [23], and overlaying a binary belief network to optimise the dropout probabilities as a function of the inputs [2]. The latter approach is of limited practicality with large models due to the increase in model size. With the former approach [23], practical use reveals some unforeseen difficulties [28]. Most notably, the α values have to be truncated at 1, as the KL approximation would diverge otherwise. In practice the method under-performs.
In this work we propose a new practical dropout variant which can be seen as a continuous relaxation of the discrete dropout technique. Relying on recent techniques in Bayesian deep learning [16, 27], together with appropriate regularisation terms derived from dropout's Bayesian interpretation, our variant allows the dropout probability to be tuned using gradient methods. This results in better-calibrated uncertainty estimates in large models, avoiding the coarse and expensive grid-search over the dropout probabilities. Further, this allows us to use dropout in RL tasks in a principled way.
We analyse the behaviour of our proposed dropout variant on a wide variety of tasks. We study its ability to capture different types of uncertainty on a simple synthetic dataset with known ground truth uncertainty, and show how its behaviour changes with increasing amounts of data versus model size. We show improved accuracy and uncertainty on popular datasets in the field, and further demonstrate our variant on large models used in the Computer Vision community, showing a significant reduction in experiment time as well as improved model performance and uncertainty calibration.
We demonstrate our dropout variant in a model-based RL task, showing that the agent\nautomatically reduces its uncertainty as the amount of data increases, and give insights into common\npractice in the \ufb01eld where a small dropout probability is often used with the shallow layers of a\nmodel, and a large dropout probability used with the deeper layers.\n\n2 Background\n\nIn order to understand the relation between a model\u2019s uncertainty and the dropout probability, we\nstart with a slightly philosophical discussion of the different types of uncertainty available to us. This\ndiscussion will be grounded in the development of new tools to better understand these uncertainties\nin the next section.\nThree types of uncertainty are often encountered in Bayesian modelling. Epistemic uncertainty\ncaptures our ignorance about the models most suitable to explain our data; Aleatoric uncertainty\ncaptures noise inherent in the environment; Lastly, predictive uncertainty conveys the model\u2019s\nuncertainty in its output. Epistemic uncertainty reduces as the amount of observed data increases\u2014\nhence its alternative name \u201creducible uncertainty\u201d. When dealing with models over functions, this\nuncertainty can be captured through the range of possible functions and the probability given to\neach function. This uncertainty is often summarised by generating function realisations from our\ndistribution and estimating the variance of the functions when evaluated on a \ufb01xed set of inputs.\nAleatoric uncertainty captures noise sources such as measurement noise\u2014noises which cannot be\nexplained away even if more data were available (although this uncertainty can be reduced through\nthe use of higher precision sensors for example). This uncertainty is often modelled as part of the\nlikelihood, at the top of the model, where we place some noise corruption process on the function\u2019s\noutput. 
Gaussian corrupting noise is often assumed in regression, although other noise sources are popular as well, such as Laplace noise. By inferring the Gaussian likelihood's precision parameter τ, for example, we can estimate the amount of aleatoric noise inherent in the data.
Combining both types of uncertainty gives us the predictive uncertainty—the model's confidence in its prediction, taking into account noise it can explain away and noise it cannot. This uncertainty is often obtained by generating multiple functions from our model and corrupting them with noise (with precision τ). Calculating the variance of these outputs on a fixed set of inputs we obtain the model's predictive uncertainty. This uncertainty has different properties for different inputs. Inputs near the training data will have a smaller epistemic uncertainty component, while inputs far away from the training data will have higher epistemic uncertainty. Similarly, some parts of the input space might have larger aleatoric uncertainty than others, with these inputs producing larger measurement error for example. These different types of uncertainty are of great importance in fields such as AI safety [1] and autonomous decision making, where the model's epistemic uncertainty can be used to avoid making uninformed decisions with potentially life-threatening implications [20].
When using dropout neural networks (or any other stochastic regularisation technique), a randomly drawn masked weight matrix corresponds to a function draw [7]. Therefore, the dropout probability, together with the weight configuration of the network, determine the magnitude of the epistemic uncertainty. For a fixed dropout probability p, high magnitude weights will result in higher output variance, i.e. higher epistemic uncertainty.
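The combination just described — epistemic variance across function draws plus the aleatoric noise variance 1/τ from the likelihood — can be written out directly. Below is a small sketch in which synthetic draws stand in for real network outputs; the numbers are illustrative only:

```python
import numpy as np

def predictive_variance(function_draws, tau):
    """Predictive variance = variance across stochastic function
    draws (epistemic part) + likelihood noise variance 1/tau
    (aleatoric part), as described in the text."""
    epistemic = np.var(function_draws, axis=0)
    aleatoric = 1.0 / tau
    return epistemic + aleatoric

# illustrative: 1000 function draws of a scalar output with spread
# 0.3 (epistemic std), and precision tau = 4 (aleatoric std 0.5)
rng = np.random.default_rng(1)
draws = rng.normal(loc=2.0, scale=0.3, size=1000)
total_var = predictive_variance(draws, tau=4.0)
```

Near the training data the first term shrinks; far from it the first term dominates, matching the behaviour described above.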
With a fixed p, a model wanting to decrease its epistemic uncertainty will have to reduce its weight magnitude (and set the weights to be exactly zero to have zero epistemic uncertainty). Of course, this is impossible, as the model will not be able to explain the data well with zero weight matrices; therefore some balance between desired output variance and weight magnitude is achieved¹. For uncertainty representation, this can be seen as a degeneracy with the model when the dropout probability is held fixed.
Allowing the probability to change (for example by grid-searching it to maximise validation log-likelihood [9]) will let the model decrease its epistemic uncertainty by choosing smaller dropout probabilities. But if we wish to replace the grid-search with a gradient method, we need to define an optimisation objective to optimise p with respect to. This is not trivial, as our aim is not to maximise model performance, but rather to obtain good epistemic uncertainty. What is a suitable objective for this? This is discussed next.

3 Concrete Dropout

One of the difficulties with the approach above is that grid-searching over the dropout probability can be expensive and time-consuming, especially when done with large models. Even worse, when operating in a continuous learning setting such as reinforcement learning, the model should collapse its epistemic uncertainty as it collects more data. When grid-searching this means that data has to be set aside such that a new model could be trained with a smaller dropout probability when the dataset is large enough. This is infeasible in many RL tasks. Instead, the dropout probability can be optimised using a gradient method, where we seek to minimise some objective with respect to (w.r.t.) that parameter.
A suitable objective follows dropout's variational interpretation [7].
Following the variational interpretation, dropout is seen as an approximating distribution q_θ(ω) to the posterior in a Bayesian neural network with a set of random weight matrices ω = {W_l}_{l=1}^L with L layers and θ the set of variational parameters. The optimisation objective that follows from the variational interpretation can be written as:

    L̂_MC(θ) = −(1/M) ∑_{i∈S} log p(y_i | f^ω(x_i)) + (1/N) KL(q_θ(ω) || p(ω))    (1)

with θ the parameters to optimise, N the number of data points, S a random set of M data points, f^ω(x_i) the neural network's output on input x_i when evaluated with weight matrices realisation ω, and p(y_i | f^ω(x_i)) the model's likelihood, e.g. a Gaussian with mean f^ω(x_i). The KL term KL(q_θ(ω) || p(ω)) is a "regularisation" term which ensures that the approximate posterior q_θ(ω) does not deviate too far from the prior distribution p(ω). A note on our choice for a prior is given in appendix B. Assume that the set of variational parameters for the dropout distribution satisfies θ = {M_l, p_l}_{l=1}^L, a set of mean weight matrices and dropout probabilities such that q_θ(ω) = ∏_l q_{M_l}(W_l) and q_{M_l}(W_l) = M_l · diag[Bernoulli(1 − p_l)^{K_l}] for a single random weight matrix W_l of dimensions K_{l+1} by K_l. The KL term can be approximated well following [7]:

    KL(q_θ(ω) || p(ω)) = ∑_{l=1}^{L} KL(q_{M_l}(W_l) || p(W_l))    (2)

    KL(q_M(W) || p(W)) ∝ (l²(1 − p) / 2) ||M||² − K H(p)    (3)

with

    H(p) := −p log p − (1 − p) log(1 − p)    (4)

the entropy of a Bernoulli random variable with probability p.
The entropy term can be seen as a dropout regularisation term. This regularisation term depends on the dropout probability p alone, which means that the term is constant w.r.t. model weights.

¹This raises an interesting hypothesis: does dropout work well because it forces the weights to be near zero, i.e. regularising the weights? We will comment on this later.
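Equations (3)–(4) can be evaluated directly for a given mean weight matrix. The sketch below drops the proportionality constant of eq. (3), treats the prior length-scale l as a free hyper-parameter, and assumes K is the layer's input dimension — all simplifying choices of this sketch, not details taken from the paper's appendix:

```python
import numpy as np

def dropout_kl_term(M, p, length_scale):
    """Per-layer KL term following eq. (3)-(4):
    l^2 (1 - p) / 2 * ||M||^2  -  K * H(p),
    with H(p) the Bernoulli entropy. M is the layer's mean weight
    matrix of shape (K_out, K_in); K = K_in in this sketch."""
    K = M.shape[1]
    weight_term = (length_scale ** 2) * (1.0 - p) / 2.0 * np.sum(M ** 2)
    entropy = -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)
    return weight_term - K * entropy

# minimising this term pulls p towards 0.5 via the entropy, and
# towards larger p (more dropout) when ||M||^2 is large
kl_half = dropout_kl_term(np.ones((2, 3)), p=0.5, length_scale=1.0)
```

Note how the two terms pull in opposite directions, which is what makes p learnable rather than degenerate.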
For this reason the term can be omitted when the dropout probability is not optimised, but the term is crucial when it is optimised. Minimising the KL divergence between q_M(W) and the prior is equivalent to maximising the entropy of a Bernoulli random variable with probability 1 − p. This pushes the dropout probability towards 0.5—the highest it can attain. The scaling of the regularisation term means that large models will push the dropout probability towards 0.5 much more than smaller models, but as the amount of data N increases the dropout probability will be pushed towards 0 (because of the first term in eq. (1)).
We need to evaluate the derivative of the last optimisation objective eq. (1) w.r.t. the parameter p. Several estimators are available for us to do this: for example the score function estimator (also known as a likelihood ratio estimator and Reinforce [6, 12, 30, 35]), or the pathwise derivative estimator (this estimator is also referred to in the literature as the re-parametrisation trick, infinitesimal perturbation analysis, and stochastic backpropagation [11, 22, 31, 34]). The score function estimator is known to have extremely high variance in practice, making optimisation difficult. Following early experimentation with the score function estimator, it was evident that the increase in variance was not manageable. The pathwise derivative estimator is known to have much lower variance than the score function estimator in many applications, and indeed was used by [23] with Gaussian dropout. However, unlike the Gaussian dropout setting, in our case we need to optimise the parameter of a Bernoulli distribution. The pathwise derivative estimator assumes that the distribution at hand can be re-parametrised in the form g(θ, ε) with θ the distribution's parameters, and ε a random variable which does not depend on θ.
This cannot be done with the Bernoulli distribution.
Instead, we replace dropout's discrete Bernoulli distribution with its continuous relaxation. More specifically, we use the Concrete distribution relaxation. This relaxation allows us to re-parametrise the distribution and use the low variance pathwise derivative estimator instead of the score function estimator.
The Concrete distribution is a continuous distribution used to approximate discrete random variables, suggested in the context of latent random variables in deep generative models [16, 27]. One way to view the distribution is as a relaxation of the "max" function in the Gumbel-max trick to a "softmax" function, which allows the discrete random variable z to be written in the form z̃ = g(θ, ε) with parameters θ, and ε a random variable which does not depend on θ.
We will concentrate on the binary random variable case (i.e. a Bernoulli distribution). Instead of sampling the random variable from the discrete Bernoulli distribution (generating zeros and ones) we sample realisations from the Concrete distribution with some temperature t which results in values in the interval [0, 1]. This distribution concentrates most mass on the boundaries of the interval, 0 and 1. In fact, for the one-dimensional case here with the Bernoulli distribution, the Concrete distribution relaxation z̃ of the Bernoulli random variable z reduces to a simple sigmoid distribution which has a convenient parametrisation:

    z̃ = sigmoid( (1/t) · (log p − log(1 − p) + log u − log(1 − u)) )    (5)

with uniform u ∼ Unif(0, 1). This relation between u and z̃ is depicted in figure 10 in appendix A. Here u is a random variable which does not depend on our parameter p. The functional relation between z̃ and u is differentiable w.r.t.
p.\nWith the Concrete relaxation of the dropout masks, it is now possible to optimise the dropout\nprobability using the pathwise derivative estimator. We refer to this Concrete relaxation of the\ndropout masks as Concrete Dropout. A Python code snippet for Concrete dropout in Keras [5] is\ngiven in appendix C, spanning about 20 lines of code, and experiment code is given online2. We next\nassess the proposed dropout variant empirically on a large array of tasks.\n\n2https://github.com/yaringal/ConcreteDropout\n\n4\n\n\f4 Experiments\n\nWe next analyse the behaviour of our proposed dropout variant on a wide variety of tasks. We study\nhow our dropout variant captures different types of uncertainty on a simple synthetic dataset with\nknown ground truth uncertainty, and show how its behaviour changes with increasing amounts of\ndata versus model size (\u00a74.1). We show that Concrete dropout matches the performance of hand-\ntuned dropout on the UCI datasets (\u00a74.2) and MNIST (\u00a74.3), and further demonstrate our variant\non large models used in the Computer Vision community (\u00a74.4). We show a signi\ufb01cant reduction\nin experiment time as well as improved model performance and uncertainty calibration. Lastly, we\ndemonstrate our dropout variant in a model-based RL task extending on [10], showing that the agent\ncorrectly reduces its uncertainty dynamically as the amount of data increases (\u00a74.5).\nWe compare the performance of hand-tuned dropout to our Concrete dropout variant in the following\nexperiments. 
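The sampling procedure of eq. (5) in the previous section takes only a few lines. This NumPy sketch is not the Keras implementation from appendix C, and the temperature t = 0.1 is an illustrative choice of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def concrete_dropout_mask(p, shape, t=0.1):
    """Sample a relaxed dropout mask via eq. (5):
    z_tilde = sigmoid((log p - log(1-p) + log u - log(1-u)) / t),
    with u ~ Unif(0, 1). z_tilde acts as a relaxed 'drop' indicator
    (close to 1 with probability ~p for small t), and the expression
    is differentiable w.r.t. p, enabling the pathwise derivative
    estimator in place of the high-variance score function estimator."""
    eps = 1e-7  # numerical guard for the logs
    u = rng.uniform(size=shape)
    logits = (np.log(p + eps) - np.log(1.0 - p + eps)
              + np.log(u + eps) - np.log(1.0 - u + eps)) / t
    z_tilde = 1.0 / (1.0 + np.exp(-logits))
    # in this sketch the returned keep-mask is 1 - z_tilde:
    # ~1 for kept units, ~0 for dropped units
    return 1.0 - z_tilde

mask = concrete_dropout_mask(p=0.2, shape=(100_000,))
# most entries lie close to 0 or 1; on average ~80% of units are kept
```

In an actual layer, p would be a trainable parameter (e.g. a sigmoid of an unconstrained logit) so that gradients of the objective flow through the mask into p.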
We chose not to compare to Gaussian dropout in our experiments, as when optimising Gaussian dropout's α following its variational interpretation [23], the method is known to under-perform [28] (however, Gal [7] compared Gaussian dropout to Bernoulli dropout and found that when optimising the dropout probability by hand, the two methods perform similarly).

4.1 Synthetic data

The tools above allow us to separate both epistemic and aleatoric uncertainties with ease. We start with an analysis of how different uncertainties behave with different data sizes. For this we optimise both the dropout probability p as well as the (per point) model precision τ (following [20] for the latter). We generated simple data from the function y = 2x + 8 + ε with known noise ε ∼ N(0, 1) (i.e. corrupting the observations with noise with a fixed standard deviation 1), creating datasets increasing in size ranging from 10 data points (example in figure 1e) up to 10,000 data points (example in figure 1f). Knowing the true amount of noise in our synthetic dataset, we can assess the quality of the uncertainties predicted by the model.
We used models with three hidden layers of size 1024 and ReLU non-linearities, and repeated each experiment three times, averaging the experiments' results. Figure 1a shows the epistemic uncertainty (in standard deviation) decreasing as the amount of data increases. This uncertainty was computed by generating multiple function draws and evaluating the functions over a test set generated from the same data distribution. Figure 1b shows the aleatoric uncertainty tending towards 1 as the data increases—showing that the model obtains an increasingly improved estimate of the model precision as more data is given. Finally, figure 1c shows the predictive uncertainty obtained by combining the variances of both plots above.
This uncertainty seems to converge to a constant value as the epistemic uncertainty decreases and the estimation of the aleatoric uncertainty improves.
Lastly, the optimised dropout probabilities corresponding to the various dataset sizes are given in figure 1d. As can be seen, the optimal dropout probability in each layer decreases as more data is observed, starting from near 0.5 probabilities in all layers with the smallest dataset, and converging to values ranging between 0.2 and 0.4 when 10,000 data points are given to the model. More interestingly, the optimal dropout probability for the input layer is constant at near-zero, which is often observed with hand-tuned dropout probabilities as well.

Figure 1: Different uncertainties (epistemic, aleatoric, and predictive, in std) as the number of data points increases, as well as optimised dropout probabilities and example synthetic datasets. Panels: (a) epistemic, (b) aleatoric, and (c) predictive uncertainty; (d) optimised dropout probability values (per layer, first layer in blue); (e) example dataset with 10 data points; (f) example dataset with 10,000 data points.

Figure 2: Test negative log likelihood. The lower the better. Best viewed in colour.

Figure 3: Test RMSE. The lower the better. Best viewed in colour.

4.2 UCI

We next assess the performance of our technique in a regression setting using the popular UCI benchmark [26]. All experiments were performed using a fully connected neural network (NN) with 2 hidden layers, 50 units each, following the experiment setup of [13]. We compare against a two layer Bayesian NN approximated by standard dropout [9] and a Deep Gaussian Process of depth 2 [4]. Test negative log likelihood for 4 datasets is reported in figure 2, with test error reported in figure 3. Full results as well as experiment setup are given in appendix D.
Figure 4 shows posterior dropout probabilities across different cross validation splits.
Intriguingly, the input layer's dropout probability (p) always decreases to essentially zero. This is a recurring pattern we observed with all UCI datasets experiments, and is further discussed in the next section.

Figure 4: Converged dropout probabilities per layer, split and UCI dataset (best viewed on a computer screen).

4.3 MNIST

We further experimented with the standard classification benchmark MNIST [24]. Here we assess the accuracy of Concrete dropout, and study its behaviour in relation to the training set size and model size. We assessed a fully connected NN with 3 hidden layers and ReLU activations. All models were trained for 500 epochs (∼2·10⁵ iterations); each experiment was run three times using random initial settings in order to avoid reporting spurious results. Concrete dropout achieves MNIST accuracy of 98.6%, matching that of hand-tuned dropout.
Figure 5 shows a decrease in converged dropout probabilities as the size of data increases. Notice that while the dropout probabilities in the third hidden and output layers vary by a relatively small amount, they converge to zero in the first two layers. This happens despite the fact that the 2nd and 3rd hidden layers are of the same shape and prior length scale setting. Note how the optimal dropout probabilities are zero in the first layer, matching the previous results.

Figure 5: Converged dropout probabilities as function of training set size (3x512 MLP).

Figure 6: Converged dropout probabilities as function of number of hidden units.

Figure 7: Example output from our semantic segmentation model (a large computer vision model): (a) input image, (b) semantic segmentation, (c) epistemic uncertainty.
However, observe that the model only becomes confident about the optimal input transformation (dropout probabilities are set to zero) after seeing a relatively large number of examples in comparison to the model size (explaining the results in §4.1 where the dropout probabilities of the first layer did not collapse to zero). This implies that removing dropout a priori might lead to suboptimal results if the training set is not sufficiently informative, and it is best to allow the probability to adapt to the data.
Figure 6 provides further insights by comparing the 3x512 MLP model examined above (orange) to other architectures. As can be seen, the dropout probabilities in the first layer stay close to zero, but the others steadily increase with the model size as the epistemic uncertainty increases. Further results are given in appendix D.1.

4.4 Computer vision

In computer vision, dropout is typically applied to the final dense layers as a regulariser, because the top layers of the model contain the majority of the model's parameters [32]. For encoder-decoder semantic segmentation models, such as Bayesian SegNet, [21] found through grid-search that the best performing model used dropout over the middle layers (central encoder and decoder units) as they contain the most parameters. However, the vast majority of computer vision models leave the dropout probability fixed at p = 0.5, because it is prohibitively expensive to optimise manually – with a few notable exceptions which required considerable computing resources [15, 33].
We demonstrate Concrete dropout's efficacy by applying it to the DenseNet model [17] for semantic segmentation (example input, output, and uncertainty map are given in Figure 7). We use the same training scheme and hyper-parameters as the original authors [17].
We use Concrete dropout weight regulariser 10⁻⁸ (derived from the prior length-scale) and dropout regulariser 0.01 × N × H × W, where N is the training dataset size, and H × W is the number of pixels in the image. This is because the loss is pixel-wise, with the random image crops used as model input. The original model uses a hand-tuned dropout p = 0.2. Table 1 shows that replacing dropout with Concrete dropout marginally improves performance.

DenseNet Model Variant              MC Sampling   IoU
No Dropout                          -             65.8
Dropout (manually-tuned p = 0.2)    No            67.1
Dropout (manually-tuned p = 0.2)    Yes           67.2
Concrete Dropout                    No            67.2
Concrete Dropout                    Yes           67.4

Table 1: Comparing the performance of Concrete dropout against baseline models with DenseNet [17] on the CamVid road scene semantic segmentation dataset.

Table 2: Calibration plot. Concrete dropout reduces the uncertainty calibration RMSE compared to the baselines.

Concrete dropout is tolerant to initialisation values. Figure 8 shows that for a range of initialisation choices in p = [0.05, 0.5] we converge to a similar optimum. Interestingly, we observe that Concrete dropout learns a different pattern to manual dropout tuning results [21]. The second and last layers have larger dropout probability, while the first and middle layers are largely deterministic.
Concrete dropout improves calibration of uncertainty obtained from the models. Figure 2 shows calibration plots of a Concrete dropout model against the baselines. This compares the model's predicted uncertainty against the accuracy frequencies, where a perfectly calibrated model corresponds to the line y = x.

Figure 8: Learned Concrete dropout probabilities for the first, second, middle and last two layers (L = 0, 1, n/2, n−1, n) in a semantic segmentation model.
p converges to the same minima for a range of initialisations from p = [0.05, 0.5].

The Concrete dropout layer requires negligible additional compute compared with standard dropout layers with our implementation. However, using conventional dropout requires considerable resources to manually tune dropout probabilities. Typically, computer vision models consist of 10M+ parameters, and take multiple days to train on a modern GPU. Using Concrete dropout can decrease the time of model training by weeks by automatically learning the dropout probabilities.

4.5 Model-based reinforcement learning

Existing RL research using dropout uncertainty would hold the dropout probability fixed, or decrease it following a schedule [9, 10, 18]. This gives a proxy to the epistemic uncertainty, but raises other difficulties such as planning the dropout schedule. This can also lead to under-exploitation of the environment as was reported in [9] with Thompson sampling. To avoid this under-exploitation, Gal et al. [10] for example performed a grid-search to find p that trades off exploration and exploitation over the acquisition of multiple episodes at once.
We repeated the experiment setup of [10], where an agent attempts to balance a pendulum hanging from a cart by applying force to the cart. [10] used a fixed dropout probability of 0.1 in the dynamics model. Instead, we use Concrete dropout with the dynamics model, and are able to match their cumulative reward (16.5 with 25 time steps). Concrete dropout allows the dropout probability to adapt as more data is collected, instead of being set once and held fixed. Figures 9a–9c show the optimised dropout probabilities per layer vs. the number of episodes (acquired data), as well as the fixed probabilities in the original setup. Concrete dropout automatically decreases the dropout probability as more data is observed.
Figures 9d–9g show the dynamics model's epistemic uncertainty for each of the four state components in the system: [x, ẋ, θ, θ̇] (cart location, velocity, pendulum angle, and angular velocity). This uncertainty was calculated on a validation set split from the total data after each episode. Note how with Concrete dropout the epistemic uncertainty decreases over time as more data is observed.

Figure 9: Concrete dropout in model-based RL. Left three plots (L = 0, 1, 2): dropout probabilities for the 3 layers of the dynamics model as a function of the number of episodes (amount of data) observed by the agent (Concrete dropout in blue, baseline in orange). Right four plots: epistemic uncertainty over the dynamics model output for the four state components: [x, ẋ, θ, θ̇]. Best viewed on a computer screen.

5 Conclusions and Insights

In this paper we introduced Concrete dropout—a principled extension of dropout which allows for the dropout probabilities to be tuned. We demonstrated improved calibration and uncertainty estimates, as well as reduced experimentation cycle time. Two interesting insights arise from this work. First, common practice in the field, where a small dropout probability is often used with the shallow layers of a model, seems to be supported by dropout's variational interpretation. This can be seen as evidence towards the variational explanation of dropout. Secondly, an open question arising from previous research was whether dropout works well because it forces the weights to be near zero with fixed p. Here we showed that allowing p to adapt gives comparable performance to optimal fixed p.
Allowing p to change does not force the weight magnitudes to be near zero, suggesting that the hypothesis that dropout works because p is fixed is false.

References
[1] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
[2] Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems, pages 3084–3092, 2013.
[3] Matthew J. Beal and Zoubin Ghahramani. The variational Bayesian EM algorithm for incomplete data: With application to scoring graphical model structures. Bayesian Statistics, 2003.
[4] Thang D. Bui, José Miguel Hernández-Lobato, Daniel Hernández-Lobato, Yingzhen Li, and Richard E. Turner. Deep Gaussian processes for regression using approximate expectation propagation. In Proceedings of the 33rd International Conference on Machine Learning (ICML'16), pages 1472–1481, 2016.
[5] François Chollet. Keras, 2015. URL https://github.com/fchollet/keras. GitHub repository.
[6] Michael C. Fu. Chapter 19: Gradient estimation. In Shane G. Henderson and Barry L. Nelson, editors, Simulation, volume 13 of Handbooks in Operations Research and Management Science, pages 575–616. Elsevier, 2006.
[7] Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
[8] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In NIPS, 2016.
[9] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
[10] Yarin Gal, Rowan McAllister, and Carl E. Rasmussen. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML, April 2016.
[11] Paul Glasserman.
Monte Carlo methods in financial engineering, volume 53. Springer Science & Business Media, 2013.
[12] Peter W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
[13] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML, 2015.
[14] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[15] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
[16] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In Bayesian Deep Learning workshop, NIPS, 2016.
[17] Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, and Yoshua Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. arXiv preprint arXiv:1611.09326, 2016.
[18] Gregory Kahn, Adam Villaflor, Vitchyr Pong, Pieter Abbeel, and Sergey Levine. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182, 2017.
[19] Michael Kampffmeyer, Arnt-Børre Salberg, and Robert Jenssen. Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2016.
[20] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.
[21] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding.
arXiv preprint arXiv:1511.02680, 2015.
[22] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[23] Diederik P. Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In NIPS. Curran Associates, Inc., 2015.
[24] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits. 1998. URL http://yann.lecun.com/exdb/mnist/.
[25] Yingzhen Li and Yarin Gal. Dropout inference in Bayesian neural networks with alpha-divergences. arXiv preprint arXiv:1703.02914, 2017.
[26] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
[27] Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. In Bayesian Deep Learning workshop, NIPS, 2016.
[28] Dmitry Molchanov, Arseniy Ashuha, and Dmitry Vetrov. Dropout-based automatic relevance determination. In Bayesian Deep Learning workshop, NIPS, 2016.
[29] NHTSA. PE 16-007. Technical report, U.S. Department of Transportation, National Highway Traffic Safety Administration, Jan 2017. Tesla Crash Preliminary Evaluation Report.
[30] John Paisley, David Blei, and Michael Jordan. Variational Bayesian inference with stochastic search. In ICML, 2012.
[31] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[34] Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1971–1979, 2014.
[35] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.