{"title": "Bayesian dark knowledge", "book": "Advances in Neural Information Processing Systems", "page_first": 3438, "page_last": 3446, "abstract": "We consider the problem of Bayesian parameter estimation for deep neural networks, which is important in problem settings where we may have little data, and/or where we need accurate posterior predictive densities p(y|x, D), e.g., for applications involving bandits or active learning. One simple approach to this is to use online Monte Carlo methods, such as SGLD (stochastic gradient Langevin dynamics). Unfortunately, such a method needs to store many copies of the parameters (which wastes memory), and needs to make predictions using many versions of the model (which wastes time). We describe a method for \u201cdistilling\u201d a Monte Carlo approximation to the posterior predictive density into a more compact form, namely a single deep neural network. We compare to two very recent approaches to Bayesian neural networks, namely an approach based on expectation propagation [HLA15] and an approach based on variational Bayes [BCKW15]. Our method performs better than both of these, is much simpler to implement, and uses less computation at test time.", "full_text": "Bayesian Dark Knowledge\n\nAnoop Korattikara, Vivek Rathod, Kevin Murphy\n{kbanoop, rathodv, kpmurphy}@google.com\n\nGoogle Research\n\nMax Welling\n\nUniversity of Amsterdam\nm.welling@uva.nl\n\nAbstract\n\nWe consider the problem of Bayesian parameter estimation for deep neural networks, which is important in problem settings where we may have little data, and/or where we need accurate posterior predictive densities p(y|x, D), e.g., for applications involving bandits or active learning. One simple approach to this is to use online Monte Carlo methods, such as SGLD (stochastic gradient Langevin dynamics). 
Unfortunately, such a method needs to store many copies of the parameters (which wastes memory), and needs to make predictions using many versions of the model (which wastes time). We describe a method for \u201cdistilling\u201d a Monte Carlo approximation to the posterior predictive density into a more compact form, namely a single deep neural network. We compare to two very recent approaches to Bayesian neural networks, namely an approach based on expectation propagation [HLA15] and an approach based on variational Bayes [BCKW15]. Our method performs better than both of these, is much simpler to implement, and uses less computation at test time.\n\n1 Introduction\n\nDeep neural networks (DNNs) have recently been achieving state of the art results in many fields. However, their predictions are often overconfident, which is a problem in applications such as active learning, reinforcement learning (including bandits), and classifier fusion, which all rely on good estimates of uncertainty.\nA principled way to tackle this problem is to use Bayesian inference. Specifically, we first compute the posterior distribution over the model parameters, p(\u03b8|D_N) \u221d p(\u03b8) \u220f_{i=1}^N p(y_i|x_i, \u03b8), where D_N = {(x_i, y_i)}_{i=1}^N, x_i \u2208 X^D is the i'th input (where D is the number of features), and y_i \u2208 Y is the i'th output. Then we compute the posterior predictive distribution, p(y|x, D_N) = \u222b p(y|x, \u03b8) p(\u03b8|D_N) d\u03b8, for each test point x.\nFor reasons of computational speed, it is common to approximate the posterior distribution by a point estimate such as the MAP estimate, \u02c6\u03b8 = argmax p(\u03b8|D_N). When N is large, we often use stochastic gradient descent (SGD) to compute \u02c6\u03b8. Finally, we make a plug-in approximation to the predictive distribution: p(y|x, D_N) \u2248 p(y|x, \u02c6\u03b8). 
Unfortunately, this loses most of the benefits of the Bayesian approach, since uncertainty in the parameters (which induces uncertainty in the predictions) is ignored.\nVarious ways of more accurately approximating p(\u03b8|D_N) (and hence p(y|x, D_N)) have been developed. Recently, [HLA15] proposed a method called \u201cprobabilistic backpropagation\u201d (PBP) based on an online version of expectation propagation (EP) (i.e., using repeated assumed density filtering (ADF)), where the posterior is approximated as a product of univariate Gaussians, one per parameter: p(\u03b8|D_N) \u2248 q(\u03b8) \u225c \u220f_i N(\u03b8_i|m_i, v_i).\nAn alternative to EP is variational Bayes (VB), where we optimize a lower bound on the marginal likelihood. [Gra11] presented a (biased) Monte Carlo estimate of this lower bound and applied his method, called \u201cvariational inference\u201d (VI), to infer the neural network weights. More recently, [BCKW15] proposed an approach called \u201cBayes by Backprop\u201d (BBB), which extends the VI method with an unbiased MC estimate of the lower bound based on the \u201creparameterization trick\u201d of [KW14, RMW14]. 
In both [Gra11] and [BCKW15], the posterior is approximated by a product of univariate Gaussians.\nAlthough EP and VB scale well with data size (since they use online learning), there are several problems with these methods: (1) they can give poor approximations when the posterior p(\u03b8|D_N) does not factorize, or if it has multi-modality or skew; (2) at test time, computing the predictive density p(y|x, D_N) can be much slower than using the plug-in approximation, because of the need to integrate out the parameters; (3) they need to use double the memory of a standard plug-in method (to store the mean and variance of each parameter), which can be problematic in memory-limited settings such as mobile phones; (4) they can be quite complicated to derive and implement.\nA common alternative to EP and VB is to use MCMC methods to approximate p(\u03b8|D_N). Traditional MCMC methods are batch algorithms that scale poorly with dataset size. However, recently a method called stochastic gradient Langevin dynamics (SGLD) [WT11] has been devised that can draw samples approximately from the posterior in an online fashion, just as SGD updates a point estimate of the parameters online. Furthermore, various extensions of SGLD have been proposed, including stochastic gradient hybrid Monte Carlo (SGHMC) [CFG14], stochastic gradient Nos\u00e9-Hoover Thermostat (SG-NHT) [DFB+14] (which improves upon SGHMC), stochastic gradient Fisher scoring (SGFS) [AKW12] (which uses second order information), stochastic gradient Riemannian Langevin dynamics [PT13], distributed SGLD [ASW14], etc. However, in this paper, we will just use \u201cvanilla\u201d SGLD [WT11].\u00b9\nAll these MCMC methods (whether batch or online) produce a Monte Carlo approximation to the posterior, q(\u03b8) = (1/S) \u2211_{s=1}^S \u03b4(\u03b8 \u2212 \u03b8_s), where S is the number of samples. Such an approximation can be more accurate than that produced by EP or VB, and the method is much easier to implement (for SGLD, you essentially just add Gaussian noise to your SGD updates). However, at test time, things are S times slower than using a plug-in estimate, since we need to compute q(y|x) = (1/S) \u2211_{s=1}^S p(y|x, \u03b8_s), and the memory requirements are S times bigger, since we need to store the \u03b8_s. (For our largest experiment, our DNN has 500k parameters, so we can only afford to store a single sample.)\nIn this paper, we propose to train a parametric model S(y|x, w) to approximate the Monte Carlo posterior predictive distribution q(y|x), in order to gain the benefits of the Bayesian approach while only using the same run time cost as the plugin method. Following [HVD14], we call q(y|x) the \u201cteacher\u201d and S(y|x, w) the \u201cstudent\u201d. We use SGLD\u00b2 to estimate q(\u03b8) and hence q(y|x) online; we simultaneously train the student online to minimize KL(q(y|x)||S(y|x, w)). We give the details in Section 2.\nSimilar ideas have been proposed in the past. In particular, [SG05] also trained a parametric student model to approximate a Monte Carlo teacher. However, they used batch training and they used mixture models for the student. By contrast, we use online training (and can thus handle larger datasets), and use deep neural networks for the student.\n[HVD14] also trained a student neural network to emulate the predictions of a (larger) teacher network (a process they call \u201cdistillation\u201d), extending earlier work of [BCNM06] which approximated an ensemble of classifiers by a single one. The key difference from our work is that our teacher is generated using MCMC, and our goal is not just to improve classification accuracy, but also to get reliable probabilistic predictions, especially away from the training data. 
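The two ingredients just described, the SGLD update (SGD plus Gaussian noise) and the Monte Carlo predictive average q(y|x) = (1/S) \u2211_s p(y|x, \u03b8_s), can be sketched in a few lines. The following is a minimal NumPy illustration on a toy 1d Gaussian-mean model of our own choosing (not the paper's DNN setting); the step size, prior precision and data are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model's log posterior: infer the mean of a 1d
# Gaussian with known unit variance. (Model choice is illustrative only.)
data = rng.normal(2.0, 1.0, size=1000)
N, M = len(data), 100                        # dataset size, minibatch size

def grad_log_post(theta, batch):
    # gradient of a log N(theta|0, 10^2) prior plus the minibatch
    # likelihood rescaled by N/M, as in SGLD
    return -theta / 100.0 + (N / M) * np.sum(batch - theta)

theta, eta = 0.0, 1e-4
samples = []
for t in range(2000):
    batch = data[rng.integers(0, N, size=M)]
    noise = rng.normal(0.0, np.sqrt(eta))    # the extra noise that turns SGD into SGLD
    theta = theta + 0.5 * eta * grad_log_post(theta, batch) + noise
    if t >= 1000:
        samples.append(theta)                # keep post-burn-in posterior samples

# Predictions would average p(y|x, theta_s) over these samples;
# here we just inspect the posterior mean of the parameter.
posterior_mean = float(np.mean(samples))
```

The samples concentrate around the data mean (2.0), and averaging predictions over them plays the role of q(y|x) above.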
[HVD14] coined the term \u201cdark knowledge\u201d to represent the information which is \u201chidden\u201d inside the teacher network, and which can then be distilled into the student. We therefore call our approach \u201cBayesian dark knowledge\u201d.\n\n\u00b9 We did some preliminary experiments with SG-NHT for fitting an MLP to MNIST data, but the results were not much better than SGLD.\n\u00b2 Note that SGLD is an approximate sampling algorithm and introduces a slight bias in the predictions of the teacher and student network. If required, we can replace SGLD with an exact MCMC method (e.g. HMC) to get more accurate results at the expense of more training time.\n\nIn summary, our contributions are as follows. First, we show how to combine online MCMC methods with model distillation in order to get a simple, scalable approach to Bayesian inference of the parameters of neural networks (and other kinds of models). Second, we show that our probabilistic predictions lead to improved log likelihood scores on the test set compared to SGD and the recently proposed EP and VB approaches.\n\n2 Methods\n\nOur goal is to train a student neural network (SNN) to approximate the Bayesian predictive distribution of the teacher, which is a Monte Carlo ensemble of teacher neural networks (TNN).\nIf we denote the predictions of the teacher by p(y|x, D_N) and the parameters of the student network by w, our objective becomes\n\nL(w|x) = KL(p(y|x, D_N) || S(y|x, w)) = \u2212E_{p(y|x,D_N)} log S(y|x, w) + const\n= \u2212\u222b [\u222b p(y|x, \u03b8) p(\u03b8|D_N) d\u03b8] log S(y|x, w) dy\n= \u2212\u222b p(\u03b8|D_N) \u222b p(y|x, \u03b8) log S(y|x, w) dy d\u03b8\n= \u2212\u222b p(\u03b8|D_N) [E_{p(y|x,\u03b8)} log S(y|x, w)] d\u03b8   (1)\n\nUnfortunately, computing this integral is not analytically tractable. 
However, we can approximate this by Monte Carlo:\n\n\u02c6L(w|x) = \u2212(1/|\u0398|) \u2211_{\u03b8_s \u2208 \u0398} E_{p(y|x,\u03b8_s)} log S(y|x, w)   (2)\n\nwhere \u0398 is a set of samples from p(\u03b8|D_N).\nTo make this a function just of w, we need to integrate out x. For this, we need a dataset to train the student network on, which we will denote by D\u2032. Note that points in this dataset do not need ground truth labels; instead the labels (which will be probability distributions) will be provided by the teacher. The choice of student data controls the domain over which the student will make accurate predictions. For low dimensional problems (such as in Section 3.1), we can uniformly sample the input domain. For higher dimensional problems, we can sample \u201cnear\u201d the training data, for example by perturbing the inputs slightly. In any case, we will compute a Monte Carlo approximation to the loss as follows:\n\n\u02c6L(w) = \u222b p(x) L(w|x) dx \u2248 (1/|D\u2032|) \u2211_{x\u2032 \u2208 D\u2032} L(w|x\u2032) \u2248 \u2212(1/|\u0398|)(1/|D\u2032|) \u2211_{\u03b8_s \u2208 \u0398} \u2211_{x\u2032 \u2208 D\u2032} E_{p(y|x\u2032,\u03b8_s)} log S(y|x\u2032, w)   (3)\n\nIt can take a lot of memory to pre-compute and store the set of parameter samples \u0398 and the set of data samples D\u2032, so in practice we use the stochastic algorithm shown in Algorithm 1, which uses a single posterior sample \u03b8_s and a minibatch of x\u2032 at each step.\nThe hyper-parameters \u03bb and \u03b3 from Algorithm 1 control the strength of the priors for the teacher and student networks. We use simple spherical Gaussian priors (equivalent to L2 regularization); we set the precision (strength) of these Gaussian priors by cross-validation. Typically \u03bb \u226b \u03b3, since the student gets to \u201csee\u201d more data than the teacher. 
This is true for two reasons: first, the teacher is trained to predict a single label per input, whereas the student is trained to predict a distribution, which contains more information (as argued in [HVD14]); second, the teacher makes multiple passes over the same training data, whereas the student sees \u201cfresh\u201d randomly generated data D\u2032 at each step.\n\nAlgorithm 1: Distilled SGLD\nInput: D_N = {(x_i, y_i)}_{i=1}^N, minibatch size M, number of iterations T, teacher learning schedule \u03b7_t, student learning schedule \u03c1_t, teacher prior \u03bb, student prior \u03b3\nfor t = 1 : T do\n// Train teacher (SGLD step)\nSample minibatch indices S \u2282 [1, N] of size M\nSample z_t \u223c N(0, \u03b7_t I)\nUpdate \u03b8_{t+1} := \u03b8_t + (\u03b7_t/2)(\u2207_\u03b8 log p(\u03b8|\u03bb) + (N/M) \u2211_{i \u2208 S} \u2207_\u03b8 log p(y_i|x_i, \u03b8)) + z_t\n// Train student (SGD step)\nSample D\u2032 of size M from student data generator\nw_{t+1} := w_t \u2212 \u03c1_t ((1/M) \u2211_{x\u2032 \u2208 D\u2032} \u2207_w \u02c6L(w, \u03b8_{t+1}|x\u2032) + \u03b3 w_t)\n\n2.1 Classification\n\nFor classification problems, each teacher network \u03b8_s models the observations using a standard softmax model, p(y = k|x, \u03b8_s). We want to approximate this using a student network, which also has a softmax output, S(y = k|x, w). Hence from Eqn. 2, our loss function estimate is the standard cross entropy loss:\n\n\u02c6L(w|\u03b8_s, x) = \u2212\u2211_{k=1}^K p(y = k|x, \u03b8_s) log S(y = k|x, w)   (4)\n\nThe student network outputs \u03b2_k(x, w) = log S(y = k|x, w). To estimate the gradient w.r.t. w, we just have to compute the gradients w.r.t. \u03b2 and back-propagate through the network. 
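As a self-contained sketch of this loss and its gradient (NumPy with made-up numbers; this is our illustration, not the paper's Torch code): the gradient with respect to the log-probabilities \u03b2 is simply \u2212p, and chaining it through a log-softmax parameterization yields softmax(logits) \u2212 p with respect to the raw logits, which we verify by finite differences.

```python
import numpy as np

def log_softmax(z):
    # numerically stable log of the softmax output, beta_k = log S(y=k|x,w)
    z = z - z.max()
    return z - np.log(np.sum(np.exp(z)))

def distill_loss(logits, p_teacher):
    # cross entropy of Eqn. 4: -sum_k p(y=k|x,theta_s) * log S(y=k|x,w)
    return -np.sum(p_teacher * log_softmax(logits))

def distill_grad(logits, p_teacher):
    # gradient w.r.t. the log-probabilities beta is -p_teacher; chaining
    # through log-softmax gives softmax(logits) - p_teacher w.r.t. the
    # raw logits (valid because p_teacher sums to 1)
    return np.exp(log_softmax(logits)) - p_teacher

logits = np.array([0.5, -1.0, 2.0])      # student logits (made up)
p = np.array([0.2, 0.3, 0.5])            # teacher's softmax output (made up)
g = distill_grad(logits, p)

# finite-difference check of the analytic gradient
eps = 1e-6
g_num = np.array([
    (distill_loss(logits + eps * np.eye(3)[k], p) -
     distill_loss(logits - eps * np.eye(3)[k], p)) / (2 * eps)
    for k in range(3)
])
```

The two gradient computations agree to numerical precision, which is exactly the check one would do before wiring this loss into a training loop.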
These gradients are given by \u2202\u02c6L(w, \u03b8_s|x)/\u2202\u03b2_k(x, w) = \u2212p(y = k|x, \u03b8_s).\n\n2.2 Regression\n\nIn regression, the observations are modeled as p(y_i|x_i, \u03b8) = N(y_i|f(x_i|\u03b8), \u03bb_n^{-1}), where f(x|\u03b8) is the prediction of the TNN and \u03bb_n is the noise precision. We want to approximate the predictive distribution as p(y|x, D_N) \u2248 S(y|x, w) = N(y|\u00b5(x, w), e^{\u03b1(x,w)}). We will train a student network to output the parameters of the approximating distribution, \u00b5(x, w) and \u03b1(x, w); note that this is twice the number of outputs of the teacher network, since we want to capture the (data dependent) variance.\u00b3 We use e^{\u03b1(x,w)} instead of directly predicting the variance \u03c3\u00b2(x|w) to avoid dealing with positivity constraints during training.\nTo train the SNN, we will minimize the objective defined in Eqn. 2:\n\n\u02c6L(w|\u03b8_s, x) = \u2212E_{p(y|x,\u03b8_s)} log N(y|\u00b5(x, w), e^{\u03b1(x,w)})\n= (1/2) E_{p(y|x,\u03b8_s)} [\u03b1(x, w) + e^{\u2212\u03b1(x,w)} (y \u2212 \u00b5(x, w))\u00b2]\n= (1/2) [\u03b1(x, w) + e^{\u2212\u03b1(x,w)} {(f(x|\u03b8_s) \u2212 \u00b5(x, w))\u00b2 + 1/\u03bb_n}]\n\nNow, to estimate \u2207_w \u02c6L(w, \u03b8_s|x), we just have to compute \u2202\u02c6L/\u2202\u00b5(x, w) and \u2202\u02c6L/\u2202\u03b1(x, w), and back propagate through the network. These gradients are:\n\n\u2202\u02c6L(w, \u03b8_s|x)/\u2202\u00b5(x, w) = e^{\u2212\u03b1(x,w)} {\u00b5(x, w) \u2212 f(x|\u03b8_s)}   (5)\n\u2202\u02c6L(w, \u03b8_s|x)/\u2202\u03b1(x, w) = (1/2) [1 \u2212 e^{\u2212\u03b1(x,w)} {(f(x|\u03b8_s) \u2212 \u00b5(x, w))\u00b2 + 1/\u03bb_n}]   (6)\n\n\u00b3 This is not necessary in the classification case, since the softmax distribution already captures uncertainty.\n\n3 Experimental results\n\nIn this section, we compare SGLD and distilled SGLD with other approximate inference methods, including the plugin approximation using SGD, the PBP approach of [HLA15], the BBB approach of [BCKW15], and Hamiltonian Monte Carlo (HMC) [Nea11], which is considered the \u201cgold standard\u201d for MCMC for neural nets. We implemented SGD and SGLD using the Torch library (torch.ch). For HMC, we used Stan (mc-stan.org). We perform this comparison for various classification and regression problems, as summarized in Table 1.\u2074\n\nTable 1: Summary of our experimental configurations.\nDataset        | N   | D   | Y            | PBP | BBB | HMC\nToyClass       | 20  | 2   | {0, 1}       | N   | N   | Y\nMNIST          | 60k | 784 | {0, ..., 9}  | N   | Y   | N\nToyReg         | 10  | 1   | R            | Y   | N   | Y\nBoston Housing | 506 | 13  | R            | Y   | N   | N\n\nFigure 1: Posterior predictive density for various methods on the toy 2d dataset. (a) SGD (plugin) using the 2-10-2 network. (b) HMC using 20k samples. (c) SGLD using 1k samples. (d-f) Distilled SGLD using a student network with the following architectures: 2-10-2, 2-100-2 and 2-10-10-2.\n\n3.1 Toy 2d classification problem\n\nWe start with a toy 2d binary classification problem, in order to visually illustrate the performance of different methods. We generate a synthetic dataset in 2 dimensions with 2 classes, 10 points per class. 
We then fit a multi layer perceptron (MLP) with one hidden layer of 10 ReLU units and 2 softmax outputs (denoted 2-10-2) using SGD. The resulting predictions are shown in Figure 1(a). We see the expected sigmoidal probability ramp orthogonal to the linear decision boundary. Unfortunately, this method predicts a label of 0 or 1 with very high confidence, even for points that are far from the training data (e.g., in the top left and bottom right corners).\nIn Figure 1(b), we show the result of HMC using 20k samples. This is the \u201ctrue\u201d posterior predictive density which we wish to approximate. In Figure 1(c), we show the result of SGLD using about 1000 samples. Specifically, we generate 100k samples, discard the first 2k for burnin, and then keep every 100th sample. We see that this is a good approximation to the HMC distribution.\nIn Figures 1(d-f), we show the results of approximating the SGLD Monte Carlo predictive distribution with a single student MLP of various sizes. To train this student network, we sampled points at random from the domain of the input, [\u221210, 10] \u00d7 [\u221210, 10]; this encourages the student to predict accurately at all locations, including those far from the training data. In (d), the student has the same size as the teacher (2-10-2), but this is too simple a model to capture the complexity of the predictive distribution (which is an average over models). In (e), the student has a larger hidden layer (2-100-2); this works better. However, we get best results using a two hidden layer model (2-10-10-2), as shown in (f).\nIn Table 2, we show the KL divergence between the HMC distribution (which we consider as ground truth) and the various approximations mentioned above. We computed this by comparing the probability distributions pointwise on a 2d grid. The numbers match the qualitative results shown in Figure 1.\n\n\u2074 Ideally, we would apply all methods to all datasets, to enable a proper comparison. Unfortunately, this was not possible, for various reasons. First, the open source code for the EP approach only supports regression, so we could not evaluate this on classification problems. Second, we were not able to run the BBB code, so we just quote performance numbers from their paper [BCKW15]. Third, HMC is too slow to run on large problems, so we just applied it to the small \u201ctoy\u201d problems. Nevertheless, our experiments show that our methods compare favorably to these other methods.\n\nTable 2: KL divergence on the 2d classification dataset.\nModel               | Num. params | KL\nSGD                 | 40          | 0.246\nSGLD                | 40k         | 0.007\nDistilled 2-10-2    | 40          | 0.031\nDistilled 2-100-2   | 400         | 0.014\nDistilled 2-10-10-2 | 140         | 0.009\n\n3.2 MNIST classification\n\nNow we consider the MNIST digit classification problem, which has N = 60k examples, 10 classes, and D = 784 features. The only preprocessing we do is divide the pixel values by 126 (as in [BCKW15]). We train only on 50K datapoints and use the remaining 10K for tuning hyper-parameters. This means our results are not strictly comparable to a lot of published work, which uses the whole dataset for training; however, the difference is likely to be small.\n\nTable 3: Test set misclassification rate on MNIST for different methods using a 784-400-400-10 MLP. SGD (first column), Dropout and BBB numbers are quoted from [BCKW15]. For our implementation of SGD (fourth column), SGLD and distilled SGLD, we report the mean misclassification rate over 10 runs and its standard error.\nSGD [BCKW15] | Dropout | BBB  | SGD (our impl.) | SGLD           | Dist. SGLD\n1.83         | 1.51    | 1.82 | 1.536 \u00b1 0.0120 | 1.271 \u00b1 0.0126 | 1.307 \u00b1 0.0169\n\nFollowing [BCKW15], we use an MLP with 2 hidden layers with 400 hidden units per layer, ReLU activations, and softmax outputs; we denote this by 784-400-400-10. 
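For concreteness, here is a minimal NumPy sketch of a forward pass through this 784-400-400-10 architecture (the random initialization and batch size are our own illustrative choices). Counting weights and biases gives 478,410 parameters, i.e. roughly the 500k quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 400, 400, 10]                      # the 784-400-400-10 MLP
Ws = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = np.maximum(h @ W + b, 0.0)           # ReLU hidden layers
    z = h @ Ws[-1] + bs[-1]
    z = z - z.max(axis=-1, keepdims=True)        # numerically stable softmax
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

n_params = sum(W.size for W in Ws) + sum(b.size for b in bs)
probs = forward(rng.random((5, 784)))            # a batch of 5 fake images
```

Each output row is a distribution over the 10 digit classes, which is what both the teacher samples and the distilled student produce.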
This model has 500k parameters.\nWe first fit this model by SGD, using these hyper parameters: fixed learning rate of \u03b7_t = 5 \u00d7 10^{-6}, prior precision \u03bb = 1, minibatch size M = 100, number of iterations T = 1M. As shown in Table 3, our final error rate on the test set is 1.536%, which is a bit lower than the SGD number reported in [BCKW15], perhaps due to the slightly different training/validation configuration.\nNext we fit this model by SGLD, using these hyper parameters: fixed learning rate of \u03b7_t = 4 \u00d7 10^{-6}, thinning interval \u03c4 = 100, burn in iterations B = 1000, prior precision \u03bb = 1, minibatch size M = 100. As shown in Table 3, our final error rate on the test set is about 1.271%, which is better than the SGD, dropout and BBB results from [BCKW15].\u2075\nFinally, we consider using distillation, where the teacher is an SGLD MC approximation of the posterior predictive. We use the same 784-400-400-10 architecture for the student as well as the teacher. We generate data for the student by adding Gaussian noise (with standard deviation of 0.001) to randomly sampled training points.\u2076 We use a constant learning rate of \u03c1 = 0.005, a batch size of M = 100, a prior precision of 0.001 (for the student) and train for T = 1M iterations. We obtain a test error of 1.307%, which is very close to that obtained with SGLD (see Table 3).\n\n\u2075 We only show the BBB results with the same Gaussian prior that we use. Performance of BBB can be improved using other priors, such as a scale mixture of Gaussians, as shown in [BCKW15]. Our approach could probably also benefit from such a prior, but we did not try this.\n\u2076 In the future, we would like to consider more sophisticated data perturbations, such as elastic distortions.\n\nSGD | SGLD | Distilled SGLD\n-0.0613 \u00b1 0.0002 | -0.0419 \u00b1 0.0002 | -0.0502 \u00b1 0.0007\n\nTable 4: Log likelihood per test example on MNIST. 
We report the mean over 10 trials \u00b1 one standard error.\n\nMethod | Avg. test log likelihood\nPBP (as reported in [HLA15]) | -2.574 \u00b1 0.089\nVI (as reported in [HLA15]) | -2.903 \u00b1 0.071\nSGD | -2.7639 \u00b1 0.1527\nSGLD | -2.306 \u00b1 0.1205\nSGLD distilled | -2.350 \u00b1 0.0762\n\nTable 5: Log likelihood per test example on the Boston housing dataset. We report the mean over 20 trials \u00b1 one standard error.\n\nWe also report the average test log-likelihood of SGD, SGLD and distilled SGLD in Table 4. The log-likelihood is equivalent to the logarithmic scoring rule [Bic07] used in assessing the calibration of probabilistic models. The logarithmic rule is a strictly proper scoring rule, meaning that the score is uniquely maximized by predicting the true probabilities. From Table 4, we see that both SGLD and distilled SGLD achieve higher scores than SGD, and therefore produce better calibrated predictions.\nNote that the SGLD results were obtained by averaging predictions from \u2248 10,000 models sampled from the posterior, whereas distillation produces a single neural network that approximates the average prediction of these models, i.e. distillation reduces both storage and test time costs of SGLD by a factor of 10,000, without sacrificing much accuracy. In terms of training time, SGD took 1.3 ms, SGLD took 1.6 ms and distilled SGLD took 3.2 ms per iteration. In terms of memory, distilled SGLD requires only twice as much as SGD or SGLD during training, and the same as SGD during testing.\n\n3.3 Toy 1d regression\n\nWe start with a toy 1d regression problem, in order to visually illustrate the performance of different methods. We use the same data and model as [HLA15]. In particular, we use N = 20 points in D = 1 dimensions, sampled from the function y = x\u00b3 + \u03b5_n, where \u03b5_n \u223c N(0, 9). We fit this data with an MLP with 10 hidden units and ReLU activations. 
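The data-generating process just described can be sketched as follows (note that N(0, 9) means variance 9, i.e. standard deviation 3; the sampling interval for x is our own assumption, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20
x = rng.uniform(-4.0, 4.0, size=N)          # N = 20 points in D = 1 dimension
y = x**3 + rng.normal(0.0, 3.0, size=N)     # y = x^3 + eps_n, eps_n ~ N(0, 9)
```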
For SGLD, we use S = 2000 samples. For distillation, the teacher uses the same architecture as the student.\nThe results are shown in Figure 2. We see that SGLD is a better approximation to the \u201ctrue\u201d (HMC) posterior predictive density than the plugin SGD approximation (which has no predictive uncertainty), and the VI approximation of [Gra11]. Finally, we see that distilling SGLD incurs little loss in accuracy, but saves a lot of computation.\n\n3.4 Boston housing\n\nFinally, we consider a larger regression problem, namely the Boston housing dataset, which was also used in [HLA15]. This has N = 506 data points (456 training, 50 testing), with D = 13 dimensions. Since this data set is so small, we repeated all experiments 20 times, using different train/test splits.\nFollowing [HLA15], we use an MLP with 1 layer of 50 hidden units and ReLU activations. First we use SGD, with these hyper parameters\u2077: minibatch size M = 1, noise precision \u03bb_n = 1.25, prior precision \u03bb = 1, number of trials 20, constant learning rate \u03b7_t = 1e-6, number of iterations T = 170K. As shown in Table 5, we get an average log likelihood of -2.7639.\nNext we fit the model using SGLD. We use an initial learning rate of \u03b7_0 = 1e-5, which we reduce by a factor of 0.5 every 80K iterations; we use 500K iterations, a burnin of 10K, and a thinning\n\n\u2077 We choose all hyper-parameters using cross-validation, whereas [HLA15] performs posterior inference on the noise and prior precisions, and uses Bayesian optimization to choose the remaining hyper-parameters.\n\nFigure 2: Predictive distribution for different methods on a toy 1d regression problem. (a) PBP of [HLA15]. (b) HMC. (c) VI method of [Gra11]. (d) SGD. (e) SGLD. (f) Distilled SGLD. Error bars denote 3 standard deviations. (Figures a-d kindly provided by the authors of [HLA15]. 
We replace their term \u201cBP\u201d (backprop) with \u201cSGD\u201d to avoid confusion.)\n\ninterval of 10. As shown in Table 5, we get an average log likelihood of -2.306, which is better than SGD.\nFinally, we distill our SGLD model. The student architecture is the same as the teacher. We use the following teacher hyper parameters: prior precision \u03bb = 2.5; initial learning rate of \u03b7_0 = 1e-5, which we reduce by a factor of 0.5 every 80K iterations. For the student, we use generated training data (Gaussian noise with standard deviation 0.05 added to the training inputs), a prior precision of \u03b3 = 0.001, and an initial learning rate of \u03c1_0 = 1e-2, which we reduce by a factor of 0.8 every 5e3 iterations. As shown in Table 5, we get an average log likelihood of -2.350, which is only slightly worse than SGLD, and much better than SGD. Furthermore, both SGLD and distilled SGLD are better than the PBP method of [HLA15] and the VI method of [Gra11].\n\n4 Conclusions and future work\n\nWe have shown a very simple method for \u201cbeing Bayesian\u201d about neural networks (and other kinds of models), that seems to work better than recently proposed alternatives based on EP [HLA15] and VB [Gra11, BCKW15].\nThere are various things we would like to do in the future: (1) Show the utility of our model in an end-to-end task, where predictive uncertainty is useful (such as with contextual bandits or active learning). (2) Consider ways to reduce the variance of the algorithm, perhaps by keeping a running minibatch of parameters uniformly sampled from the posterior, which can be done online using reservoir sampling. 
(3) Exploring more intelligent data generation methods for training the student. (4) Investigating if our method is able to reduce the prevalence of confident false predictions on adversarially generated examples, such as those discussed in [SZS+14].\n\nAcknowledgements\n\nWe thank Jos\u00e9 Miguel Hern\u00e1ndez-Lobato, Julien Cornebise, Jonathan Huang, George Papandreou, Sergio Guadarrama and Nick Johnston.\n\nReferences\n[AKW12] S. Ahn, A. Korattikara, and M. Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In ICML, 2012.\n[ASW14] Sungjin Ahn, Babak Shahbaba, and Max Welling. Distributed stochastic gradient MCMC. In ICML, 2014.\n[BCKW15] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In ICML, 2015.\n[BCNM06] Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, 2006.\n[Bic07] J. Eric Bickel. Some comparisons among quadratic, spherical, and logarithmic scoring rules. Decision Analysis, 4(2):49\u201365, 2007.\n[CFG14] Tianqi Chen, Emily B. Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In ICML, 2014.\n[DFB+14] N. Ding, Y. Fang, R. Babbush, C. Chen, R. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In NIPS, 2014.\n[GG15] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv, 2015.\n[Gra11] Alex Graves. Practical variational inference for neural networks. In NIPS, 2011.\n[HLA15] J. Hern\u00e1ndez-Lobato and R. Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In ICML, 2015.\n[HVD14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2014.\n[KW14] Diederik P. Kingma and Max Welling. Stochastic gradient VB and the variational auto-encoder. In ICLR, 2014.\n[Nea11] Radford Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo. Chapman and Hall, 2011.\n[PT13] Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In NIPS, 2013.\n[RBK+14] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv, 2014.\n[RMW14] D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.\n[SG05] Edward Snelson and Zoubin Ghahramani. Compact approximations to Bayesian predictive distributions. In ICML, 2005.\n[SZS+14] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014.\n[WT11] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.\n", "award": [], "sourceid": 1897, "authors": [{"given_name": "Anoop", "family_name": "Korattikara Balan", "institution": "Google"}, {"given_name": "Vivek", "family_name": "Rathod", "institution": "Google"}, {"given_name": "Kevin", "family_name": "Murphy", "institution": "Google"}, {"given_name": "Max", "family_name": "Welling", "institution": null}]}