
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The proposed algorithm uses the KL distance between the true and the proposed posterior as the objective function. To compute the KL distance it uses stochastic gradient Langevin dynamics to effienctly sample from the posterior and to optimize it it uses stochastic gradient descent.
The writing is quite clear. It is a simple but effective approach and the examples provided show its benefits quite convincingly.
Q2: Please summarize your review in 12 sentences
The paper proposes a stochastic gradient algorithm to train a parametric model of the posterior predictive distribution of a Bayesian model.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
the main idea of the manuscript is, in a regression context, to approximate the predictive distribution with a simple tractable model in order to avoid the high cost required to approximate the true predictive distribution using Monte Carlo samples. The authors discuss why they want to use a Bayesian approach (and their claims are somewhat supported by their simulations) in contrast with a plug in approach. The approach developed comes with no surprise, but is natural and sensible : minimize an averaged KL divergence between the true predictive and the surrogate parametric model using the output of SGLD algorithm. Performance of the method is illustrated through simulations on some examples and, as for any manuscript, outperforms competitors. There are numerous parameters to choose in the algorithm, which is not discussed.
Q2: Please summarize your review in 12 sentences
Nothing surprising in the manuscript. It is an idea worth exploring.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper uses an online approximation to MCMC to draw parameters for a Bayesian neural network. The predictive distribution under these samples is then fitted using stochastic approximation. The comparisons are to recent work on approximate Bayesian inference applied to the same models and example problems. The claim being that it's a better method (at least in some ways). The paper does not yet present demonstrate that these methods will push forward any particular application.
The paper is a fairly natural extension of existing work. It is tothepoint, done well, presented clearly, and would be easy for others to try, so could have considerable impact.
Some indication of training times would be useful. Presumably the training time is much slower than the other methods, as the abstract makes a point specifically about test time. However, this issue is simply dodged.
I think it should be made clear in the paper that stochastic gradient Langevin dynamics only draws approximate samples from the posterior. Not just in the usual MCMC sense of requiring burnin, but it doesn't even respect the posterior locally when exploring some mode for a long time. While there are various developments (cited), none of these offer the same sort of results as traditional batch MCMC on small problems. It's true that on large problems there isn't a better option, but I think it's important not to overpromise. The eagleeyed reader will notice the posterior is wrong in Figure 2, but it's never really pointed out.
I would remove the statement that the priors are equivalent to L_2 regularization. That's only if doing MAP estimation, which this work is emphatically not! It seems a shame to use simple spherical Gaussian priors, given all the work on Bayesian neural networks in the 1990s. It's hard to take these priors too seriously, especially in a very large and deep network.
In an engineering sense, the overall procedure is useful. Although future comparisons would be required to see if it really beats standard regularization, early stopping, dropout, and various other hacks to avoid overfitting. In applications where the predictive distributions are the goal of inference, more work is also required to make approximate Bayesian inference at this scale trustworthy. However, this is an interesting step in the process.
Minor:
I personally don't think the title really captures the contribution. It may sound cool, but there isn't any discussion of eeking out "hidden" knowledge in the paper.
I don't think the fitting algorithm is really plain SGD. The subsequent theta samples in the final line are dependent, so presumably some stochastic approximation argument needs to be made.
$5e6$ is ugly typesetting. I suggest $5\!\times\!10^{6}$, and $1e5$ could just be $10^{5}$.
References: The NIPS styleguide says the references should use the numerical "unsrt" style. The references could have more complete details. Some proper nouns need capitalizing. Some authors only have initials, while most names are listed in full.
Q2: Please summarize your review in 12 sentences
This paper is a natural extension of Snelson and Ghahrahmani's "Compact approximations to Bayesian predictive distributions" (ICML 2005), using neural networks and online methods for the MCMC and gradientbased fitting. It's well done, bringing existing neat ideas up to date.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Paper Title: Bayesian Dark Knowledge
Paper Summary: This paper presents a method for approximately learning a Bayesian neural network model while avoiding major storage costs accumulated during training and computational costs during prediction. Typically, in Bayesian models, samples are generated, and a sample approximation to the posterior predictive distribution is formed. However, this requires storing many copies of the parameters of a model, which may require a great deal of storage/memory in highparameter neural network models. Additionally, prediction on test data requires evaluation of all sampled models, which may be computationally costly. This paper aims to learn a model that makes (approximate) posteriorpredictivebased predictions on test data without storing many copies of the parameter set. The method presented here accomplishes this by training a "student" model to approximate a Bayesian "teacher's" predictionsa procedure in the neural networks literature referred to as "distillation" (and also referred to in previous papers as "model compression"). However, the student and teacher are trained simultaneously in an online manner (so no large collection of samples are required to be stored, even during training). Experiments are shown on synthetic data (a toy 2D binary classification problem and a toy 1D regression problem), on classification for the MNIST dataset, and on regression for a housing dataset.
Comments:
 I feel that working to develop computational tools for practical Bayesian inference in neural networks is an important direction of research (in particular, tools that mitigate the issue of large storage/memory requirements when sampling in largeparameters Bayesian models), and I feel that this paper is making good steps in this direction.
 One issue I have with the novelty of this paper is that the fundamental concept being developed here (student/teacher learning, or model distillation/compression) is not new  the authors apply this existing idea in a new domain. However, I do feel the main novelty (and strength) of this paper is the clever way in which the algorithm carries out simultaneous online training of the teacher and student without requiring storage of samples. This is an online distillation/compression method in which the teacher is never "fully trained" (the teacher never actually provides full posteriorpredictive "labels" for the student) and yet the student is still able to learn the "full trained" teacher's model.
 One potential computational issue with the presented algorithm is that the student must be trained (to mimic the teacher), on the training dataset D', at every iteration during the MCMC algorithm. I believe this requires a great deal more training of the student than in typical distillation methods (which do not need to make many passes through the training dataset D'). This might be particularly problematic if D' is required to be large in order to achieve good performance. In general, I feel that there is not enough discussion about the training set D' (whether particular choices about D' affect the method's performance, and the particular details used in the experiments presented in this paper).
 I feel that the experiments in this paper were not totally polished and thorough. The authors write that they could not enable a proper comparison of all methods on all datasets because (in part) "the open source code for the EP approach only supports regression" and "we did not get access to the code for the VB approach in time for us to compare to it". I appreciate the honesty here! However, it would be nice to complete these goals and polish the experiments section in this paper.
Q2: Please summarize your review in 12 sentences
I feel that the goal of this paperto develop methods for Bayesian neural networks without the need to store and evaluate many copies of the modelis important, and that the presented method is a clever way of approaching this goal (though it is, in part, an application of existing methods). However, the paper could do a good deal more to polish the experiments section.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their helpful comments.
Our responses to the main questions raised in the review:
1. "No
mention of training times" (Assigned_Reviewer_2)
 The time for
one iteration of our algorithm is approximately equal to the time for 1
iteration of SGLD on the teacher plus 1 iteration of SGD on the student.
For the MNIST experiment with 400 Hidden Units/layer, our implementation
took 1.3 secs for SGD on the teacher, 1.6 secs for SGLD on the teacher,
and 3 secs for online distillation, i.e. SGLD on the teacher + SGD on the
student (times are given in secs/1000 iterations).
2. "...it should
be made clear that SGLD only draws approximate samples... "
(Assigned_Reviewer_2)
 Yes, we will make this clear. But please
note that, if we want exact posterior samples, we can use a traditional
MCMC method such as HMC for inference in the teacher network.
3.
"The subsequent theta samples in the final line are dependent, so some
stochastic approximation argument needs to be made."
(Assigned_Reviewer_2) "I think SGLD theta samples should be thinned as
there will be high autocorrelation" (Assigned_Reviewer_6)
 Yes,
high autocorrelation between subsequent samples may in theory slow down
the convergence of SGD on the student. We want the SGLD chain to mix
faster than the SGD chain converges, which we achieve using a high step
size for SGLD and annealing the step size quickly for SGD. Another way to
break correlations between subsequent samples is to generate a pool of
samples using SGLD ahead of time, and then randomly select from this pool
in each iteration of SGD. This might be not be possible in practice with
large networks because of storage costs, in which case we can maintain a
smaller pool of samples using reservoir sampling. Other options are
thinning or getting samples from multiple parallel SGLD chains.
4.
"there is not enough discussion about the training set D'"
(Assigned_Reviewer_3)
 The purpose of the dataset D' is to provide
data outside the training set where we want the student to mimic the
teacher well. For our toy classification example, D' is a grid because we
wanted to visualize the predictive probability. For real data, D' can be
restricted to data that we expect to see in practice. For example, for
image data, we can use 1) perturbed versions of the training data e.g. by
adding Gaussian noise or by elastic distortions, 2) unlabeled images
and/or 3) adversarial examples.
5. "...requires a great deal more
training of the student than in typical distillation methods (which do not
need to make many passes through the training dataset D')..."
(Assigned_Reviewer_3)
 Note that D` can even be an infinite
dataset. We use only a minibatch of data to make each update. Just as SGD
can learn from seeing each data point only a few times, our student
network can learn to approximate the teacher without too many passes, and
usually converges in just a little more time than it takes for SGLD to
converge.
6. "I feel that the experiments in this paper were not
totally polished and thorough" (Assigned_Reviewer_3)
 We just
received code from Blundell et al. and are working on finishing the
comparisons with their method. We are also working with astrophysicists to
build predictive models for the outcomes of expensive experiments at the
Large Hadron Collider. This is a real world example where modeling
uncertainty is important. If the paper is accepted, we will add these
results to the camera ready version.
7. "The advantage of
distilled SGLD over SGLD is not well supported in the experiments."
(Assigned_Reviewer_5)
 In terms of accuracy of predictions and
uncertainty estimates, distillation has no advantage over SGLD. But SGLD
and other MCMC methods have to store many posterior samples which is
impractical both in terms of memory/storage costs and test time
computational costs. The goal of distilled SGLD is to avoid these
computational problems, not to improve accuracy.
8. "I do not like
the model fit in line 190... you are losing correlations between pairs
(x,w) and (x',w')..." (Assigned_Reviewer_6)
 We are sorry but we
did not quite understand this remark. Please note that we are only
interested in the predictive distribution p(yx,Data), and not in the
posterior distribution over the weights of the neural
network.

