NeurIPS 2019
Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
Reviewer 1
The primary originality of this paper derives from dealing with the active-learning regime in which there is little or no data. This is an extremely important problem for ML, especially as ML is applied to more real-world domains where data is minimal and collection is expensive; the problem is therefore highly significant. I discuss the significance of their approach to the problem below. Related to this first point, the authors do a fantastic job of situating themselves in the wider active-learning literature, highlighting where their “ice-start” problem sits and specifying how it differs from alternative active-learning scenarios. They also motivate all of this with real-world examples, which I found particularly helpful and which aided clarity.

The model itself is by no means novel and, in that sense, not very significant. The potential significance derives from their inference algorithm, whereby amortized inference is used for the local latent variables and sampling methods are used for the model parameters. The authors make some claims about the significance of this contribution - i.e. it allows the model to scale with dataset size while retaining accurate posterior estimates for the weights.

The discussion of the acquisition functions is extremely clear and well written, but the selection method for imputation is not particularly novel as an algorithm - although it may be in the context of imputation (please see the related-work point in the improvements section). The selection method for target prediction is far more interesting and potentially significant. Again, I am unsure whether the proposed conditional MI has been suggested before (see the related-work point under improvements), so it is impossible to fully determine the significance of this acquisition function. Nevertheless, a very thorough analysis is given of this acquisition function in terms of exploration (selecting data to observe new features) and exploitation (selecting data to make accurate predictions). This was well received, although the final amalgamation of the two different acquisition functions seems rushed - an additional (or rewritten) sentence motivating this decision would be warranted.

The claimed significance of your work is that you can deal with the ice-start problem in active learning. To my understanding, your approach to solving this issue is to a) deal with partial observability - however, this is an issue for any element-wise acquisition scheme - and b) deal with uncertainty - but this is a problem for any active-learning system, irrespective of whether there is little or no data. Therefore, it is unclear to me whether your approach adds anything to the ice-start problem per se, as opposed to element-wise active learning in general.

Second, I would have liked to see some further discussion of the “novelty” of your inference algorithm. I understand some of your stated motivations - i.e. amortization allows latent inference to scale, while ensuring posterior estimates of the parameters remain accurate. But why do you prioritize accurate posterior estimates of the parameters? One possibility, which you state, is that the SG-HMC algorithm retains the same computational cost as training a standard VAE. This is obviously important. But some discussion of the importance of maintaining accurate parameter estimates specifically for active learning would have been warranted, i.e. do you expect this to aid the acquisition functions (whose goal is to reduce posterior entropy over the parameters)?
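For concreteness, and in my own notation rather than necessarily the paper's: the BALD-style expected information gain usually used for this kind of element-wise selection, with $x_{id}$ a candidate unobserved element of row $i$, $x_{i,o}$ the elements already observed, and $\mathcal{D}$ the data acquired so far, is

\[
\alpha(x_{id} \mid x_{i,o}) \;=\; \mathbb{I}\big[x_{id};\, \theta \mid x_{i,o}, \mathcal{D}\big]
\;=\; \mathrm{H}\big[\,p(x_{id} \mid x_{i,o}, \mathcal{D})\,\big] \;-\; \mathbb{E}_{p(\theta \mid \mathcal{D})}\Big[\mathrm{H}\big[\,p(x_{id} \mid x_{i,o}, \theta)\,\big]\Big].
\]

If your first acquisition function is of this form, the dependence on the parameter posterior is explicit, and a sentence along these lines would directly motivate keeping sampling-based inference over the weights.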
The method for embedding (to transform partial observations) is not entirely clear. This could be because I am unfamiliar with the relevant literature. It is not a huge issue, as you do not claim it to be a novel contribution, but an additional sentence would have helped me understand the algorithm as a whole.

Second, I find step (i) on line 163 ambiguous. What is x_io when conditioning the posterior over the latent variables? I am fairly sure I know what this is, but it took some thinking and going back through the paper. Please clarify this.

Regarding the equation following line 136 (this equation is not numbered - I feel it should be): why is the complexity term (KL[q(z|x)||p(z)]) inside the expectation under the posterior estimate of the parameters? The parameters theta only feature in the accuracy term. Is there a reason for this? (I sketch the decomposition I have in mind at the end of this review.)

I think the related-work section needs significant rewriting. There is no mention of previous work on BALD or similar acquisition functions, which are the same as your first acquisition function. Moreover, there is no discussion of your proposed novel contributions. In my opinion, this section should be reworked from the ground up, as I found it mostly uninformative.

A very small point: there is a grammar error on line 77.
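On the unnumbered equation: assuming (again, in my own notation) that the bound has the standard form

\[
\mathcal{L} \;=\; \mathbb{E}_{q(\theta \mid X)}\Big[\, \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z, \theta)\big] \;-\; \mathrm{KL}\big[q(z \mid x)\,\|\,p(z)\big] \,\Big],
\]

then, since the complexity term does not depend on theta, it can be pulled outside the outer expectation without changing the value of the bound:

\[
\mathcal{L} \;=\; \mathbb{E}_{q(\theta \mid X)}\,\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z, \theta)\big] \;-\; \mathrm{KL}\big[q(z \mid x)\,\|\,p(z)\big].
\]

If the two forms are indeed equivalent in your setting, stating this (or simply writing the KL outside the expectation) would avoid confusion.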
Reviewer 2
This paper addresses the ice-start problem, where the cost of collecting training data and the cost of collecting element-wise features are both considered. The authors propose a Bayesian deep latent Gaussian model (BELGAM) that quantifies uncertainty over the weights, and they derive a novel inference method. Experiments are performed on two tasks, imputation and active prediction, to demonstrate the effectiveness of the proposed model.

This paper is not clearly written and could be better structured. It aims to solve two test tasks, (1) imputation and (2) active prediction; however, these two tasks are not clearly defined. I have been working on active-learning problems, but I still found it difficult to follow this paper. I would suggest the authors focus on addressing one of the two tasks. For example, add a new problem-definition section that precisely defines the objective of the task (moving some details from the Appendix), and then present how the proposed model can be used to solve this problem well. Some acronyms are used before they are defined in the paper; for example, what is NLL? What is the AUIC curve?
Reviewer 3
The introduction is very well written, and the method and problem setting are well motivated. The methods are not all original, but they are an interesting use case/application of existing methods and a useful contribution.

In Section 2, the authors present the BELGAM model but should cite similar models (i.e. BNNs in the context of a VAE) that have been used for LVMs in the past (for example, Johnson et al., 2016). Section 2 introduces a lot of notation, which could be better summarized in a table. I couldn’t quite follow the explanation of the PVAE; is S_i a matrix or a vector? How do you compute KL(q(theta|X)||p(theta))? Since q(theta|X) is only sampled via a Markov chain, how do you compute its entropy and therefore take its gradient? (I spell this concern out at the end of this review.)

The experimental validation is quite extensive and is a strength of the paper.

Johnson, Matthew, et al. "Composing graphical models with neural networks for structured representations and fast inference." Advances in Neural Information Processing Systems, 2016.
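To spell out the KL question above (my own notation, under the assumption that q(theta|X) is defined only implicitly through the sampler): writing

\[
\mathrm{KL}\big[q(\theta \mid X)\,\|\,p(\theta)\big]
\;=\; \underbrace{\mathbb{E}_{q(\theta \mid X)}\big[\log q(\theta \mid X)\big]}_{-\,\mathrm{H}[q(\theta \mid X)]}
\;-\; \mathbb{E}_{q(\theta \mid X)}\big[\log p(\theta)\big],
\]

the second term is straightforward to estimate from SG-HMC samples, but the first (negative entropy) requires evaluating the density q(theta|X), which is not available when q is represented only by samples from a Markov chain. It would help to state explicitly how (or whether) this term and its gradient are handled.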