Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
POST-REBUTTAL UPDATE: I have read the rebuttal and other reviews; my score remains unchanged. ======================================================= Originality: The novelty of each proposed trick is limited. However, the combination is novel, well-motivated and supported by the experiments. Quality: The paper is technically sound. Clarity: The paper is very well written and easy to follow. The authors clearly pose the problems, provide well-motivated solutions for each of them, and then provide extensive experiments including an ablation study. Significance: The authors test the proposed tricks on a broad set of tasks and outline several other fields that could also benefit. I expect these findings to have a large impact in practical applications. My main concern is the significance of the three posed problems: underestimation of the variance, trivialization of the variance (independence from x), and poor estimation for out-of-domain data (the variance does not increase outside the data support). While all of these problems intuitively seem to be common, it would be nice to see a more thorough investigation. Right now these problems are illustrated using toy data. The improvement in the ablation study can serve as an indirect indication of these problems; however, a direct study of them on several different tasks could provide better insight.
This is overall a well-written paper that highlights weaknesses in a widely used basic building block of probabilistic models with neural networks: estimating the variance of some (Gaussian) observable. The paper empirically shows that current approaches clearly underestimate variance even in very simple, low-dimensional cases. It also shows that this is true for approaches that otherwise have been shown to improve the “bayesianess” of neural network predictions. I think it is a strength of the paper that the authors concentrate on simple, low-dimensional problems to analyze the problem. Even though this issue has occasionally been mentioned before, I think this is highly original work that focuses on an otherwise insufficiently discussed problem. The authors suggest three distinct approaches to tackle the problem and demonstrate that each of them provides improvements over the current state of affairs, at least in the investigated low- to medium-dimensional cases. Each of these approaches is well motivated and described in sufficient detail (especially when considering the supplement). Unfortunately, I find the most effective of the three proposed improvements a bit unsatisfactory because, as the authors point out, it requires significant additional compute resources and introduces complexity into the training. It is also doubtful whether this approach works with high-dimensional input data, because it requires finding ‘nearest neighbours’. This aspect was unfortunately not investigated in the submission. In general, I think this work can be influential and can encourage further research into this particular aspect of neural network modeling and training.
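The high-dimensional ‘nearest neighbours’ concern raised in this review can be illustrated with a small sketch. It is not taken from the submission: the function names, the uniform data distribution, and the sample sizes are all assumptions chosen for illustration. It measures how the ratio of nearest to farthest Euclidean distance from a query point moves towards 1 as the input dimension grows, which is one reason neighbour selection may become less informative.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Return (nearest, farthest) Euclidean distance from one query point
    to n_points points drawn uniformly from the unit hypercube.
    The uniform distribution is an assumption made for illustration."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists), max(dists)


def contrast_ratio(dim, n_points=200, seed=0):
    """Nearest / farthest distance; values close to 1 mean that
    'near' and 'far' points are hard to tell apart."""
    nearest, farthest = distance_contrast(dim, n_points, seed)
    return nearest / farthest


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, round(contrast_ratio(dim), 3))
```

Whether this effect matters for the submission's method would depend on the intrinsic dimension of real data, which this toy setup does not capture.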
Post-Rebuttal Feedback: Thank you for your feedback. I think this is a good paper to appear in NeurIPS. ####################### Uncertainty estimation has always been an important problem. This paper tackles uncertainty prediction by directly predicting the marginal means and variances. To ensure the reliability of its uncertainty estimates, the paper presents a series of interesting techniques for training the prediction network, including locality-aware mini-batching, mean-variance split training and variance networks. With all these techniques adopted, the paper demonstrates convincing empirical results on uncertainty estimation.
Weaknesses: I am surprised by the amazing empirical performance and the simplicity of the method. However, many of the proposed techniques in the paper are not well justified.
1, Overfitting is still a potential issue. Because the network is simply trained via MLE, it is possible that the network fits all training points perfectly and predicts zero variance by setting gamma -> infinity, alpha -> 0 and beta -> infinity. The toy experiment in the paper has densely distributed training points, so the proposed method performs well. I would like to see another experiment with sparsely distributed training points, which would show more clearly whether overfitting occurs.
2, Scaling to high-dimensional datasets. The proposed locality-aware mini-batching relies on a reliable distance measure, and the paper uses Euclidean distance. For high-dimensional datasets, however, the Euclidean distance can hardly be trusted. The choice of distance for the inducing points in the variance network has the same issue. However, I feel this is not fatal, as RBF is known to work well, and other metric learning methods can be applied.
3, Computational complexity. Although the locality-aware mini-batching supports stochastic training, searching for the k nearest neighbours takes O(k N^2) computational cost, which might not be feasible for large datasets. In particular, if one learns a metric (see Q3) during training, the search needs to be repeated, which makes it infeasible.
4, I am not convinced by the local-likelihood analysis for locality-aware mini-batching. Because locality-aware mini-batching is also an unbiased estimate of the total sum of log-likelihoods, at the end of the day it should converge to the same point as standard mini-batching (of course there might be optimization issues).
5, I think Eq. 5 contains a typo; it should be sigma^2 (1 - v) + eta * v.
Strengths: I have covered these in the contribution part.
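The overfitting concern in point 1 can be made concrete with a minimal sketch. It assumes a plain Gaussian likelihood with an explicit variance rather than the paper's gamma, alpha, beta parametrization, and the target value and residual are hypothetical. When the predicted mean matches a training target exactly, the negative log-likelihood keeps decreasing as the variance shrinks, so maximum likelihood has no finite optimum; with a nonzero residual the optimum sits at a variance equal to the squared residual.

```python
import math


def gaussian_nll(y, mean, variance):
    """Negative log-likelihood of y under N(mean, variance)."""
    return 0.5 * (math.log(2.0 * math.pi * variance) + (y - mean) ** 2 / variance)


y = 1.3  # hypothetical training target
variances = (1.0, 1e-2, 1e-4, 1e-8)

# Case 1: the mean interpolates the target exactly, so the quadratic
# term vanishes and the NLL decreases without bound as variance -> 0.
perfect_fit = [gaussian_nll(y, y, v) for v in variances]

# Case 2: a residual of 0.1 remains (hypothetical value); the NLL is
# then minimised near variance = residual**2 = 1e-2 and grows below it.
residual = 0.1
imperfect_fit = [gaussian_nll(y, y - residual, v) for v in variances]

if __name__ == "__main__":
    for v, a, b in zip(variances, perfect_fit, imperfect_fit):
        print(f"variance={v:g}  perfect={a:.3f}  imperfect={b:.3f}")
```

A sparse training set makes the first case easier to reach, which is why an experiment with sparsely distributed points would be informative.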