Paper ID: | 219 |
---|---|
Title: | DISCO Nets : DISsimilarity COefficients Networks |

The paper presents a method for efficiently sampling from the posterior of a distribution parameterized by a neural network. The authors propose a way to create a proxy for the true data distribution, and a training strategy that minimizes the difference between the true and estimated distributions. They show results for hand pose estimation, where their network with posterior sampling works better than the one without. They also compare their results with some previous work.

I like the general idea of the paper, and the simple technique of adding noise in the first fully connected layer to get multiple outputs for the same input data point. Their training strategy also seems sound. My biggest criticism is the experiments section. In 4.4, when they claim to compare with the state of the art, why do they not compare with "Hand Pose Estimation through Semi-Supervised and Weakly-Supervised Learning" by Neverova et al., which (if I understand correctly) achieves much better results than Oberweger et al. (which they still do not outperform)?
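For concreteness, the noise-injection idea praised above can be sketched as follows. This is only an illustration under stated assumptions: the feature vector, layer sizes, and the helper names `dense` and `sample_outputs` are hypothetical, and a single linear layer stands in for the fully connected part of the network.

```python
import random

def dense(weights, bias, inputs):
    """Apply one fully connected layer; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def sample_outputs(features, weights, bias, noise_dim, k, rng):
    """Draw k outputs for one input by concatenating fresh uniform noise."""
    outputs = []
    for _ in range(k):
        z = [rng.random() for _ in range(noise_dim)]  # z ~ U[0, 1]
        outputs.append(dense(weights, bias, features + z))
    return outputs

rng = random.Random(0)
features = [0.5, -1.0, 2.0]   # hypothetical convolutional features
noise_dim, out_dim = 2, 4     # hypothetical sizes
weights = [[rng.uniform(-1, 1) for _ in range(len(features) + noise_dim)]
           for _ in range(out_dim)]
bias = [0.0] * out_dim

# Same input features, five different noise draws, five candidate outputs.
candidates = sample_outputs(features, weights, bias, noise_dim, k=5, rng=rng)
```

Each call with a new noise vector yields a different output for the same input, which is what makes sampling at test time cheap.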

2-Confident (read it all; understood it all reasonably well)

The authors created a new class of probabilistic models using deep neural networks, with two major techniques that differ from existing methods with similar motivations: a simple method of concatenating noise samples for prediction, and an objective function based on the diversity and dissimilarity coefficients. The framework can easily be applied to any network architecture with a task-specific loss function. The authors demonstrated the effectiveness of the method on the hand pose prediction task. DISCO Net outperformed the non-probabilistic variant of the same architecture by a large margin, and also outperformed the existing probabilistic method cGAN, which has additional disadvantages in practice compared to the DISCO Net method.

The authors presented good motivation for reducing the mismatch between training loss and evaluation loss in probabilistic modeling, and provided a reasonable survey of relevant prior work addressing the issue. The DISCO Net method that the authors proposed is supported by prior work on diversity and dissimilarity coefficients. The proposed sampling method is significantly simpler to use in practice and is applicable to more network architectures than existing methods in probabilistic modeling. This would allow more practitioners to test the method easily. Experimental results in the paper are convincing, though they would have been more powerful if the authors had been able to apply the method to the state-of-the-art work on the hand pose task, including the NYU-Prior and Feedback methods. The choice of MEU for pointwise prediction seems reasonable, but would benefit from more description of what other options exist and why the authors chose this one. It would also help if the authors could present the oracle performance of an optimal prediction selection for comparison.
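The comparison requested above, MEU-style selection versus an oracle selection, could look roughly like the sketch below. It assumes the candidate set is restricted to the drawn samples and that the task loss is Euclidean distance; both are simplifying assumptions, and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def meu_prediction(samples, loss=euclidean):
    """Pick the sample with the lowest mean loss against all samples.

    Assumption: candidates are the samples themselves.
    """
    def expected_loss(candidate):
        return sum(loss(candidate, s) for s in samples) / len(samples)
    return min(samples, key=expected_loss)

def oracle_prediction(samples, truth, loss=euclidean):
    """Pick the sample closest to the ground truth.

    This needs the true output, so it is a reference point only,
    not something usable at test time.
    """
    return min(samples, key=lambda s: loss(s, truth))

samples = [[0.0, 0.0], [1.0, 0.0], [0.9, 0.1], [5.0, 5.0]]
truth = [0.0, 0.1]

meu = meu_prediction(samples)               # [0.9, 0.1]
oracle = oracle_prediction(samples, truth)  # [0.0, 0.0]
```

The gap between the loss of `meu` and the loss of `oracle` indicates how much is lost by the selection rule rather than by the sampler.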

2-Confident (read it all; understood it all reasonably well)

This paper presents a method for training models in the setting where a given input may map to a distribution over outputs. For such problems one wants a model which represents uncertainty in the output for a given input, and which doesn't just collapse to a mode of the posterior predictive distribution. The proposed method minimizes an objective which encourages predictions to match some known true outputs while also remaining "diverse" -- i.e. the predicted outputs for a given input shouldn't be too "clumped together" according to some similarity measure.
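A minimal sketch of an objective of this shape, for a single training example, is given below. It assumes a Euclidean loss, K samples per input, and a weight `gamma` on the diversity term; the exact normalization used in the paper may differ, and the function name is hypothetical.

```python
import math

def delta(a, b):
    """Task loss between two outputs (Euclidean distance, an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def disco_style_loss(truth, samples, gamma):
    """Mean loss to the true output minus gamma times mean pairwise loss.

    Requires at least two samples. Normalization is illustrative only.
    """
    k = len(samples)
    data_term = sum(delta(truth, s) for s in samples) / k
    pair_sum = sum(delta(samples[i], samples[j])
                   for i in range(k) for j in range(k) if i != j)
    diversity_term = pair_sum / (k * (k - 1))
    return data_term - gamma * diversity_term

truth = [0.0, 0.0]
clumped = [[1.0, 0.0], [1.0, 0.0]]   # identical predictions
spread = [[1.0, 0.0], [-1.0, 0.0]]   # equally far from truth, but diverse

loss_clumped = disco_style_loss(truth, clumped, gamma=0.5)  # 1.0
loss_spread = disco_style_loss(truth, spread, gamma=0.5)    # 0.0
```

Both sample sets are equally far from the true output, yet the diverse set receives the lower loss, which is the "not too clumped together" behaviour described above.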

This paper introduces a method for solving a general class of structured prediction problems. The method trains a neural network to construct an output as a deterministic function of the real input and a sample from some noise source. Entropy in the noise source becomes entropy in the output distribution. Mismatch between the model distribution and true predictive distribution is measured using a strictly proper scoring rule, a la Gneiting and Raftery (JASA 2007). One thing that concerns me about the proposed approach is whether the "expected score" that's used for measuring dissimilarity between the model predictions and the true predictive distribution provides a strong learning signal. Especially in the minibatch setting, I'd be worried about variance in the gradient wiping out information about subtle mismatch between the model and true distributions. E.g., consider using this approach to train a generative model for natural images -- I doubt it would work very well. Matching anything beyond first or second-order statistics seems optimistic. Though, perhaps it's adequate in settings where structure in the true predictive distribution is limited. Even with the Deep MMD papers from Li et al. (ICML 2015) and Dziugaite et al. (UAI 2015), a rather large batch size was required for learning to succeed. It would be nice if the authors could provide experiments investigating whether the advantages of the stochastic DISCO Net versus its deterministic counterpart (as in Table 3) are due to the regularizing effect of increased entropy in the predictive distribution, or due to increased modelling capacity. Some experiments running a parameter sweep over gamma would be a good place to start. I'm not really familiar with work in hand pose estimation. For this dataset, is much of the difficulty related to overfitting? If not, I'd be curious to know why the convolutional portion of your architecture was so small (aside from following previous architectures). 
The authors should note that their structured prediction problem could also be addressed with stochastic variational inference, as in (Sohn et al., NIPS 2015), or reinforcement learning, as in (Maes et al., Machine Learning 2009). The choice of a cGAN baseline seems a bit odd, given the current challenges of training such models. Perhaps a variational approach to training the conditional/predictive distribution would provide a more competitive baseline. Evaluating both stochastic and deterministic versions of the base architecture using ProbLoss seems odd. A deterministic model will clearly do poorly on this metric, due to a negligible "diversity" term. It might be more interesting to examine a sort of CDF over the scores (using the standard metrics) for many samples from the stochastic model's predictive distribution. I.e. make many predictions for a given input, and then sort predictions by various scores and plot the sorted scores. If we truly expect multimodality and/or ambiguity in the correct output, this might better show whether the model is correctly capturing those modes. If the DISCO Net approach is being proposed as a general technique for structured prediction, rather than just for hand pose estimation, some examples on, e.g. MNIST digit completion should also be provided (see Sohn et al., NIPS 2015).
There are a few minor editing issues:
-- "is composed a convolutional part" (line 262)
-- inconsistent superscript (line 229)
-- "2 bidimensional Gaussian." (line 24)
-- "represent a mixture of Gaussian." (line 28)
-- misplaced parentheses (line 167)
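The sorted-score evaluation suggested above could be sketched as follows. The sample values are made up for illustration, Euclidean error stands in for "the standard metrics", and the function names are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sorted_sample_errors(samples, truth):
    """Score every sampled prediction against the truth and sort the scores."""
    return sorted(euclidean(s, truth) for s in samples)

def empirical_cdf(sorted_errors, threshold):
    """Fraction of samples whose error is at most the threshold."""
    return sum(e <= threshold for e in sorted_errors) / len(sorted_errors)

# Hypothetical samples from the model for one input, and its true output.
samples = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]
truth = [0.0, 0.0]

errors = sorted_sample_errors(samples, truth)  # [0.0, 1.0, 5.0, 10.0]
cdf_at_1 = empirical_cdf(errors, 1.0)          # 0.5
cdf_at_5 = empirical_cdf(errors, 5.0)          # 0.75
```

Plotting `errors` against rank, or `empirical_cdf` against the threshold, gives the curve described in the review. A deterministic model would collapse to a single point on it.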

2-Confident (read it all; understood it all reasonably well)

This paper studies an important problem: how to model uncertainty in output spaces. The problem comes up often in scenarios like temporal forecasting and when dealing with multi-modal output distributions. The paper proposes a scheme for sampling multiple solutions and then minimizes the expected task loss under the predicted and the ground truth distributions. The paper reports results on the task of hand pose estimation on the NYU Hand Pose Dataset, and shows that the ability to sample different outputs can be used to obtain good performance on a variety of metrics.

I found the problem studied by the paper important and the proposed solution novel. However, I have a few questions:
1. I am not sure if the unbiased estimate for DIV(P_T, P_G) makes intuitive sense. I understand that y' can be sampled from P_G by sampling different z vectors, but why can we only use one sample y from P_T to estimate DIV(P_T, P_G)? In the situation where we don't have access to any other sample y from P_T, should the estimate not be modified to use the minimum \Delta instead of the average \Delta between the ground truth sample and the samples from the model?
2. The experimental results in the paper only target the application where the output metrics may differ from those used for training. To better explain the difference caused by using different metrics during training and testing, can the authors report performance when the solution is selected according to one metric but evaluated using another?
3. I believe another major application of the proposed technique could be to sample different modes of prediction (say, confusion between front-facing and rear-facing hands). Does the current model produce reasonable predictions in such cases? The visualizations presented in the paper do not suggest this; can the authors be more explicit about it?
4. Minor typo: the second part of Eqn 5 is missing a sum over examples.
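To make question 1 concrete, the two candidate estimates can be compared on a toy example. This is a sketch only: the one-dimensional values are invented, absolute difference stands in for \Delta, and the function names are hypothetical.

```python
def delta(y, y_prime):
    """Stand-in for the task loss \\Delta (absolute difference, an assumption)."""
    return abs(y - y_prime)

def mean_delta(truth, samples):
    """Average loss between the single true sample y and model samples y'.

    With y ~ P_T and y' ~ P_G drawn independently, its expectation is
    E[Delta(y, y')] by linearity of expectation.
    """
    return sum(delta(truth, s) for s in samples) / len(samples)

def min_delta(truth, samples):
    """Loss of the closest model sample only.

    This can never exceed the mean variant, so in general it does not
    estimate the same quantity.
    """
    return min(delta(truth, s) for s in samples)

truth = 1.0
samples = [1.0, -1.0]   # hypothetical two-mode prediction

avg = mean_delta(truth, samples)     # 1.0
closest = min_delta(truth, samples)  # 0.0
```

The minimum rewards a model as soon as one sample hits the observed output, while the average also charges for the samples in the other mode. Which behaviour is preferable is exactly what the question asks.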

1-Less confident (might not have understood significant parts)

The authors propose a novel type of network that is trained to sample from a posterior distribution rather than only giving a single estimate. The authors then evaluate the network on a hand pose dataset.

What was unclear to me was whether the superior performance of the DISCOnet was primarily due to the probabilistic output or also due to the regularization effect of noise during training? Are DISCOnets just implementing a different type of dropout? What happens if the DISCO_{\beta,\gamma} net is evaluated after training without injected noise (with evaluation similar to the BASE model)?

1-Less confident (might not have understood significant parts)

This paper proposes a new way to introduce uncertainty estimation into deep models. It does so by optimising a different cost, the dissimilarity coefficient, which allows any arbitrary cost to be considered while still taking the uncertainty of the samples into account. The authors apply it to the NYU Hand Pose estimation dataset and compare to GAN and cGAN.

[EDIT] I'd like to thank the authors for their response, which addressed most of my concerns and made me more confident about the paper.
- The connection between using U[0,1] noise and performing inverse transform sampling makes total sense and is actually quite interesting; it would be really useful to state it in the paper.
- The loss restriction point was helpful as well, thanks. It is indeed general enough then, no issues.
- I completely agree with the assessment of the differences/difficulties in comparing DISCO Net and VAEs. A comparison would be great, but perhaps it would actually be more meaningful and easier to perform it on another dataset entirely, which may prove quite a lot of work just for this paper. I think mentioning the similarities/differences clearly but leaving it for further work would be fine.
[Original review] I'm not sure how I feel about this paper. I wasn't aware of the dissimilarity coefficient metrics, but they seem to be supersets of Jensen-Shannon divergences, so at least they are expressive enough. I'm not sure how useful they would be without actual proofs that P_G(y|x) converges to the true distribution. Hence most of the motivation for the paper is conditioned on the end of Section 3.2, "Proper scoring rule". Hence the distances that can be used are actually restricted, and they consider Euclidean distance only. I would have appreciated more space being spent on explaining exactly how well the distributions are captured. The authors just inject uniform noise, without any proof that this will be sufficient to capture the correct distribution. Variational Autoencoders (VAE) are mentioned in the introduction, but are not used for comparison, and this is quite lacking. Indeed, at least for VAEs the assumptions about the approximations to the posterior are known and well captured.
It would have been great to see how a VAE would perform (not really an autoencoder, but say a network which takes the depth map as input for the Encoder, and then decodes back to the poses, while still having a Gaussian prior over the hidden code). Indeed, I actually think that the particular architecture shown in Figure 2 is really just a slightly different way to represent a VAE (although using a different noise prior and a different optimisation function, which could be either good or bad depending on how much you want to keep the theoretical proofs of VAEs): The Encoder in a VAE usually outputs two estimates: a vector for the mean of the posterior, and another vector for the covariance of the Gaussian posterior (a diagonal covariance, actually, for efficiency reasons). When training, the reparametrization trick takes samples from a unit N(0,I) Gaussian, multiplies them by the posterior covariance estimate and adds the result to the mean. This is functionally equivalent to the architecture shown in Figure 2, if we consider the output of the convolutional layers of DISCO Nets as the mean of the posterior. The concatenated noise sample will be multiplied by a weight matrix, and then passed to the next layers alongside the mean (which will get its own weight matrix as well). Hence the input, as seen from the first dense layer of the decoder, is exactly the same as for a VAE: learnt_mean + sampled noise * learnt_variance. The difference, as said before, is in the form of the sampled noise, what the cost is, and how it's estimated. DISCO Nets assume that K samples are used in order to evaluate the cost. This is, again, similar to extensions of VAEs which use multiple samples in order to improve the variational lower bound (https://arxiv.org/abs/1509.00519). So, this is not really a problem, but more an observation, which would be interesting to discuss. The comparison with the BASE model is good, but I think \sigma should really have been optimized as a hyperparameter to make it fair.
At least I'd like to see how a BASE model with \sigma=0 performs on metrics that do not require multiple samples. Another aspect of my unease with trusting the overall model is the fact that no checks are done on other datasets which require modelling the full distribution. The authors could have used more standard datasets (usually MNIST or OMNIGLOT now) for that, or synthetic distributions as done in [http://arxiv.org/abs/1505.05770]. I commend the authors on their attempts to make GAN and cGAN work. They did seem to run into the usual stability issues with them, but as far as I can tell they tried their best and made a fair comparison. Unfortunately, I felt the authors didn't cover the other models as well in Section 4.4. This section lacks interpretation and discussion, and Figure 4 is neither presented nor commented on. Considering that DISCO Nets do not achieve SOTA compared to these, it would have required a bit more explanation. The rest of the paper is clear, the supplementary material covers additional training details, and all the information needed to reproduce their results is easy to find. In conclusion, I am happy to see another method to construct DNNs which capture uncertainty, but given this paper I am not certain how correct their uncertainty estimate really is.
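The algebra behind the VAE observation in this review can be checked with a small sketch: a dense layer applied to the concatenation of features and noise equals the sum of two separate linear maps, one on the features and one on the noise. All numbers and names below are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

h = [1.0, 2.0]   # hypothetical convolutional output (the "mean" in the analogy)
z = [0.5]        # hypothetical noise sample
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -4.0]]   # weights of the first dense layer on [h; z]
b = [0.0, 1.0]

# Dense layer applied to the concatenated input [h; z].
concat_out = [y + bi for y, bi in zip(matvec(W, h + z), b)]

# The same layer split into a feature block and a noise block.
W_h = [row[:len(h)] for row in W]
W_z = [row[len(h):] for row in W]
split_out = [a + c + bi
             for a, c, bi in zip(matvec(W_h, h), matvec(W_z, z), b)]
```

Both routes give `[2.0, 1.0]`, so the pre-activation has the form W_h h + W_z z + b. Note that in this sketch the noise weights `W_z` are fixed parameters rather than a function of the input, which is one point where the analogy with a learnt posterior variance would need care.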

2-Confident (read it all; understood it all reasonably well)