Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The main point of contention for me is the main objective of this paper (not the objective they introduce). The authors write about "diversified representations". Why is this useful? What does it mean? How is it different from disentangled representations? How did they demonstrate their own definition empirically? This main argument is hard for me to distill from the current version. That said, the paper is well written and easy to follow. Next, I would like to ask for further clarification on the objective that is being introduced, i.e. the method itself. a) In Section 3.3.2, could you expand your explanation? Why is (8) a bound on (7)? b) The optimisation seems very involved; can you say more about the hyper-parameter space you explored and the sensitivities? c) Can you elaborate on the kind of networks you have been using, i.e. the backbone networks? d) What would happen if you dropped parts of your objective? E.g., do you need the first term of Eq. 3? e) What form does the discriminator network have? If I were to suggest that the classification performance is so good because the discriminator is quite close to what a classifier does, what would you say? Much as a GAN achieves better FID scores than a VAE because its discriminator has such a similar functional form; in that sense, are you not somewhat cheating? For this submission, I would have enjoyed reading the code. The experiments seem competitive with other methods, even though error bars have not been provided. However, again, I cannot find a clear hypothesis and corresponding experiments on what "diversified" means. In what context other than classification is it useful?
------ update after the rebuttal ------ the following points are revised from the discussion ------ 1/ In multi-view learning, people normally either collect data from two conditionally independent sources, or split the existing data into two disjoint populations, which creates multiple views cheaply. Splitting data into two disjoint populations can be thought of as minimising the mutual information between two "representations" of the same data. My point is that the authors shouldn't claim in the rebuttal that their work is totally independent of or different from multi-view learning, since IMO these two research lines are deeply connected. 2/ Maybe it is just my personal opinion. If the goal is to learn diversified vector representations, the paper needs to thoroughly justify the use of the information bottleneck and the whole variational-inference machinery, which I mentioned in my first review. To me, this paper threw variational autoencoders in everywhere possible and didn't even bother looking into the derivation to check whether there was redundancy. From a probabilistic point of view, given x, y and z are naturally conditionally independent. Since the label is also present during learning, which diminishes the conditional independence between y and z, the only thing we need to ensure is that the mutual information between y and z is minimised, which could be implemented with a single adversarial classifier. That was my immediate thought when reading the title. With a certain Lipschitz constraint, one can prove error bounds on this issue. 3/ I couldn't figure out why the I(r,t) term was necessary, as there are two classifiers on y and z respectively, and classification on the downstream tasks could rely on these two classifiers. ---- Variational inference is indeed interesting, and variational autoencoders are a huge win in deep learning settings since SGD optimisation can be applied.
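Point 1's framing of view-splitting as mutual-information minimisation can be made concrete on a toy joint distribution. The sketch below (my own illustration, not code from the paper) computes the mutual information between two discrete representations from their joint probability table; independent views give zero MI, while fully redundant views give MI equal to the entropy of either one:

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) between two discrete variables,
    given their joint probability table joint[z, y]."""
    pz = joint.sum(axis=1, keepdims=True)   # marginal over z
    py = joint.sum(axis=0, keepdims=True)   # marginal over y
    mask = joint > 0                        # avoid log(0) on empty cells
    return float((joint[mask] * np.log(joint[mask] / (pz @ py)[mask])).sum())

# Two independent "views": the joint factorises, so I(z; y) = 0.
independent = np.outer([0.5, 0.5], [0.25, 0.75])
print(mutual_information(independent))   # 0.0

# Two fully redundant "views" (z == y): I(z; y) = H(z) = log 2.
redundant = np.array([[0.5, 0.0], [0.0, 0.5]])
print(mutual_information(redundant))
```

An adversarial classifier, as suggested above, would estimate and minimise this quantity from samples rather than from an explicit joint table.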
However, we still need to carefully consider and justify why variational inference is necessary here. ---- ------ end of the update ------- 1/ The necessity of the information bottleneck objective is doubtful. The goal of the information bottleneck objective is to learn "minimally sufficient statistics to transmit the information in X about Y (which is T in this paper)". Given that the goal here is to learn diversified representations of the input source, I don't see how the information bottleneck objective is crucial. 2/ Learning a diversification objective by minimising the mutual information between two or among multiple pieces of information is not new. The performance gain in multi-view learning or consensus maximisation comes from ensuring that data representations (raw input representations or distributed representations learnt by neural networks) are conditionally independent given the correct label when they belong to the same class. Therefore, after learning, fusing multiple representations leads to improved performance compared to any individual representation. This idea was established around 30 years ago, and I recommend the authors at least refer to the multi-view learning literature. 3/ If the main goal is to show that the proposed framework is capable of learning diversified representations and that fusing representations gives better performance, then, at a minimum, a proper baseline should be: for a given baseline model, train two models with different initialisations and then compare against their ensemble. 4/ Lack of an ablation study of the proposed objective. This relates back to my first point that the information bottleneck objective is not necessary; the paper doesn't show why it is crucial. Also, the objective function itself seems bloated. For example, maximising I(z,t|\phi_z) + I(y,t|\phi_y) is a sufficient condition for maximising I(r,t|\phi), given that r is a concatenation of z and y.
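The redundancy claim in point 4 can be sketched with the chain rule for mutual information (using the reviewer's notation, with the parameters \phi suppressed); this is my own derivation, not one from the paper:

```latex
% Chain rule for mutual information, with r = (z, y):
I(r; t) = I(z; t) + I(y; t \mid z)
% Expanding the conditional term (interaction-information identity):
I(y; t \mid z) = I(y; t) - I(y; z) + I(y; z \mid t)
% If training drives I(y; z) \to 0, then I(y; t \mid z) \ge I(y; t), hence
I(r; t) \ge I(z; t) + I(y; t)
```

So once the framework succeeds in making z and y (near-)independent, maximising the two individual terms already pushes up I(r; t), which is why the separate I(r,t) term looks redundant.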
I hope the authors could critically evaluate their proposed method.
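For reference, the information-bottleneck objective questioned in point 1 above is, in its standard Lagrangian form (here \beta is the usual trade-off coefficient, not a symbol from the paper under review):

```latex
% Compress X into T while preserving information about Y;
% minimise over the stochastic encoder p(t | x), with trade-off \beta > 0.
\min_{p(t \mid x)} \; I(X; T) - \beta \, I(T; Y)
```

The critique is that this compression/prediction trade-off addresses minimal sufficiency with respect to Y, which is a different goal from diversifying two representations of the same input.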
The authors examine the ability of mutual information constraints to diversify the information extracted into the latent representation. They study this by splitting the latent representation into two parts that are learned in competition, using a mutual information constraint between them and opposing (min/max) mutual information constraints between the latent representations and the data. To train the model, the authors use a two-step competition-and-synergy training regime. The authors demonstrate the improved performance resulting from learning diversified representations in both the supervised and self-supervised settings. In the supervised setting, the authors show that the competing representations differ in their sensitivity to class-wise differences in the data. The difference is proposed to arise from the separation of high-level and low-level information learned by the two representations. In the self-supervised setting, the authors examine the role of the diversified representations in disentanglement and show improved performance compared to β-VAE and ICP. Major Comments: - The idea is well motivated by the information bottleneck literature, and although the mutual information derivations presented in this work are not novel (1), the role of diversified, competing representations in this context is not so well studied. Minor Comments: - An x-axis label stating which dimensions are from z and which are from y should be included. Chen, T. Q., Li, X., Grosse, R. B., and Duvenaud, D. K. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620, 2018. Kim, H. and Mnih, A. Disentangling by factorising. In Proceedings of the 35th International Conference on Machine Learning, 2018.