Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper is original. Despite similarities to LSGAN (Mao et al., 2017), the authors generalize the technique and provide theoretical guarantees for it. Overall, the paper quality is okay. While the authors did some work to defend their claims, the claims are not fully backed up. First, they use discrepancy as both a loss function and a measure of performance. Of course their model performs better on that metric: it has been directly trained to do so. A loss function is supposed to be a stand-in for the true task (e.g., no one optimizes a model with the end goal of reducing BCE loss, but rather to increase accuracy or F1 score on a test set), so comparing losses directly is not useful. A better result would show the model performing better on some other metric (e.g., the average test-set performance of identically initialized networks trained only on generated data). A second concern: the selection of the hypothesis sets for approximating the discrepancy is, as far as I can tell, not addressed beyond general phrases like "the set of linear mappings on the embedded samples." (The exception, of course, is the linear case, where they show the discrepancy to be twice the largest eigenvalue of M(theta).) There is theoretically an infinite number of linear mappings on the embedded samples, so how do they subselect? I would like to see that addressed in the paper, or at least made clearer. Other small issues with the paper include some grammatical mistakes.
This paper proposes to use a discrepancy measure in which a hypothesis set and a loss function are part of the discrepancy definition. In this very well-written paper, the authors make a compelling case for the potential benefits of a discrepancy measure capable of distinguishing distributions that would be considered close by the Wasserstein distance. The supplemental material is also a great addition to the flow of the paper, allowing the authors to clearly demonstrate their points. The DGAN derivations allow for a closed-form solution, which can be generalized to a more complex hypothesis set if an embedding network is employed first. The paper is very convincing in presenting the potential advantages when one knows which loss to use and has a good enough set of hypotheses (from the embedding network). However, the experimental section is something of a letdown. The results for GAN training get only a short treatment in 4.1, when they could have been a very strong point for the use of discrepancy. The results (7.02 IS, 32.7 FID) are reasonable but do not clearly separate the method from other GAN methods. It would have been nice to have a thorough set of experiments there, rather than for ensembling. The EDGAN approach works. However, ensembling is not as compelling a task as pure generation. The model basically learns to ignore high-discrepancy models and uses only the output of a few low-discrepancy models (as seen in the weights in Table 4 of the supplemental material), which is pretty much what intuition would dictate if you hand-picked the interpolation weights. You basically evaluate your whole technique based on picking only 5 linear weights... If ensembling is the key to evaluating your technique, some other methods (Wasserstein model ensembling at ICLR'19) also show decent results beyond model averaging. Overall, a very good paper; the experimental results could have focused more on DGAN than on EDGAN.
Note: on line 205, should 'H' be in the same font as the other hypothesis sets elsewhere in the paper? --------------- Thanks to the authors for their rebuttal. I am maintaining my score of 7.
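To make the remark about "picking only 5 linear weights" concrete, the following is a minimal sketch of how mixture weights over pretrained generators could be fit, assuming linear hypotheses with squared loss so that the objective reduces to the largest absolute eigenvalue of a difference of second-moment matrices. It is not the authors' implementation: the function names (`project_to_simplex`, `spectral_gap`, `fit_mixture_weights`), the projected-subgradient solver, and the step-size schedule are all illustrative assumptions.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {a : a >= 0, sum(a) = 1}."""
    u = np.sort(v)[::-1]                      # sort in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index kept positive
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)


def spectral_gap(target_m, gen_ms, alpha):
    """Largest |eigenvalue| of target_m - sum_k alpha_k * gen_ms[k].

    Returns the value, the sign of that eigenvalue, and its eigenvector.
    """
    diff = target_m - np.tensordot(alpha, gen_ms, axes=1)
    vals, vecs = np.linalg.eigh(diff)         # diff is symmetric
    i = np.argmax(np.abs(vals))
    return abs(vals[i]), np.sign(vals[i]), vecs[:, i]


def fit_mixture_weights(target_m, gen_ms, steps=500, lr=0.1):
    """Projected subgradient descent over the simplex (illustrative solver).

    target_m: (d, d) second-moment matrix of the real data.
    gen_ms:   (k, d, d) second-moment matrices of k pretrained generators.
    The objective is convex in alpha: a norm of an affine function of alpha.
    """
    k = gen_ms.shape[0]
    alpha = np.full(k, 1.0 / k)               # start from uniform averaging
    best_alpha, best_val = alpha.copy(), spectral_gap(target_m, gen_ms, alpha)[0]
    for t in range(1, steps + 1):
        val, sign, v = spectral_gap(target_m, gen_ms, alpha)
        if val < best_val:
            best_alpha, best_val = alpha.copy(), val
        # Subgradient w.r.t. alpha_k is -sign * v^T M_k v.
        grad = -sign * np.einsum("i,kij,j->k", v, gen_ms, v)
        alpha = project_to_simplex(alpha - lr / np.sqrt(t) * grad)
    val = spectral_gap(target_m, gen_ms, alpha)[0]
    if val < best_val:
        best_alpha, best_val = alpha.copy(), val
    return best_alpha, best_val


# Toy example: the first "generator" matches the target second moments exactly.
target_m = np.diag([1.0, 2.0])
gen_ms = np.stack([np.diag([1.0, 2.0]), np.diag([5.0, 5.0]), np.zeros((2, 2))])
weights, value = fit_mixture_weights(target_m, gen_ms)
uniform_value = spectral_gap(target_m, gen_ms, np.full(3, 1.0 / 3))[0]
```

Because the solver starts from uniform weights and keeps the best iterate, the returned objective value is never worse than plain model averaging in this sketch.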
This paper considers learning generative adversarial networks (GANs) with a proposed generalized discrepancy between the data distribution P and the generative model distribution Q. The discrepancy is novel in the sense that it takes the hypothesis set and the loss function into consideration. The proposed discrepancy subsumes the Wasserstein distance and MMD as special cases, obtained by choosing an appropriate hypothesis set and loss function. The authors then propose the DGAN and EDGAN algorithms, for which they consider the set of linear functions with bounded norm and the squared loss. Because the hypothesis set is restricted, the discrepancy has a simple closed-form estimate, which is very similar to matching the covariance matrix of the data. For the EDGAN algorithm, they propose to learn the combination weights of a mixture of pretrained generators, which is a simple convex optimization problem. In short, the overall writing is clear and the setup is novel, with solid theoretical analysis.
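The closed-form estimate described above can be illustrated with a short sketch. It assumes the hypothesis set H = {x -> w.x : ||w|| <= radius} and the squared loss, and derives the supremum directly from those assumptions; the function names and the `radius` parameter are hypothetical, and the constant factor in front of the eigenvalue depends on how the norm bound is normalized, so it need not match the paper's convention.

```python
import numpy as np


def second_moment(x):
    """Empirical second-moment matrix (1/n) * X^T X for samples of shape (n, d)."""
    return x.T @ x / x.shape[0]


def linear_squared_discrepancy(real, fake, radius=1.0):
    """Discrepancy between two samples for norm-bounded linear hypotheses.

    For h(x) = w.x and h'(x) = w'.x, the squared loss (h(x) - h'(x))^2 equals
    (u.x)^2 with u = w - w'. The gap in expected loss is therefore u^T M u,
    where M is the difference of the two second-moment matrices. Since u
    ranges over the ball of radius 2 * radius, the supremum of |u^T M u| is
    (2 * radius)^2 times the largest absolute eigenvalue of M.
    """
    m = second_moment(real) - second_moment(fake)
    eigvals = np.linalg.eigvalsh(m)  # M is symmetric, so eigvalsh applies
    return (2.0 * radius) ** 2 * np.max(np.abs(eigvals))


# Small deterministic example with two-dimensional samples.
real = np.array([[1.0, 0.0], [0.0, 1.0]])
fake = np.array([[2.0, 0.0], [0.0, 1.0]])
same = linear_squared_discrepancy(real, real)
gap = linear_squared_discrepancy(real, fake)
```

Note that no sub-selection of hypotheses is needed in this setting: the supremum over the whole norm ball is obtained from a single eigenvalue computation, which is why the estimate resembles matching second-order statistics of the data.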