Paper ID: 3674
Title: Quantum Wasserstein Generative Adversarial Networks

After rebuttal: Thank you for the rebuttal. It helped me better understand the sampling and the evaluation of the loss. Also, as your scheme is not designed to generalize OT to the quantum setting, I am fine that the quantum Wasserstein semimetric does not allow for a general cost function. Based on these points and the promising real-life experiment mentioned in the rebuttal, I have decided to raise my score to marginally above the acceptance threshold.
-------------------------------------------------------------------------------------
The paper introduces the Wasserstein semimetric between quantum states, which is then applied to learning to generate an empirical quantum state. The properties required of a semimetric are shown, and furthermore the authors show that it behaves smoothly with respect to the quantum states. The method is then compared to other work in the literature only indirectly, as the referenced works carried out their experiments in very different settings.
The paper is well written and the contributions are introduced in a mathematically clear way. For a non-expert in quantum computing the paper was a hard read though, mostly due to the extensive background needed (as evidenced by the 20+ page supplementary material). On top of this, in many cases the discussion and analysis are deferred to the supplementary material, which I think harms the ability of this paper to stand on its own. To some extent this is fine, but in this case I think it made the paper lack important content.
I was missing more comparison between classical WGANs and qWGANs. For example, it remains unclear to me how the sampling is done in the qWGAN case. Furthermore, I was not able to find how to evaluate the cost function c(x,y). In the classical WGAN case, x and y are samples from distributions, but what are they in the quantum case? Also, which cost function is used in this work?
As far as I understand, the matrix C does not depend on the supports of P and Q (that is, there is no ‘c’), which is very different from the classical situation. As C does not depend on a cost function on the underlying ‘sample space’, whichever that is in the quantum case, it is difficult to agree that (4.1) would be a natural generalization of the primal formulation of the Wasserstein metric. This is because the fundamental property of optimal transport is the ability to take the geometry of the sample space into account.
Specific remarks:
- Line 30: seminar -> seminal?
- Line 58-59: KL and JS divergences aren’t metrics in the strict sense of the word.
- Line 67: The main contribution is claimed to be the initiation of the study of quantum Wasserstein GANs. This wording sounds quite grandiose, taking into account that GANs have already been studied in the quantum setting.
- Line 75: The references for the formulation of the Wasserstein distance in the coupling form (primal) and ‘optimal transport’ form (dynamic, I guess?) are really vague, especially as Villani’s book discusses all three formulations (primal, dual and dynamic).
- Line 118-119: Denoting the distributions and densities with the same symbols is a bit confusing, although understandable as this notation is not used much in the rest of the work.
- Line 132-133: Weight clipping was the heuristic used in the vanilla WGAN, but is not the state of the art anymore (thus the word ‘often’ is a bit off here). I believe the gradient penalty method is much more prevalent nowadays.
- Line 162-163: ‘A more reasonable choice of C would be the projection onto the orthogonal…’, this was a bit confusing at first, but I guess here the projection of the cost matrix c(x,y) is meant?
- Line 194-195: The terminology of ‘expected value of observing Hermitian psi on quantum state Q’ is not familiar to me. Does this have a physical interpretation? The probability of observing something sounds more familiar.
- Line 209-210: What is a ‘simple tensor product of Pauli matrices’?
- Line 211-212: The ‘s’ after the expectation is confusing. Similarly the ‘s’ after alpha_k and beta_l; avoid mixing mathematical notation with ‘grammar notation’.
- Eq. (4.4): I guess the min here should be with respect to the parameters of G.
- Line 239: Using notation like \{p_i\}_{i=1}^N instead of just \{p_i\} would be clearer.
- Experimental section: I believe fidelity is only defined in the supplementary material, but in my opinion the right place for this would be the main paper. It might also be interesting, from the optimal transport point of view, to remark that the fidelity is closely related to the 2-Wasserstein metric between SPD matrices.
- Supplementary material: in the proof of Theorem 1. 3), C is defined as (I - SWAP)/2, but in 3) C = (I + SWAP)/2 is used (or at least written down).
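The sign in the last remark matters, and a small numerical check makes this visible. The following is a minimal sketch, not the authors' code: it assumes two qubits, uses plain Python lists, and all helper names (`kron`, `swap_matrix`, `cost_matrix`, `product_cost`) are hypothetical. It evaluates Tr(C (rho ⊗ sigma)), the cost of the product coupling, for both candidate matrices C = (I - SWAP)/2 and C = (I + SWAP)/2, using the standard identity Tr(SWAP (rho ⊗ sigma)) = Tr(rho sigma).

```python
# Pure-Python sanity check on two qubits (d = 2); helper names are illustrative.

def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    n, m = len(a), len(b)
    return [[a[r // m][c // m] * b[r % m][c % m] for c in range(n * m)]
            for r in range(n * m)]

def matmul(a, b):
    """Product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def swap_matrix(d):
    """SWAP on C^d (x) C^d: maps the basis vector |i>|j> to |j>|i>."""
    s = [[0.0] * (d * d) for _ in range(d * d)]
    for i in range(d):
        for j in range(d):
            s[j * d + i][i * d + j] = 1.0
    return s

def cost_matrix(d, sign):
    """(I + sign * SWAP) / 2; sign = -1 gives the antisymmetric projector."""
    s = swap_matrix(d)
    n = d * d
    return [[((1.0 if r == c else 0.0) + sign * s[r][c]) / 2 for c in range(n)]
            for r in range(n)]

def product_cost(c, rho, sigma):
    """Tr(C (rho (x) sigma)): cost of the product coupling of rho and sigma."""
    return trace(matmul(c, kron(rho, sigma)))

rho = [[1.0, 0.0], [0.0, 0.0]]    # pure state |0><0|
sigma = [[0.5, 0.5], [0.5, 0.5]]  # pure state |+><+|

c_minus = cost_matrix(2, -1)  # (I - SWAP)/2, as defined in the proof
c_plus = cost_matrix(2, +1)   # (I + SWAP)/2, as written in part 3)

same_minus = product_cost(c_minus, rho, rho)    # (1 - Tr(rho^2)) / 2
same_plus = product_cost(c_plus, rho, rho)      # (1 + Tr(rho^2)) / 2
diff_minus = product_cost(c_minus, rho, sigma)  # (1 - Tr(rho sigma)) / 2
print(same_minus, same_plus, diff_minus)
```

For a pure state the only coupling with both marginals equal to rho is rho ⊗ rho, so under these assumptions the cost of a pure state against itself is 0 with (I - SWAP)/2 and 1 with (I + SWAP)/2. Only the minus sign is compatible with qW(P, P) = 0, which suggests the plus sign in the supplementary material is a typo.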

This work connects the idea of using the Wasserstein distance as a cost function in GANs to stabilize training with the recently introduced notion of quantum GANs. To someone with sufficient visibility into both the machine learning and quantum computing literature, this might be seen as a natural progression that ties together ideas from existing works in a straightforward way. Admittedly, though, there may be very few such dual-field experts out there. For quantum computing researchers, this work provides a method to potentially train larger-scale models on noisy machines than have previously been managed. To machine learning researchers, this work could be interesting because it extends the familiar notion of WGANs to run on a new type of (available) hardware that can potentially model interesting intractable distributions.
The paper is technically clear. I did not spot any obvious flaws or scientific errors. The numerical results support the motivation and theoretical results of the paper (that training should be smooth and efficient). The submission is clearly written and the argumentation is clear. I could follow the narrative of the work without any big issues, and extensive supporting material, proofs, and code are provided in the appendix for those seeking more concrete details. There were a few minor typos (more of a nuisance than any significant barrier to understanding). I would recommend that the authors go through an additional round of editing to polish the paper.
This work builds upon previous work in a positive way, showing how existing training methods for quantum GANs can be made smoother and more robust, potentially allowing much larger models to be trained more easily on noisy hardware. Regarding significance, I would vote in favour of accepting this paper at NeurIPS because it is important for ML researchers to be made aware of advances and potential advantages of near-term quantum computing devices for machine learning.
In particular, the ability to build generative models which can model and sample from otherwise intractable distributions---with the ability to run these models on currently available (or near-term) noisy quantum hardware---might be of general interest to the machine learning community and should be highlighted. Wider awareness of the current ideas in quantum machine learning could potentially lead to interesting breakthroughs and new bridges being built by experts from both sides.

While it is challenging for me to understand all the new properties of the proposed quantum Wasserstein GAN, the motivation and background knowledge on the Wasserstein distance and Wasserstein GAN are presented clearly. The idea of applying the Wasserstein distance and its derived GAN model to quantum data sounds interesting and straightforward. It gives me the impression that all the components used, including the regularization, are well studied on regular data, and the paper seems to apply them to quantum data with some special designs. The experiments look very insufficient. They only evaluate the proposed method on numerical experiments, without any comparison against relevant methods or their adaptations.

I think the claim that the proposed definition of qW is new is not true:
- This definition was considered before by the team of Golse and collaborators [1], and was recently reviewed in [2]. These authors consider an apparently related but different cost function C and in particular introduce an epsilon parameter that should go to 0 (epsilon -> 0) to obtain a positive definite semi-distance, but my feeling is that this is because these works consider possibly infinite-dimensional Hilbert spaces (see below for why I think this is important).
- The same idea of using partial trace constraints was also considered in [3]. That work also considers an extra spatial dimension, but for a space with a single point, this gives the same type of problem.
- The idea of using quantum entropic regularization is considered in [4]. This is for a slightly different problem, where partial trace constraints are replaced by a quantum-KL discrepancy ("unbalanced qOT"), but this leads to the same expm Gibbs factorization of the solution.
Regarding the importance of using qW with respect to simpler losses: the authors refer to "the discontinuity of ... the trace distance", but I think this is not true; the trace distance is continuous. So, to me, since the paper only considers finite-dimensional settings, it is not clear how to support the narrative that qW is better than alternative distances (just as with classical OT, where one has to go to continuous spaces to see the difference in terms of induced topology between TV and W). So, the work of Golse and collaborators is very relevant here, since they do provide a quantitative comparison between qW and W applied to Husimi transforms of the states. The present paper lacks this type of control to justify the use of the qW loss. Related to this, it is a pity that, in the absence of theoretical results regarding their qW loss, the authors did not provide numerical results either to support its superiority when training GANs.
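To make the continuity remark concrete, here is a short worked note. It is a sketch under stated assumptions: the normalization T(rho, sigma) = (1/2) ||rho - sigma||_1 is the standard convention and is not taken from the paper, and the classical example with point masses is the usual textbook one, not something the authors state. The `cases` environment assumes `amsmath`.

```latex
% Assumed convention: T(\rho,\sigma) = \tfrac12 \|\rho-\sigma\|_1 (trace norm).
% The reverse triangle inequality makes T 1-Lipschitz in each argument:
\[
  \bigl| T(\rho,\sigma) - T(\rho',\sigma) \bigr| \;\le\; T(\rho,\rho') .
\]
% Classical contrast on a continuous sample space:
% P = \delta_0 and Q_\theta = \delta_\theta on \mathbb{R}, with c(x,y) = |x-y|.
\[
  \mathrm{TV}(\delta_0,\delta_\theta) =
  \begin{cases} 0, & \theta = 0,\\ 1, & \theta \neq 0, \end{cases}
  \qquad
  W_1(\delta_0,\delta_\theta) = |\theta| .
\]
```

In the classical example TV jumps at theta = 0 because the map from theta to delta_theta is itself discontinuous in TV on a continuous sample space, while W_1 varies continuously. In finite dimension all norms are equivalent, so any continuous parametrization of the state gives a continuous trace distance, and this separation between the two topologies does not appear.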
The nuclear norm and the Bures metric (as reviewed in the supplementary section "Distance measure") are both convex functionals, so one could also dualize them and develop a GAN, just as the vanilla GAN does for the JS divergence. It is important to realize that modifying the dual problem and using a neural network actually stabilizes the primal divergence; in practice, the vanilla GAN is still the most used method, and the superiority of the WGAN is only very marginal (or even not true).
Also, and this is related to these issues, I think the statement that the qW loss is differentiable is wrong (just as W is not differentiable). To have differentiability, the solution of the dual problem must be unique (up to an additive constant). This is in general not true, just as for classical OT. This is why adding entropic regularization is important, to obtain a smooth loss function. It is a shame that the authors did not emphasize this (in my mind) important contribution, since smoothing the W distance is a very important topic, especially when dealing with high-dimensional problems where OT is known to perform very badly (curse of dimensionality).
Lastly, my overall feeling is that the paper does a poor job of actually presenting the method and the take-home message. Most of the interesting part is actually in the appendix, in particular:
- The precise definition of the cost C is super important, and should be given in the main text *before* stating the theory. One should not have to read the supplementary to be able to understand the statement of the theorem (of course it is fine with me to have the *proofs* in the supplementary, but not the hypotheses).
- All the narrative and intuitive explanation about why qW is better than other norms is super important. Currently, this material is spread across the supplementary, and is explained in a quite fuzzy way. In particular, the paragraph "Classical examples of simple sequence ..." is very important, but there is no equation and no precise statement (in particular in light of my previous remark about the fact that the nuclear norm is continuous just as qW).
To summarize, I have mixed feelings about this paper, but I would still support acceptance under the conditions that:
- The authors give proper credit to previous works on static Kantorovich-type formulations for qOT, and in particular put their main contribution in context with respect to the previous work of Golse and collaborators (the critical part being to understand the differences/similarities between the cost functions, and to understand this in the context where the dimension -> +infty).
- The authors do some "mass transport" between the supplementary and the main text so that the paper becomes self-contained and explains more clearly the advantage of the proposed metric.
---- References ----
[1] F. Golse, C. Mouhot, T. Paul, On the Mean-Field and Classical Limits of Quantum Mechanics, Commun. Math. Phys. 343 (2016), 165–205.
[2] Emanuele Caglioti, François Golse, Thierry Paul, Quantum optimal transport is cheaper, https://arxiv.org/abs/1908.01829v1
[3] Lipeng Ning, Tryphon T. Georgiou, Allen Tannenbaum, Matrix-valued Monge-Kantorovich Optimal Mass Transport, https://arxiv.org/abs/1304.3931
[4] Gabriel Peyré, Lenaïc Chizat, François-Xavier Vialard, Justin Solomon, Quantum Optimal Transport for Tensor Field Processing, https://arxiv.org/abs/1612.08731