NeurIPS 2020

Learning with Optimized Random Features: Exponential Speedup by Quantum Machine Learning without Sparsity and Low-Rank Assumptions

Review 1

Summary and Contributions: The paper present a quantum algorithm for sampling Fourier features with probabilities proportional to the ridge leverage scores of the kernel. The proposed algorithm can generate a sample from this distribution in time nearly proportional to the dimension of the dataset.

Strengths: Fast quantum algorithm for large scale kernel method. Theoretical analysis of the proposed method.

Weaknesses: The paper introduces quantization of the data and Fourier domain but does not analyze the error that is introduced by this. In particular, in page 5, the data domain is quantized and the points of the dataset are rounded to their nearest grid point, however, the effect of this rounding is not quantified. This can be an issue. For instance, if a dataset contains a cluster of points that are all very close to each other such that after rounding they will be all rounded to the same grid point. In this case even if you compute the exact kernel matrix after rounding, it will not be spectrally close to the original kernel matrix.

Correctness: Besides the rounding issue, I did not find any issues in the proof.

Clarity: Yes the paper is mostly clear.

Relation to Prior Work: I do not agree with the claim that is made about the prior work of [20]. The authors criticize [20] for not sampling the Fourier features with respect to the ridge leverage scores distribution of the integral operator \Sigma. First of all, this integral operator in unknown in practice because it is defined by making the assumption that the distribution from which the dataset is generated in known. This assumption is clearly not true in most real world settings. In particular, in page 5 the authors try to circumvent the issue of having unknown distribution on the input dataset by approximating the distribution using a heuristic method whose performance is not analyzed in the paper. Second, the focus of [20] is to spectrally preserve the kernel matrix K whose entries are the kernel value on pairs of points from the input dataset. This is sufficient for proving that the solution to any of the downstream kernel methods obtained by using the approximated kernel matrix is a (1+\eps) approximation to solution of the exact kernel method. The spectral approximation to the kernel matrix K is obtained in [20] by sampling the Fourier features with respect to the leverage scores of the kernel matrix K.

Reproducibility: Yes

Additional Feedback: See the above sections "Relation to prior work" and "Weaknesses". ============================================== I thank the authors for taking the time to provide feedback. I find the clarifications in the rebuttal very helpful. My assessment is that the presentation of the paper is suboptimal and it is hard to see the main techniques/ideas. The authors have addressed my concerns about rounding and also the position of this work in relation to [20]. I encourage the authors to add these clarifications to the main body of the paper. Given this, I decided to raise my score.

Review 2

Summary and Contributions: This paper proposed a quantum algorithm for kernel methods with random features. Specifically, their quantum algorithm runs in time O(D) where D is the dimension of the input data, whereas classical algorithms take O(exp(D)) time.

Strengths: I think the main strength of this paper is the claim that it does not have sparsity or low-rank assumptions. These two assumptions have been dominant in previous literature on quantum machine learning, and it’s of general interest to remove such assumptions which can be too strong in practice. Roughly speaking, this paper achieved this using the quantum Fourier transform, the core behind the Shor’s algorithm for integer factorization.

Weaknesses: I think the main weakness of this paper is that the writing is too dense to parse, while on the other hand the current content in the main body is not enough to examine the correctness of Theorem 1 and 2. To be more specific, Theorem 1 and 2 are all both technical results, especially Theorem 1. From my understanding of Section 3.2 and the relevant part in the appendices, the authors used the quantum RAM as well as the quantum singular value transformation (QSVT) algorithm to achieve the speedup for sampling from the features using the quantum Fourier transform. I understand that NeurIPS submissions have page limitation, but I think all the steps should at least be highlighted. In particular, I feel that discussions are needed for: - What quantum RAM do we need for the task? - What can we use the QSVT without the sparse or low-rank assumption? They are actually important factors to turn into the block encoding (see Lemma 48-50 of - How does the quantum Fourier transform come into the overall algorithm? On the other hand, I’m not totally sure how Section 2.3 (discretized representation of real number) and Section 2.4 (assumption on data in discretized representation) help with the overall story--I can foresee their usage, but the paper can probably shrink this space and highlight more on the proofs of Theorem 1 and 2. These are the two main technical results, but are only given in the last page (Page 8) without a comprehensive discussion. A practical solution is to fulfill more details between Line 251-272. In all, my feeling of the current version of the paper is that it somehow manages to show a bound that’s independent of sparsity or low-rankness in an explicit way, but the problem is involved, the algorithm has various of black-boxes, and I’m not totally sure that after opening all the black-boxes between the connections among QRAM, QSVT and quantum Fourier transform, the claim is as strong as the current one. Minor points: - This paper didn’t give any discussion on the practical side of their quantum algorithm. It is fine as a theory paper, but for the general interest of the NeurIPS, it would be much better to introduce more of that. For instance, the paper [47] by Havlicek et al. applied the kernel method for a supervised learning problem on a 5-qubit machine. Although Ref. [47] has drawbacks in precision, but considering the similarity between methods and problems, and for the interest of the implementation side, the current submission should probably discuss more. - I’m a bit confused about Line 59-63. It seems to me that if the inverse operator is hard to calculate, it is purely a mathematical thing, and requiring exp(D) dimension should apply to both classical and quantum. What’s the difference here?

Correctness: The overall story looks plausible, but as I said in the weakness part, it’s not easy to verify the correctness of Theorem 1 and 2 without checking many details of the supplementary material. This paper is purely theoretical, so the second question doesn’t apply.

Clarity: The English is fine, but the presentation is too dense, and more efforts can probably be made to articulate the algorithms and the proofs behind.

Relation to Prior Work: Yes, as far as I see. A minor point: some reference information was outdated. Ref. [41] was accepted by STOC 2020 and Ref. [40] was accepted by MFCS 2020.

Reproducibility: Yes

Additional Feedback: After rebuttal: I would like to thank the authors very much for answering my questions as well as those from other reviewers'. I'm happy to increase my score. The main reason is that the original presentation didn't articulate the technical contributions of the work; after having explicit pointers and discussions, I have a better understanding about the paper, and it turns out that the contribuion is more significant than I previously thought. Another minor comment I just realized: There seems to be minor style issues of the references -- all the titles of the papers are omitted. This is not very common in computer science papers; is it possible to fix this?

Review 3

Summary and Contributions: This paper considers feature extraction using quantum machine learning (QML) algorithms. In fact, it is using classical mathematical tools to analyze the QML algorithm running time to speed up the random feature extraction. The claimed contributions are the algorithm is speed up without considering sparsity and low-rank assumptions. Especially, the running time of O(D) is attractive but might not be feasible.

Strengths: The paper is well written. It describes the problem well, though there is a long introduction before the contribution of this paper. But it might be necessary for a topic of QML. +) Motivation of this work is clear and strong. +) The derivation of the model from (5) to (8) is solid. I assume (8) is actually the real formulation and the starting point of the contribution of this paper.

Weaknesses: My concerns lie in the following aspects: -) The contributions of this paper is actually 2 parts: 1) the O(D) running time and 2) without considering the sparsity and low-rank. However, for 1) in line 270, I don't think the \tilde O(Q_max / polylog(1/\Delta)) can be ignored. Considering the Delta is very small, then this term can be big due to 1/\Delta. In this case, the whole statement of O(D) does not hold anymore. For 2), even the authors do not consider sparsity or low rank, the proof seems to me using concentration-of-measure, i.e, the high probability considering the tails of the distribution. In this case, it essentially has some connection to low rank. So the claim of not considering low rank is not that strong either. -) Even the claims are true (most likely not), the results have been in [39][42][44] It seems the result is a simple extension of existing results.

Correctness: As mentioned in the weakness, it seems the claims are not well defined. I am not an expert in QML, so I cannot evaluate the applications of these methods.

Clarity: The readability of this paper is hard, especially the reference. No title of the the paper is presented in the reference. This costs too much time for the reviewers. For example, in [7], F. Bach has two papers in JMLR and they are in the same issue. Which one does the authors want to cite?

Relation to Prior Work: Yes. Most of them are clear.

Reproducibility: Yes

Additional Feedback: The contributions should be clearly defined without over claiming. Some real applications might need to be considered. === Post Rebuttal: After read the rebuttal. I am OK with O(D). However, for the "concentration inequality" based proof, it essentially impose sparse or low-rank. The contribution is incremental.