Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
1. Originality: The algorithm and analysis in the paper are novel.
2. Quality: The method is supported by the essential theoretical analysis, as well as by experiments on several large-scale regression datasets.
3. Clarity: The overall clarity of the paper is good.
# Originality
I am not an expert in the literature on online learning with kernels, but [12, 13] do seem to be the most relevant prior work, which the paper cites and builds upon. I would be surprised if the idea of using Taylor approximations for kernels is completely new; please consider adding relevant references.

# Quality
I have not verified the details of all the proofs, but the math in general does seem to add up: the main claims appear to follow correctly from the central Theorem 9 and the other relevant results. So does Theorem 3, which is central to the third contribution, the one I find most interesting. I believe, however, that the claims in Section 3.1 regarding Gaussian kernels have to be taken with caution. In particular, the computational and storage complexity apparently does not scale with the effective dimension as claimed, but only with a quantity that is an upper bound on the effective dimension. As such, it cannot be claimed that the algorithm is more efficient than the $O(n^2)$ ones (and being more efficient than $O(n^2)$ is not a really useful property whenever $n = O(e^d)$). Having said that, the authors do provide experimental results showing that the algorithm with Taylor approximation is fast.

# Clarity
The paper is clearly written and easy to follow, though it certainly requires proofreading, as there are many typos, grammatical errors, etc. The proofs also seem cleanly written and easy to follow.

# Significance
I have commented on the significance in the previous question. In general, this is a clear paper and a nice addition to the literature, hence I am recommending acceptance. However, I have some questions about the third contribution (see below), and the first and second contributions do not seem enough on their own.

# Questions for the authors
- Is there any other benefit to your use of AWV other than removing the bound on $||f||$?
- Wouldn't Pros-N-KONS obtain the same regret bound as PKAWV if you apply Theorem 9 (with the difference that the first two terms ("regret terms") in the bound are handled by the KONS regret analysis rather than the K-AWV analysis)? In particular, is a version of Corollary 8 possible for KONS?
- What is the key property that makes Corollary 8 possible? I realize that the approximation terms in Theorem 9 capture the change in the projection operators, but shouldn't this be compensated for by the adaptivity of the projections? Also, why is the $m$ term not in the projection error term bound of Pros-N-KONS (why is it only in the regret bound)?

# After the author response
After reading the other reviews as well as the author response, I would like to increase my score. The author response does answer my questions for the most part. I can also agree with the authors that there are $(n, d)$ regimes in which this is applicable and useful; the authors also correctly note that the result is more general, and that less computation is possible at the expense of more regret.
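For reference, the AWV (Vovk-Azoury-Warmuth) forecaster the questions above refer to, in its basic linear form, folds the current input into the regularized second-moment matrix *before* predicting; this extra regularization is what removes the need for a bound on $||f||$ in the regret analysis. A minimal sketch, with class and parameter names being illustrative rather than taken from the paper:

```python
import numpy as np

class VAWForecaster:
    """Minimal Vovk-Azoury-Warmuth forecaster for online linear
    regression with squared loss (illustrative sketch only)."""

    def __init__(self, dim, lam=1.0):
        self.A = lam * np.eye(dim)   # regularized second-moment matrix
        self.b = np.zeros(dim)       # running sum of y_s * x_s

    def predict(self, x):
        # The current x_t is added to A *before* predicting; this is
        # the VAW twist compared to plain follow-the-regularized-leader.
        self.A += np.outer(x, x)
        return x @ np.linalg.solve(self.A, self.b)

    def update(self, x, y):
        # After the true label is revealed, accumulate y_t * x_t.
        self.b += y * x
```

On a noise-free linear stream the forecaster's weight estimate converges to the true weights, since it is a ridge estimate whose regularization bias shrinks as rounds accumulate.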
While the paper makes some good theoretical contributions, I do feel that the contributions may be rather niche and of limited impact. Moreover, the exposition was hard to follow; several undefined notations and grammar mistakes make the reading unpleasant.

Originality: I would rate the ideas low on originality, as the idea of using Taylor expansion and Nystrom approximation has been extensively explored for online learning with kernels (and for learning with kernels in general).

Quality and Clarity:
- There are several grammatical and semantic/notation errors which disrupt a smooth reading (e.g. "we aims at"; "little considers non-parametric function"; AWV seems like an arbitrary term for denoting an algorithm, and I could not find what AWV stands for; what are B and d_eff in Eq 2?; Line 122, "The on the space...", is not a sentence; and many more such instances throughout the paper).
- The first paragraph of the introduction makes several strong claims about the research directions in general that are not properly validated through references or other evidence.
- What are the pros and cons of considering squared loss over a Lipschitz loss?
- What was the motivation for choosing the specific basis function for the Gaussian kernel? How does this approach differ in principle from other projection methods (e.g. Fourier projection)?
- Since the dictionary for the Nystrom method is fixed, would that make the method sensitive to noise?
- The experiments show the algorithm as PKRR, while the algorithm has been denoted as PKAWV in most of the paper. The experiment section states that M = 2, 3, 4 are tried, but only results for 2 and 3 are shown.
- I am not sure why the proposed approaches are faster than FOGD, as all of them are essentially linear projection algorithms. All algorithms should grow linearly in time (except KRR), but the experiments do not seem to indicate this for the baselines. A timeout of 5 minutes (and later stated as 10 minutes) seems very small and arbitrary.
- Some of the learning curves also seem a bit strange. How significant is the performance gain of the proposed methods? The log-scale graphs make it hard to see; perhaps a table with standard deviations would give a better picture.
- Were the classification experiments done using a squared loss too? Would that make the comparison slightly unfair, favouring the proposed method?

Significance: While the work in general is good from a pure theoretical perspective, I think that the audience for this work might be a bit too narrow (it focuses on one specific type of loss function, and primarily improves the speed while maintaining the same regret bounds), limiting its significance.
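As background for the basis-function question above: for the Gaussian kernel, the degree-M Taylor expansion yields explicit finite-dimensional features in closed form. A minimal 1-D sketch, with function names that are illustrative and not taken from the paper:

```python
import numpy as np
from math import factorial

def taylor_features(x, M, sigma=1.0):
    # Degree-M Taylor features for the 1-D Gaussian kernel
    #   k(x, y) = exp(-(x - y)^2 / (2 sigma^2))
    #           = exp(-x^2/2s^2) * exp(-y^2/2s^2) * exp(x*y/s^2),
    # obtained by truncating exp(x*y/s^2) to its first M+1 Taylor
    # terms, so that k(x, y) ~ <phi(x), phi(y)> with
    #   phi_k(x) = exp(-x^2/2s^2) * (x/s)^k / sqrt(k!).
    z = x / sigma
    return np.array([np.exp(-z**2 / 2) * z**k / np.sqrt(factorial(k))
                     for k in range(M + 1)])

def approx_kernel(x, y, M, sigma=1.0):
    # Inner product of truncated feature maps.
    return float(taylor_features(x, M, sigma) @ taylor_features(y, M, sigma))

def exact_kernel(x, y, sigma=1.0):
    return float(np.exp(-(x - y)**2 / (2 * sigma**2)))
```

The approximation error decays with M at a factorial rate for bounded inputs, which is what gives the small dictionaries in the low-dimensional regime; in d dimensions the features are tensorized monomials, so their number grows roughly as d^M, which is the source of the exp(d)-type caveats raised by the reviews.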
Originality/related work: The paper feels incremental, as it combines existing techniques/ideas in a (perhaps) novel fashion (perhaps the main novelty is bringing in the Vovk-Azoury-Warmuth forecaster). The main claim is that this is the first method that achieves subquadratic per-round complexity (for reasonable classes of kernels). In relation to this, it would be great if the authors could comment on the relation between the present work and the papers [1, 2, 3]. In particular, the recent paper by Zhang and Liao seems to be doing something very close to the present paper, though they consider general convex losses (unlike the present paper, which considers squared loss), hence their regret results will be different. However, the essence of both papers seems to be the same (this paper is also close to , cited in the paper). Calandriello et al.'s COLT paper considers optimization with bandit feedback, but otherwise the ideas are very close. The same applies to , which I would have also expected to be cited and compared with.

Quality/clarity: The paper is generally well written. It is a bit hard to make out the algorithms, though, based on the text of the main paper. Some minor comments are listed below.

Significance: The method looks promising; the main theoretical contribution is perhaps Theorem 3 (or its more advanced version, Theorem 9).

Minor comments and some questions:
* The abstract mentions adversarial datasets; this seems like nonstandard terminology that is perhaps not needed.
* e^d seems large...
* Throughout the paper, reference numbers are used with no names. This makes the paper unnecessarily hard to read. Please write citations in a way that allows mortal humans to follow them.
* After Eq (2) it is mentioned that the bound in Eq (2) is "essentially optimal" and that the next section will explain this; but I could not really find where this is explained. Please be more precise with cross-references.
I guess this refers to the minimax lower bounds of Zhang-Duchi-Wainwright; I'd appreciate it if the appendix had the precise statement (reformulated in the formalism of this paper).
* Page 2, line 50: aims -> aim
* Page 3, line 88: player -> learner
* Line 99: What about larger gamma?
* Line 109: "rough inequality" -> "approximate inequality"? Comma missing after "Finally".
* Line 119: f = sum_i alpha_i phi(x_i) is not formally correct.
* Line 120: K_pp has not been introduced yet.
* Page 3, line 122: The first sentence is weird.
* Page 4, line 167: C is the covariance operator: which one?
* Line 168: Euclidean projection w.r.t. the RKHS norm, right?
* Page 6, line 202: What is the range of the grid? Can you add more details? Citing the CBL book does not help much here.
* Line 209: "works" -> "work". What does "It" refer to? Consider rephrasing.
* Line 213: So where is that algorithm?
* Line 235: "classical assumption": perhaps "common setting"? (This is not really an assumption; more like a setting or an example.)
* Page 8, Fig 2 (also in appendix): The figures seem to use labels that don't match the main body.

References
----------
[1] Zhang, X., & Liao, S. (2019). Incremental randomized sketching for online kernel learning. In International Conference on Machine Learning (pp. 7394-7403).
[2] Calandriello, D., Carratino, L., Lazaric, A., Valko, M., & Rosasco, L. (2019). Gaussian process optimization with adaptive sketching: Scalable and no regret. arXiv preprint arXiv:1903.05594. (Also at COLT 2019.)
[3] Rudi, A., Calandriello, D., Carratino, L., & Rosasco, L. (2018). On fast leverage score sampling and optimal learning. In Advances in Neural Information Processing Systems (pp. 5672-5682).