Paper ID: | 7167 |
---|---|

Title: | Stochastic Continuous Greedy ++: When Upper and Lower Bounds Match |

The results established in this work are significant from a theoretical standpoint. Through the proposed algorithm SCG++ , the authors have developed a method where quite literally both the upper and lower bounds match for maximizing a monotone DR-submodular function 'F', being true to title of the paper. The approach taken in proving this result based on a variance reduction method that estimates the difference of gradients in the non-oblivious stochastic setting is novel. The ideas developed in Section 4 are very interesting and useful. So on originality, quality and significance, I would rate the paper in the top tier. My primary concern with the paper is how much these results add value in a practical setting. Since no experimental evidence are provided, it is hard to judge the computational speed up gained by SCG++ over competing methods like SCG, SGA, Continuous Greedy methods etc. when applied to an actual problem. For instance: a) When both SGA, SCG and SCG++ use the same number of gradients, what is the difference between the function values at the obtained solution? b) Experimentally, how many number of gradient evaluations does each of these algorithms need to obtain the same quality of the solution? c) In the setting when the monotone DR-submodular function 'F' is available in closed form and the its gradients can be computed exactly, how much the variance reduction method based on difference of gradients actually help say in reducing the variance and giving a superior estimate? Disclaimer: I have not read the proof in detail to aver its accuracy. Typo: 1. In line 120, I believe it should be \nabla f(y, \mathcal{M}). \nabla is missing before 'f'. EDIT: ----- The authors have addressed my concerns in their rebuttal report.

High level takeaway from this paper: This is a solid theoretical work with improved theoretical and algorithmic results. Pros of this paper: - Tighter approximation guarantee while achieving a tight convergence rate - Improved variance reduction technique for estimating differences of gradients which is a critical ingredient in their setting - Lower bounds showing their results are tight - Extension to stochastic submodular maximization Cons of this work (also includes some questions/clarifications which are not clear in the paper) - Seems like a condensed version of a much longer paper: While this is not uncommon in NeurIPS, it does affect the clarity of the paper. Also all the proofs and main ideas are in the extended version which makes the paper harder to go through. This is not a criticism as such, but just a suggestion to improve the readability of this manuscript - Extension to the discrete setting: While the result seems to follow from the results of the continuous submodular counterpart and the multilinear extension, I do not understand how one would compute the multilinear extension efficiently. One still needs to sample the ML extension which is high polynomial in complexity? I'm not sure how the authors are circumventing this. - Lack of concrete examples: This comment is coming more from a practitioners perspective. It is unclear how to use such an algorithm in practice. What are concrete examples of continuous submodular functions and how do the results in this paper impact them? Lack of empirical results demonstrating the improved convergence: Related to the above. This paper has no empirical results to demonstrate how these carry over in practice. Does the improved convergence results imply improved performance in real world applications? I would like to see this in the main version of the paper.

Originality: The paper studies a known problem of maximizing a continuous submodular function under convex constraints. The main technique to achieve the improved number of queries are 1) adapt the gradient variance reduction technique to the non-oblivious case using Hessian and 2) the approximation of the Hessian with O(d) gradient evaluations. The proposed bound needs the assumption about the smoothness of the Hessian (assumption 4.7), and I think without such an assumption, it would require O(d^2 \epsilont^-2) oracle calls. The dependence on d seems also important to me. Maybe the authors can illustrate more on this point. Quality: The paper has theoretical analysis of the proposed algorithm as well as a lower bound. Clarity: The paper is well organized. Significance: The proposed algorithm achieves the best-known results of the number of queries to get an almost optimal guarantee for DR submodular functions. The number of queries is also shown to match the information theoretical lower bound. The paper's result is therefore significant. After rebuttal: I have read all other reviewers' comments and the authors' feedback. I keep my evaluations unchanged.