Paper ID: | 3603 |
---|---|

Title: | Stochastic Proximal Langevin Algorithm: Potential Splitting and Nonasymptotic Rates |

Update: I have read the reports from the other referees and the response from authors. The authors have done a good job addressing the overall concerns, and with the addition of comparisons to the baseline truth, I have changed my assessment to an 8. Originality: It is clear that the method is inspired from the the Passty algorithm, including the proofs therein. The originality is in using a Langevin dynamics over the usual Passty algorithm. Quality: The results are clear and well explained, and the example demonstrates the theoretical results as well. Significance: Overall, the article seems useful, particularly due to the possibility of sequential implementation.

** General comments The paper provides an original theoretical contribution which should have impacts at least in methodological developments. It is not clear to me to what extent the necessity to have access to a proximal operator (routinely) represents a serious bottleneck. How of a limiting factor is that? ** Major comments/questions include: - the paper presentation could be improved, see remarks below. Some English mistakes too. - however, the contribution is clearly mentioned, even though a more explicit presentation of the main results of [17] would be helpful. At a technical level - how can bounds involving a random number of iterations be useful in practice? This is also a problem in Table 1 where those rv are not defined. A remark on fact that \mu^\ast is compared to a random measure would be helpful. Is that standard in the related bounds found in the literature (eg [17])? - why are the KL bounds for convex F and strongly convex F involving two different stage of a SPLA transition? For convex F the KL is between \mu^\ast and the r.v. before the proximal step while for strongly convex its between \mu^\ast and an entire SPLA transition. An insight would be very helpful. - check below some proof-related questions. - for the application part: were the Author able to find an application of thm 1 and other corollaries by identifying the constants L, \alpha and C? Also, instead/or in addition of numerical results it would be more useful to compare directly the bounds of SPLA with bounds of other subgradient LA algorithm (if applicable). ** Minor points Check the text around Eq. 1: the first part should be in the text do not use where in series. l26 the proposed additive model for U offers and check the end of that sentence. l54 rather than certain samples perhaps the proba. measures of a specific collection/or subsequence of random variables generated by our method. l67-68 it would be useful to have this comment in a separate remark environment, to highlight this point. l78 if more than one method, name it otherwise correct to ``methods that use'' l128 entropy is not well defined l142 title, perhaps convergence rates instead of rate l166 comments of this paragraph and especially the last part of it should be moved to a dedicated section as they bring the subtle part of the work. l173 applies to l378 =\int g d\mu not )\int g d\mu l382 the statement is not trivial since in 17 Lemma 4 the inequality is for the function \mathcal{E}_{U} and not \mathcal{H} check ref. [36] Lemma 6, to reach \mu_{y_0^k}, isn't it required to take t=2\gamma instead of t=\gamma? Indeed, \nu_\gamma has the same law as z^k+\sqrt(\gamma) W^k which is not precisely the law of y_0^k. But this would change the inequality, with a factor 4\gamma instead of 2\gamma in [9]. Can you please check that? l443 the factor \gamma^2 should be inside the sum if the sum also applies to the 2nd and last terms ; otherwise a sum is missing for the 2nd and last terms. l449 Why is Lemma 5 required here? l477 it is not clear how the measure \cal{F}(\mu_{y_0^j}) is substituted by \cal{F}(\mu_{\tilde{x}^k}) Application l197 it could be useful to identify the functions f and g_i and the rv \xi of the expectations of functions F and G in this example. Esp. f\equiv f(x) and in this case there is no randomness. l214 v and w are not defined, are they vertices?

Response to the rebuttal: Thanks to the authors for responding to my questions. I still feel the experiments of the paper can benefit (especially from the rebuttal) from more work but keeping the theoretical contributions I have decided to increase the score to 7. ========== The authors propose a new sampling algorithm, called Stochastic Proximal Langevin Algorithm, and analyze it for sampling from log-concave distributions where the log density is a sum of smooth and differentiable function along with a sum of non-smooth convex functions. They establish the rates of convergence for the algorithm, linear when the smooth part is strongly convex and sub-linear otherwise (the rates are not surprising). The authors provide a numerical comparison of their algorithm with a few other algorithms for trend filtering on large graphs, demonstrate that for this set-up their algorithm is useful when the full proximal operator is computationally prohibitive and in general it performs faster than Stochastic Subgradient Langevin Algorithm. On technical aspects, their analysis appears to be borrowing heavily from [17] (as the authors have correctly acknowledged). I would like to remark that the paper is well-written and easy to read. I have a few remarks: — Given the introduction of the new algorithm, I believe the work can be motivated better if the authors provide several concrete set-ups/examples where their algorithm can be potentially useful (with some rough gains that one can obtain). Even for the example, they considered, the significance of sampling in comparison to optimization (simply finding the MAP) is not clearly spelled out (other than noting that it might be useful). — It is not clear to me when is SPLA faster than SSLA? and vice-versa? — There should be some trade-off of splitting the non-smooth part into an arbitrary sum. In simple words given f+g, I should not gain by treating g as g/2 + g/2. — I would recommend the authors to provide some clear explanations for computational trade-offs when such splitting is beneficial, and some intuition as to? Moreover, do we gain/lose if we use different noise draws for each proximal split? Is the gain simply analogous to the gains of SGD over GD when the number of samples is large (or is there something more to it)? A paragraph providing intuitions about the new algorithm would make the paper more insightful.