Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
- The contribution is not clearly stated in the introduction section.
- Related work is not sufficiently covered in the paper.
- The theoretical results in the paper are interesting, especially the characterization of the convergence rate of the continuous algorithm in (13), and the recovery of different discrete optimization algorithms via the Forward-Euler scheme. The main missing piece is a discussion of convergence rates after discretization: how does the rate in (13) extend to the discretized versions of the algorithms?

Minor comments:
- Line 15: "algorithms to perform" --> "algorithms perform"
- Line 89: "gradiet" is a typo
- Line 44: there is a (!) sign there; I wonder if something is missing
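To make the discretization question above concrete, here is a minimal sketch of the Forward-Euler (Euler-Maruyama) scheme applied to a gradient-flow SDE dX_t = -grad f(X_t) dt + sigma dW_t. This is an illustrative toy, not the paper's actual algorithm; the step size eta, noise scale sigma, and the quadratic test function are placeholder assumptions.

```python
import numpy as np

def euler_maruyama(grad_f, x0, sigma, eta, n_steps, rng=None):
    """Forward-Euler (Euler-Maruyama) discretization of
    dX_t = -grad_f(X_t) dt + sigma dW_t, with step size eta.
    Illustrative sketch only; not the paper's scheme."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # Drift step (gradient descent) plus Brownian increment of
        # variance eta per step.
        x = x - eta * grad_f(x) + sigma * np.sqrt(eta) * noise
    return x

# Example: f(x) = 0.5 * ||x||^2, so grad f(x) = x. With sigma = 0 this
# reduces to plain gradient descent and contracts toward the minimizer 0.
x_final = euler_maruyama(lambda x: x, x0=[1.0, -2.0], sigma=0.0,
                         eta=0.1, n_steps=100)
```

With noise turned on (sigma > 0), the iterates hover around the minimizer with stationary variance controlled by sigma and eta, which is exactly why the continuous-time rate in (13) need not transfer to the discretization without further argument.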
This paper studies a variational theoretical framework for stochastic optimization. In particular, the authors show that finding the minimizer of a stochastic optimization problem is equivalent to solving a variational problem over a latent function space, and also to solving a forward-backward SDE. They also show how to recover some popular stochastic optimization algorithms by discretizing the optimality equations defined by the SDE. However, this part lacks clarity and requires more discussion of the theoretical behavior of the discretized algorithms.

Overall, this paper is well written and has strong results. My major comments are as follows:

- Stochastic mirror descent has also been studied by [1, 2] for general convex and strongly convex functions in a way similar to this paper. Starting from a Bregman Lagrangian very similar to equation (8) in this paper, they first derive the corresponding ODE for deterministic optimization, and then derive stochastic differential equations based on this ODE and a noisy gradient. Both papers provide an Euler discretization of the SDE that enjoys the optimal convergence rates of the corresponding stochastic optimization algorithms. Given the similarity, a discussion of how the method in this paper differs from these references should be provided.
- For the discretization methods presented in Section 5, can you guarantee the convergence of these methods under general assumptions?

References:
[1] Xu, Pan, Tianhao Wang, and Quanquan Gu. "Accelerated stochastic mirror descent: From continuous-time dynamics to discrete-time algorithms." International Conference on Artificial Intelligence and Statistics, 2018.
[2] Xu, Pan, Tianhao Wang, and Quanquan Gu. "Continuous and discrete-time accelerated stochastic mirror descent for strongly convex functions." International Conference on Machine Learning, 2018.
A discussion of the theoretical behavior of the discretized methods is needed. In particular, it would be interesting to know how the analysis of the variational problem can be used to understand the convergence of these discretized methods.

Minor comments:
- Line 44: there is a (!) symbol, and the text color is also different
- Line 112: "define" should be "defined"
- Line 330: "FBSDE 9" should be "FBSDE (9)"
- Lines 474 & 482: equations exceed the margins

=== After authors' rebuttal ===
I have read the authors' responses, which addressed my comments. Given their rebuttal, I am willing to raise my score to 7. I look forward to seeing more discussion of the relationship between the variational models and the discretized algorithms.
1. This is a terse paper, but it is extremely well written.
2. How novel is Theorem 4.1? I wonder whether the system of equations in (9)-(11), given the action functional in (6), can be derived using existing techniques, say the ones in Casgrain and Jaimungal (2018a, b, c).
3. The convergence rate in (13) is particularly interesting. Can you comment on the scaling of the stochastic term with time?
4. The development in Section 5 leaves a lot to be desired because it relies on a number of debilitating assumptions. The model grad f = sigma W^f_t on Line 222 is very basic; it is only accurate toward the end of optimization, when the system is near a critical point. Similarly, the linear state-space model in Section 5.2 assumes that each component of the gradient is an independent linear diffusive process, which is not at all realistic. Can you give examples where the Bregman Lagrangian leads to existing or new optimization algorithms without such assumptions?
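To illustrate the kind of per-component model criticized in point 4, here is a small sketch simulating one gradient coordinate as an independent linear diffusive (Ornstein-Uhlenbeck-type) process via Euler-Maruyama. The parameters theta, sigma, and the initial value are illustrative assumptions, not values from the paper.

```python
import numpy as np

def simulate_ou_gradient(theta, sigma, g0, eta, n_steps, rng=None):
    """Simulate one gradient component as an independent linear
    diffusive process dg_t = -theta * g_t dt + sigma dW_t, using
    Euler-Maruyama with step size eta. Illustrative sketch of the
    modeling assumption in Section 5.2, not the paper's code."""
    rng = rng or np.random.default_rng(0)
    g = float(g0)
    path = [g]
    for _ in range(n_steps):
        # Mean-reverting drift toward 0 plus a Brownian increment.
        g = g - eta * theta * g + sigma * np.sqrt(eta) * rng.standard_normal()
        path.append(g)
    return np.array(path)

# With sigma = 0 the component simply decays geometrically toward 0,
# which is the deterministic skeleton of the assumed model.
path = simulate_ou_gradient(theta=1.0, sigma=0.0, g0=2.0, eta=0.1, n_steps=50)
```

The point of the criticism is that real gradient coordinates are neither independent nor linear in this way, except possibly near a critical point where such a local linear model may be a reasonable approximation.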