Paper ID: 1344
Title: Stop Wasting My Gradients: Practical SVRG
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper extends the recently proposed stochastic optimization algorithm SVRG. The modifications mainly include: a convergence analysis of SVRG with a corrupted (inexact) full gradient; mixing SGD and SVRG iterations; a mini-batching strategy; and exploiting support vectors. For each modification, the authors give clear proofs and establish linear convergence under smoothness and strong-convexity assumptions. However, the paper's novelty is limited: the improvement in the convergence rate is not obvious, and the proof outline is very similar to that of the original SVRG. Key problems, such as support for non-strongly convex losses, remain unsolved. In addition, there are some small notational mistakes in the appendix.

Q2: Please summarize your review in 1-2 sentences
This paper makes some practical improvements to the recently proposed stochastic optimization algorithm SVRG, and these improvements are supported by theory and experimental results. However, the novelty and the performance gains are limited, the proof ideas are very similar to those of the original SVRG, and key problems such as support for non-strongly convex losses remain unsolved.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Comments:

- Tricks like non-uniform sampling combined with an importance-sampling correction have been around much longer; see e.g. [1], which claims a 3x speed-up. Those older approaches chose the sampling probability based on the error instead of L_i -- could you discuss/compare such an approach, at least empirically? I suspect this may be useful in practice, where the L_i may not be known. (A minimal sketch of such a correction is included after the reference below.)

- Channeling Leon Bottou: please discuss the limitations, and the cases where your method may *not* be applicable.

- I would really appreciate an open-source code release accompanying the paper, which would increase its impact.

- The experimental section and figures are clearly the weakest part of the paper; I'd recommend a serious second pass. For example, all plots should have error bars, state over how many runs they were averaged, and be labeled more clearly (SVRG -> full batch in the top plots); the test-error plot does not have the resolution to be readable (maybe a log scale is better, even if it won't go to zero). The same holds for the figures in the appendix. I also do not understand why there is no difference at all between uniform sampling and Lipschitz sampling (in experiment 3). Of course, a more challenging domain than logistic regression would be appreciated, but that can rightfully be shown in another paper.

[1] Geoff Hinton, "To recognize objects, first learn to generate images", 2007.
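For concreteness, here is a minimal sketch (my own, not code from the paper) of one SVRG inner iteration with such an importance-sampling correction; grad_i, probs, and mu are assumed names for a per-example gradient oracle, the sampling probabilities, and the snapshot full gradient, and rng is a numpy.random.Generator:

    def svrg_inner_step(x, x_snapshot, mu, grad_i, probs, step_size, rng):
        # probs[i] is the probability of sampling example i (e.g. proportional
        # to L_i, or to a running error estimate as in [1]).
        # The 1/(n * probs[i]) factor is the importance-sampling correction that
        # keeps the expected direction equal to the gradient of the average loss.
        n = len(probs)
        i = rng.choice(n, p=probs)
        w = 1.0 / (n * probs[i])
        direction = w * (grad_i(x, i) - grad_i(x_snapshot, i)) + mu
        return x - step_size * direction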

Typos:

- L19: a variant
- L53: increases over
- L77: expectation missing
- L96, L132: expectation symbol misformatted
- L164: B^s
- L289: can be used
- L316: superscript instead of subscript
- L325: sometimes
- L416: sometimes -> often?
Q2: Please summarize your review in 1-2 sentences
This is a solid and important paper about variance-reduced gradient descent. It has many good, new ideas, explains them clearly, and determines convergence rates for all its proposed variants. The paper's weakness is the experimental section.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors should be explicit about what this new method adds to existing stochastic optimization (e.g., Hu, Kwok, and Pan, 2009) and related mini-batch optimization (e.g., Konecny et al., 2013 and 2014), or Smola's work (e.g., Zinkevich, Weimer, Li, and Smola, 2010), including using such methods as baselines in the experiments.

Merely comparing the method to variants of itself is insufficient.

This is a crowded field.

It behooves the authors to explicitly delineate their novel contributions.

The paper is relatively well written, and the comparative bounds table (sec 8) is useful.

Q2: Please summarize your review in 1-2 sentences
The paper discusses the stochastic variance-reduced gradient method to speed convergence in optimization of a sum of strongly convex Lipschitz continuous functions. The method trades off inexact computation for speed, and it is not clear whether it represents a sufficient improvement over the state of the art in stochastic optimization to merit publication in NIPS.

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes several methods to speed up SVRG. The methods are a growing batch size, mixed SGD/SVRG iterations, and a new mini-batching scheme.

Specific comments:
1. It would be nice to have an empirical comparison against a well-tuned SGD implementation.
2. I'm not clear on how the results for "time to reach E_est + E_opt" for FG and SVRG were obtained. E_est should only depend on n. For alpha = 1, it should be O(1/n). It is unclear how the log^2(1/epsilon) and d^2 terms arise.
Q2: Please summarize your review in 1-2 sentences
This is a nice paper that proposes numerous modifications to SVRG.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper introduces several approaches to increase the convergence speed of a well-known stochastic gradient method called stochastic variance-reduced gradient (SVRG).

The paper is well written, and its arguments are easy to follow. The authors start with a short introduction to the SVRG method and subsequently explain three approaches for improving its convergence speed.

The paper begins with a key proposition showing that SVRG does not require a very accurate approximation of the full gradient of the objective function. The authors use this proposition to derive a batching SVRG algorithm with the same convergence rate as the original SVRG. They then propose a mixed stochastic gradient/SVRG approach and give a convergence proof for this scheme. As a further speed-up, they propose a technique tailored to the Huberized hinge-loss support vector machine.
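As a rough illustration of this batching idea (my own sketch, not the paper's exact algorithm or batch-size conditions; grad_i and batch_schedule are hypothetical names for a per-example gradient oracle and a batch-size schedule, and rng is a numpy.random.Generator):

    import numpy as np

    def batching_svrg(x0, grad_i, n, step_size, outer_iters, inner_iters,
                      batch_schedule, rng):
        x = x0.copy()
        for s in range(outer_iters):
            x_snap = x.copy()
            # Approximate the full gradient on a batch B^s instead of all n examples;
            # growing the batch across outer iterations shrinks the approximation error.
            batch = rng.choice(n, size=min(batch_schedule(s), n), replace=False)
            mu = np.mean([grad_i(x_snap, i) for i in batch], axis=0)
            for _ in range(inner_iters):
                i = rng.integers(n)  # uniform sampling in the inner loop
                x = x - step_size * (grad_i(x, i) - grad_i(x_snap, i) + mu)
        return x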

In addition to speeding-up strategies and their convergence analysis, the authors include similar analysis for the case of regularized SVRG algorithms. The authors provide propositions for the convergence of mini-batching strategies. Through extensive simulations, the authors show the improvements obtained by their methods.

Except for the minor comments stated below, I do not have any major comments for further improving the manuscript.

Minor Comments:
* On the right-hand side of the inequality in line 77, there should be an expectation.
* The question marks in the first line and in the middle of page 10 should be corrected.
Q2: Please summarize your review in 1-2 sentences
The topic of the paper is of interest to the NIPS community, and the analysis and simulation results are well executed.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank all the reviewers for taking the time to read and comment on our work. We will use the comments to improve the paper. Below we comment on some specific issues that were raised.

R1

These are good points regarding the experiments; we will update the plots following these suggestions. Note that uniform and Lipschitz sampling are the same in some plots because the rows of the data are normalized (Lipschitz+ can still give improvements here because it depends on the potentially smaller Lipschitz constant of the deterministic part).
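For concreteness, under the standard linear-model assumption (which we add here only for illustration) that f_i(x) = \ell(a_i^\top x) for a loss with \ell'' \le \beta (e.g. \beta = 1/4 for the logistic loss), the individual Lipschitz constants and sampling probabilities satisfy

    L_i = \beta\,\lVert a_i\rVert^2
    \quad\Longrightarrow\quad
    p_i = \frac{L_i}{\sum_{j=1}^{n} L_j} = \frac{1}{n}
    \quad\text{when } \lVert a_i\rVert = 1 \text{ for all } i,

so Lipschitz sampling coincides with uniform sampling on normalized data.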

Choosing the probabilities based on the error is an interesting suggestion. Proving a convergence rate in this setting may be hard, but for many models the errors are related to the *local* L_i values, which can give much faster rates than the global L_i values.

It was hard to fit in a discussion of limitations/extensions given the page limit, but we'll try to squeeze it in. We are planning to release an open-source implementation of SVRG and our various extensions (including using local L_i values).

R2

Unfortunately, we accidentally removed the last few sentences of Appendix H that showed how to obtain the new "time to reach E_est + E_opt" result. The subtlety in that result is that, if the condition number is large, we can choose the number of examples 'n' to be smaller than the total available in order to jointly drive (E_est + E_opt) to zero at a faster rate.

There are a few reasons we did not compare against SGD:
(i) beginning with Le Roux et al., there are already many empirical comparisons showing an advantage of linearly-convergent methods over tuned SGD;
(ii) our main contribution is improving on SVRG (as the only memory-free linearly-convergent method);
(iii) we are running the methods as parameter-free `black boxes', and SGD usually requires tuning for good performance.

R6

Please note that our introduction explicitly states why we are focusing on SVRG: it is the only linearly-convergent stochastic method that does not require memory. There is now ample evidence that linearly-convergent methods often outperform classic stochastic gradient methods, which only have sublinear convergence rates (the works of Hu et al. and Zinkevich et al. referenced in your review have sublinear rates). But the lack of a memory requirement clearly makes SVRG advantageous over other linearly-convergent methods in many applications. These two reasons make SVRG stand out in a crowded field and (as discussed in the paper) are why we focus on SVRG.

We ultimately removed a more thorough discussion contrasting our work with Konecny et al. in order to save space, but note that we still discuss this work explicitly in Section 7. In particular, their method uses mini-batches in the inner iteration to reduce the variance, but still uses full passes through the data to compute mu^s. Our method is complementary to theirs (the two modifications can be combined), and furthermore, in Section 7 we give tighter bounds on their method and propose a new method with better properties in many practical scenarios.
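For illustration only (a sketch we add here, not code from either paper; grad_i is a hypothetical per-example gradient oracle, mu the full-pass snapshot gradient, and rng a numpy.random.Generator), such a mini-batch inner update might look like:

    import numpy as np

    def minibatch_inner_step(x, x_snap, mu, grad_i, n, m, step_size, rng):
        # Average the variance-reduced direction over a mini-batch of size m,
        # while mu is still computed from a full pass over all n examples.
        batch = rng.choice(n, size=m, replace=False)
        diffs = [grad_i(x, i) - grad_i(x_snap, i) for i in batch]
        return x - step_size * (np.mean(diffs, axis=0) + mu)

Our batching strategy instead reduces the cost of computing mu itself, which is why the two modifications can be combined.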

R7

It is true that some of our proofs follow the same structure as the SVRG argument. However, we have extended these arguments in various novel and significant ways (allowing error in the full gradient, allowing regular SG steps, showing that the rate is improved by the regularized update, better mini-batch strategies, etc.). Further, some of the results (such as the generalization-error result) are based on completely different techniques.

Regarding the performance improvements, please see the bottom row of Figure 1 in the supplementary material. The difference between the 'SVRG' and 'Grow' methods is huge there, and this row corresponds to the largest datasets (the caption is unfortunately wrong in this particular figure). We could have done "dataset selection" and shown only these largest datasets to make the difference seem more significant, but we instead chose to plot performance on all datasets to give the reader the full picture. Further, the differences become even larger when we combine multiple ideas from the paper.

While the non-strongly convex case is interesting, in machine learning we often don't worry about this case because we can add an L2-regularizer to make the problem strongly-convex. In contrast, in this work we have analyzed many other important issues (e.g., improving performance on huge datasets, justifying the regularized update used in everyone's implementation, analyzing the generalization error, exploiting GPUs, etc.). Indeed, the generalization error in particular is one of the most important issues in the field of machine learning.

Nevertheless, as shown in http://arxiv.org/pdf/1312.1666v2.pdf, if we simply add an L2-regularizer with lambda=O(L/epsilon) then all of our results also give rates in the non-strongly convex case.
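For concreteness, this reduction applies our methods to the regularized surrogate

    F_\lambda(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} f_i(x) \;+\; \frac{\lambda}{2}\lVert x\rVert^2,

which is \lambda-strongly convex, so a linear rate on F_\lambda translates into an epsilon-accurate solution of the original non-strongly convex objective for the epsilon-dependent choice of \lambda analyzed in that reference.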