Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Quality This is solid work that exploits the observation that gradients of neighboring data points tend to be similar in order to modify the recently proposed SAGA variant of SGD. The advantage of the proposed algorithm is shown both theoretically (Section 3) and empirically (Section 4). However, I still think that establishing the neighborhoods of the data points in practice, as discussed in the paper, could be an issue that makes the computational gain from the proposed trick less attractive than claimed.
Clarity Section 3 is not particularly easy to read through and the authors may consider adding some higher level description of the overall ideas before proceeding to the formulations and theorems.
The paper argues that it is unique in being able to run in a streaming mode. This point is not clearly supported with evaluation results in the paper.
Also, there are some minor typos throughout the manuscript. In Eq. (3), should "i \in N_j" be "j \in N_i" instead?
Originality The trick of exploring the neighborhood is a reasonable extension of SAGA.
Significance Its significance lies in the fact that the proposed enhancement is a reasonable one and is carefully studied theoretically and empirically, with promising results as reported in the paper.
Q2: Please summarize your review in 1-2 sentences
The paper proposes enhanced versions of SGD that also update the stored gradients of neighboring data points, with the objective of speeding up convergence. The proposed algorithm and its variants can perform better than the recently proposed SAGA.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
NeighborhoodWatch is an existing algorithm for partial monitoring problems. Consider changing the title.
The plots are illegibly small.
Q2: Please summarize your review in 1-2 sentences
The paper presents a new method for reducing variance in stochastic gradient descent, making use of stored gradients for close-by data points. The paper provides theoretical guarantees that rely on a quantitative neighborhood assumption, and it is not clearly argued that this assumption will typically hold. The method is evaluated on real-world data, where it is shown to be competitive with the state-of-the-art competitor SAGA.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents an interesting improvement, N-SAGA, over SAGA and SVRG by considering the neighborhood structure of the data during the updates of w. Though the paper provides improved convergence rates and experimental results, it is not clear whether the new algorithm actually delivers substantial benefits in the streaming and single-epoch setting, as the authors claim; this matters because the work is otherwise incremental. Minor error: should the y-axis of Figure 3 be test error or objective?
Q2: Please summarize your review in 1-2 sentences
The paper presents an interesting improvement, N-SAGA, over SAGA and SVRG, SGD variants that attempt to reduce the variance of the stochastic updates of w. Though the idea appears interesting, it is not clear whether this "improvement" yields any significant gains over SAGA and SVRG.
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The contribution of the paper is between learning and optimisation.
In machine learning we often encounter optimisation problems of the kind
\sum_i l(x_i,w) + \Omega(w) \to \min_w,
where x_i are known datapoints, w is a vector of model parameters, l is a measure of loss and \Omega is a regularisation functional. Using gradient descent may be time-consuming if the number of datapoints is large. It is therefore common to sample from datapoints as a way of approximating the true gradient.
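For concreteness, a minimal sketch of such a sampled-gradient (SGD) update; the squared loss, l2 regulariser and constant step size here are illustrative choices, not the ones used in the paper:

    import numpy as np

    def sgd(X, y, lam=0.1, step=0.01, epochs=5, seed=0):
        # Minimise (1/n) * sum_i l(x_i, w) + Omega(w), with l a squared loss
        # and Omega(w) = (lam / 2) * ||w||^2, using one sampled point per update.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            for i in rng.permutation(n):
                g_i = (X[i] @ w - y[i]) * X[i] + lam * w  # stochastic gradient
                w -= step * g_i
        return w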
This paper proposes to take advantage of the structure of the set of datapoints. They often cluster into a small number of neighbourhoods. New algorithms exploiting this idea are developed. Convergence guarantees and empirical analysis are provided.
Q2: Please summarize your review in 1-2 sentences
The paper proposes an optimisation algorithm for machine learning utilising the fact that training points usually come from a small number of neighbourhoods.
Submitted by Assigned_Reviewer_5
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
-- Comments after rebuttal -- The rebuttal from the authors was concise.
However, I was not convinced about the assumption in Eq. 14 of the paper and how the
i.e., exploit similarities in data. Also in their point (B.2) the reviewers say "Eq. 14 should not be seen as an assumption but rather as one part of a trade-off. We can make \eta arbitrarily small by restricting the size of the neighborhood." These trade-offs are very hard to do in practice. This is in fact where the crux of their method lies. It would be very interesting if the authors had a concrete answer to that.
Still, I slightly increased my rating for this paper because, judging by the rebuttal,
it seems that the authors understand the shortcomings of this paper and how to resolve them. This should allow the paper to get published if other reviewers are convinced by the rebuttal.
----
The authors study a general formulation of stochastic variance reduction (SVRG)
methods and derive a non-asymptotic analysis under strong convexity.
This work is ambitious and creative. The idea of using neighborhood
structure between data points to share information is very smart,
and looks natural only in hindsight.
However, reading this paper I felt that it is work in progress with several
important details left unfinished. If I am mistaken I would be happy to recommend an accept.
On a high-level the paper aims to do many things at the same time, e.g.,
have a common analytic framework for SVRG methods, introduce and analyze the
idea of data neighborhoods in stochastic approximations, apply SVRG in streaming settings, etc. As a reader I felt that none of these goals were fully achieved.
Regarding the common analytic framework of SVRG methods
I struggled with the assumptions in Ineq. (14) and the uniformity assumptions for the memorization
algorithms (e.g., Lines #213-215).
Why are these assumptions reasonable? I am especially skeptical of Ineq. (14) because it uniformly bounds all past and present gradients and basically assumes everything is fine with the stochastic approximation. Although I could buy such an assumption for implicit procedures (e.g.,
implicit SGD) which often have these uniform bounds,
explicit procedures as in Eq. (2) generally do not.
Regarding the streaming setting, I am confused about how the authors define the streaming
setting and how they analyze and test it in practice. In Line #411 the authors note: "We have found very similar results in a streaming setting, i.e., presenting data once in random
order."
I don't think this defines a streaming setting, and it also seems to disagree with the authors'
discussion on the streaming setting in Abstract and intro sections. My own understanding is that in streaming settings one has access to infinite
data streamed one-by-one and then discarded. One key difficulty in the analysis of SVRG procedures in streaming settings is that one has no access to unbiased estimates of the full gradient.
How do the authors cope with this difficulty? I suspect they use Lipschitz bounds as Ineq. (14)
however this is not made explicit.
Another issue here -- potentially more important -- is that
in streaming settings the estimation problem is different. In finite data sets SVRG
methods -- as well as other methods -- compute the empirical risk minimizer, e.g.,
the MLE. However, in streaming settings one estimates the true model parameters. In the former case one approximates an estimator (e.g., MLE) and in the latter case one approximates the true model parameters. Thus, in the streaming case the "linear convergence rate" of
averaging methods (mentioned by the authors in the abstract) cannot be surpassed. In the non-streaming case one can geometrically converge to the empirical minimizer, however
this minimizer is an estimator that is roughly O(1/n) away from the true model parameters. It is the nature of stochastic learning problems where the 1/n simply cannot be surpassed. This is a key issue that causes confusion in stochastic optimization from my own experience. Therefore, it would be important that the authors clarify the streaming setting they
mention in the abstract, where it appears in the analysis, and how it is being tested in experiments. Under this light, it is unclear what they mean in Line #411.
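To make this distinction explicit: writing \hat{w}_n for the empirical risk minimizer and w^* for the true model parameters, the excess risk decomposes as R(w_t) - R(w^*) = [R(w_t) - R(\hat{w}_n)] + [R(\hat{w}_n) - R(w^*)]. On a fixed sample, a variance-reduced method can drive the first (optimization) term to zero at a geometric rate, but the second (statistical) term is of order 1/n under strong convexity and is unaffected by the choice of optimizer.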
Finally, the promising idea of using neighbors of data remains underdeveloped. In Lines #138-144 the authors attempt to give practical advice on the matter but they are not specific. I am unsure how one can
efficiently define these potentially O(n^2)
neighborhood relations. The related analysis is solid using the Lipschitz conditions,
but in the experiments it was unclear how the neighborhood SVRG variants were implemented.
Other issues:
- Figure 3 is not explained in the main text.
- Captions in figures are not informative.
- Line #53: the Robbins-Monro paper is not about SGD; it is about the much more general idea of stochastic approximation in the streaming setting.
Q2: Please summarize your review in 1-2 sentences
This paper is ambitious and creative. The idea of using neighborhood
structure between data points to share information is smart, but it is left underdeveloped and rests on very strong, unjustified assumptions.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 5000 characters. Note
however, that reviewers and area chairs are busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We would like to thank the reviewers for their
insightful comments and for spotting mistakes in the
manuscript.
There are a number of points that are straightforward
to address in a minor revision, in particular:
(A.1) Adding
precision to what we call "streaming mode". As pointed out correctly by
Reviewer_3, we need to be clearer about the memory footprint of our
method. Due to a tradeoff between the level of accuracy \eta in Eq. 14 and
the number of past gradients that we keep, the memory needed depends on
\eta and also on the data distribution. We will provide worst- and
best-case examples of such data distributions to get more precise results.
However, irrespective of the memory footprint, our algorithm can be run
such that a fresh example is "consumed" at each iteration. While this is
trivial for SGD (but without linear convergence), this mode is not
supported by SAGA, SDCA, or SVRG, which are either based on the concept of
batches (SVRG) or of revisiting previous data points (SAGA,
SDCA).
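As a purely schematic illustration of this mode of operation (not the pseudo-code of N-SAGA from the paper; the neighborhood assignment, gradient oracle and step size are placeholders, and the gradient memory is initialized to zero for simplicity), each iteration consumes one fresh example and reuses/updates a gradient stored per neighborhood:

    import numpy as np

    def streaming_pass(stream, grad_fn, assign_fn, d, step=0.01):
        # stream yields each data point once; assign_fn maps a point to a
        # neighborhood id; g_mem keeps one stored gradient per neighborhood,
        # so its size (the memory footprint) depends on how finely the
        # neighborhoods are chosen.
        w = np.zeros(d)
        g_mem = {}                            # neighborhood id -> stored gradient
        g_sum = np.zeros(d)                   # sum of all stored gradients
        for x in stream:
            k = assign_fn(x)
            g_new = grad_fn(w, x)             # fresh stochastic gradient
            g_old = g_mem.get(k, np.zeros(d)) # zero if neighborhood unseen
            g_bar = g_sum / max(len(g_mem), 1)
            w -= step * (g_new - g_old + g_bar)  # SAGA-style corrected step
            g_sum += g_new - g_old
            g_mem[k] = g_new
        return w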
(A.2) Presenting experimental results for the streaming data
setting more clearly and with more detail on the utilized data structure
(pointed out by Reviewer_3) as well as fixing some issues with axis
labels (pointed out by Reviewer_1).
(A.3) Improving the presentation and writing in light of the detailed feedback.
However,
Reviewer_3 points out a few aspects of our analysis that are more
fundamental, to which we would like to respond.
(B.1) We agree that
the assumption in Eq. 14 may be quite restrictive. How small \eta can be
chosen depends on the data distribution and so does the efficiency of our
algorithm. We do not claim - and maybe should be even more explicit about
this fact - that our method offers advantages for *all* data
distributions. Compared to much of the work in this area, this can be seen
as a weakness. However, this dependency can also be seen as a strength as
it also opens up new possibilities for distribution or data-set dependent
speed-ups. In many practical data sets, we have naturally arising
redundancies and many training instances may be quite similar (in the
input space). Many documents (text categorization), users (preference
prediction), images (object recognition), or time series signals (speech
recognition) in a training set are alike. This fact is not systematically
exploited by any existing stochastic optimization method! So what may be
"questionable" assumptions from the point-of-view of optimization theory,
can well be "reasonable" when dealing with risk minimization for machine
learning. We feel that our paper is innovative in exactly providing this
change of perspective.
(B.2) While our method tries to exploit
self-similarity properties among training instances, it does not simply
fail if this assumption is not satisfied. In the practical mode of
operation of N-SAGA, Eq. 14 should not be seen as an assumption but rather
as one part of a trade-off. We can make \eta arbitrarily small by
restricting the size of the neighborhood. Of course, as we reduce q (the
average size of the neighborhoods) to q=1 in the extreme case, we also
lose any speed-up relative to SAGA. So in practice, one would start with
specifying a \delta (approximation accuracy), then choose \rho(\delta) and
then construct neighborhoods such that Eq. 14 will be approximately
fulfilled (e.g. by comparing the difference in stochastic
gradients).
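A minimal sketch of this recipe (the thresholding on stochastic-gradient differences at a fixed reference point and the greedy grouping below are illustrative choices, not the exact construction in the paper):

    import numpy as np

    def build_neighborhoods(grads, eps):
        # grads: (n, d) array of stochastic gradients evaluated at some
        # reference point w0; eps plays the role of the accuracy target.
        # Greedily assign each point to the first neighborhood whose
        # representative gradient is within eps, else open a new one.
        reps, groups = [], []
        for i, g in enumerate(grads):
            for k, r in enumerate(reps):
                if np.linalg.norm(g - r) <= eps:
                    groups[k].append(i)
                    break
            else:
                reps.append(g)
                groups.append([i])
        return groups

Making eps smaller yields more, smaller neighborhoods (q closer to 1, and hence less speed-up), while making it larger yields fewer, coarser neighborhoods and a larger \eta.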
(B.3) Empirical vs. expected risk. Of course,
Reviewer_3 is absolutely right. There are two things to point out
here: (i) We are not just interested in asymptotics, but also in the
transient regime (as stated in the paper), i.e. quickly getting to a decent
level of accuracy for the empirical risk. This accuracy may be sufficient
by itself (going further might result in overfitting) or it may be a good
starting point for a method that is guaranteed to converge to the
empirical minimizer. Taking N-SAGA further, one could perhaps even control
the neighborhoods more dynamically. (ii) Of course, we do not claim to
beat the statistical limit of O(1/n) efficiency for the generalization
error :) However, maybe our use of the "streaming setting" was misleading
(see above) and the writing will be improved to avoid
confusion.
Conclusion: We can see that our assumptions appear to be
"strong" from an optimization point of view as they depend on the
data distribution. However, we think that this opens up new possibilities
for innovative stochastic gradient algorithms in machine learning, where
natural data distributions may obey certain characteristics that are not
yet exploited. We think that - with the few revisions and clarifications
mentioned - our work is a first step in this direction with a novel
analysis and experimentally encouraging results. As always, there is more
future research to be done here, but we feel that the paper has an
end-to-end story and with a bit of "polishing" will make a solid paper
that can be built upon.