
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper looks at the problem of optimizing stochastic locally-quasiconvex functions. This is a larger class than that of convex functions, and the authors provide an example of a function related to training deep networks which is locally-quasiconvex without being convex. They devise an algorithm, based on normalized gradient descent applied to a minibatch at every iteration, that converges with high probability to a point $\epsilon$-close to the true minimum in $O(1/\epsilon^2)$ time for SLQC functions, and in $O(1/\epsilon)$ time if the function also has local smoothness properties.
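The minibatch normalized-gradient scheme summarized above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the toy objective are assumptions for the example:

```python
import numpy as np

def sngd(grad_fn, x0, lr, T, batch_size, rng=None):
    """Stochastic Normalized Gradient Descent (SNGD) sketch.

    grad_fn(x, rng, batch_size) returns a minibatch estimate of the
    gradient at x. Each step moves a fixed distance lr in the direction
    of the normalized minibatch gradient; ignoring the gradient's
    magnitude is what lets the method traverse plateaus.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        g = grad_fn(x, rng, batch_size)
        norm = np.linalg.norm(g)
        if norm > 0:  # skip degenerate zero-gradient batches
            x = x - lr * (g / norm)
    return x

# Toy usage: minimize f(x) = ||x||^2 with noisy minibatch gradients.
grad = lambda x, rng, b: 2 * x + rng.normal(scale=0.01 / np.sqrt(b), size=x.shape)
x_star = sngd(grad, x0=np.ones(3), lr=0.05, T=500, batch_size=10)
```

With a fixed step size the iterates do not converge exactly but oscillate within roughly `lr` of the minimizer, which matches the $\epsilon$-close guarantee discussed above.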
The paper is well motivated and the mathematics looks clean and easy to follow. The authors also give the examples of GLMs and noisy GLMs, which fall within the domain of SLQC functions, so their algorithm can be used to optimize these classes of objectives as well. There does not seem to be much previous work on optimizing locally-quasiconvex objectives, and this class of functions is well motivated.
It would be good to see more experiments on other neural-network-related objectives, as well as a comparison of the algorithm's performance with the Isotron algorithm. Finally, the analogy about gradient descent jumping between very small and very large gradients under SGD is very nice. However, it would be good to see a mathematical example related to this. In particular, it would be good to have an example of a function which is SLQC and has very large gradient values (preventing SGD from converging if run on it directly).
Q2: Please summarize your review in 1-2 sentences
This paper looks at the problem of optimizing stochastic locally-quasiconvex functions. This is a larger class than that of convex functions, and the authors provide an example of a function related to training deep networks which is locally-quasiconvex without being convex. They devise an algorithm, based on normalized gradient descent applied to a minibatch at every iteration, that converges with high probability to a point $\epsilon$-close to the true minimum in $O(1/\epsilon^2)$ time for SLQC functions, and in $O(1/\epsilon)$ time if the function also has local smoothness properties.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary: This paper analyzes the stochastic version of normalized gradient descent (NGD), the first effort to explore the efficacy and properties of stochastic normalized gradient descent (SNGD). In order to verify the benefits of NGD for nonconvex optimization problems, this paper introduces a new property, local-quasiconvexity, and proves convergence to a global minimum. In particular, the authors prove that NGD finds an $\epsilon$-optimal minimum for locally-quasiconvex functions within $O(1/\epsilon^2)$ iterations. In addition, this paper introduces a new setup, stochastic optimization of locally-quasiconvex functions, in which the gradient is estimated using a minibatch of examples. Empirically, this paper reports results from training deep neural networks, comparing with the state-of-the-art methods minibatch SGD and Nesterov's accelerated gradient method.
Quality: This paper is technically sound, without obvious mistakes. The theoretical analyses are easy to follow and achieve the expected outcome. The empirical study shows the improvement of the proposed algorithm over existing methods. This is the first effort toward a provable gradient-based algorithm for stochastic quasiconvex optimization. Nevertheless, the model used for training is a relatively small network, namely a single hidden layer of 100 units. This setup is very likely too easy to show the advantage of the algorithm, not to mention real applications of training deep neural networks on large datasets.
Clarity:
This paper is easy to follow and well structured. Several typos do exist, however. For example, line 37: 'SGD is guaranteed find' should be 'SGD is guaranteed to find'; line 134: 'Inforamlly' should be 'Informally'; line 185: 'its is' should be 'it is'; line 250: 'and and' should be 'and'; etc. The draft should be checked more carefully if accepted.
Originality: In keeping with the NIPS objective of rewarding new unexplored areas and research directions (more than incremental advancement of the state of the art), this paper studies an important problem in nonconvex optimization and makes the first effort toward provable results, which deserves a high score for originality. It is significant to move the focus from stochastic convex optimization to stochastic nonconvex optimization.
Significance: This work studies an important research problem and is in a good research direction, since stochastic nonconvex optimization has shown its importance for several years and needs to be solved with solid theoretical guarantees. The authors define the property of local-quasiconvexity to extend quasiconvexity and capture unimodal functions which are not quasiconvex, which then leads to a proof of convergence to the global optimum. However, the empirical studies shown in the paper are not comprehensive or convincing enough to show that the proposed algorithm could work for training deep neural networks, since only a single-layer model is presented.
Other comments:
The key strength of this paper is the introduction of the property of local-quasiconvexity, which extends quasiconvexity and captures unimodal functions that are not quasiconvex. The authors propose solutions for locally-quasiconvex optimization with NGD and strictly-locally-quasiconvex optimization with stochastic NGD, both with theoretical proofs.
The main concern about this paper lies in the empirical studies. I believe that the proposed algorithm ultimately aims to solve the problem of training deep neural networks. However, in this paper, the model used for training is too simple, just a single layer with 100 hidden units. Therefore, whether the proposed algorithm could really beat SGD when training deep neural networks in practice is yet to be verified.
Q2: Please summarize your review in 1-2 sentences
The research direction of the paper is good, and it represents the first effort to generalize well-studied stochastic convex optimization to the challenging nonconvex optimization problems facing the machine learning community. Although the empirical study is not very mature, it represents a starting effort.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper introduces a stochastic version of the normalized gradient descent (NGD) algorithm. The author(s) also introduces the concept of strictly-locally-quasiconvex (SLQC) functions, and shows that both the noiseless and noisy generalized linear models satisfy the SLQC requirements (although these proofs are deferred to appendices B and C, respectively). The author(s) then proves convergence rates for both the NGD and SNGD algorithms on SLQC functions. Finally, on a very simple neural network architecture (an MLP with 100 hidden units, trained on MNIST), the proposed SNGD algorithm is compared to minibatch SGD and SGD with momentum, which are now standard optimization approaches for neural networks. Both the test and training errors with SNGD match (if not beat; though no error bars or repetition numbers are given) those of the momentum approach.
Minor points:
- Obviously, the manuscript is slightly over the page limit.
- Missing periods: lines 192, 215, 377.
- Abstract: the second half of the abstract doesn't read like an abstract; consider highlighting the proven convergence rate and the synthetic experiment.
- Introduction:
  o line 37: "is guaranteed [to] find"
  o line 102: "phenomenon" -> "phenomena"
- Section 2: slightly inconsistent notation for the ball, specifically with regard to the subscript d.
- Section 3:
  o line 204: broken sentence
  o line 208: "sssume" -> "assume"
- Section 5:
  o line 318: strictly speaking, the minibatch constituents psi_i(x) could use an index t, but this might clutter notation unnecessarily.
  o line 329: "a common global minima": either remove the leading "a" or replace "minima" with "minimum".
  o lines 343 and 364: strange use of "then".
- Discussion: "hereby" -> "herein"
- References: need fixing, e.g., [1] includes "et al."; [11] formatting is inconsistent with the rest.
- Appendices: title of H is wrong.
Q2: Please summarize your review in 1-2 sentences
The paper is clear and well written. The algorithm is well motivated, with accompanying theoretical results to guarantee convergence, as well as a simple proof-of-concept experiment on real data (training an MLP on MNIST).
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note, however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We thank the reviewers for their time and effort, and appreciate their useful comments. We will address all of them in the final version of the paper.
Reviewer 1:
===========
*Experimental part: we presented small-scale experiments as a proof-of-concept demonstrating the applicability of the NGD algorithm for nonconvex optimization tasks, to validate our theoretical findings. We consider this a first step in examining the performance of NGD on larger-scale tasks.
Reviewer 3:
===========
*"The proposed algorithm is not new in the sense that most practitioners would use minibatch SGD and make sure to normalize the data and/or solve a constrained optimization variant using a simple projection. This might prevent some of the pathological cases from causing trouble."
Normalizing the data and using SGD is not equivalent to using NGD (which normalizes the gradients). For example, consider the idealized GLM problem with sigmoid activation (depicted in Section 3.1.1): even if we normalize the training data to lie on the unit sphere, plateaus may still occur.
Reviewer 5:
===========
*"It would be good to see some more experiments for other objectives related to neural network objectives as well as any comparison of the performance of the algorithm with the Isotron algorithm"
It does seem interesting to compare the performance of NGD to the Isotron, and to examine the performance on larger-scale networks. We certainly intend to pursue this direction in future work.
*"Finally, the analogy given about gradient descent jumping between very low and very large gradients using SGD is very nice. However it would be good to see a mathematical example related to this. In particular, it would be good to have an example of a function which is SLQC and would have very large gradient values"
Thank you for the suggestion. We will consider adding such an example to Section 1.2 in the final version of the paper.

