{"title": "Adaptive Negative Curvature Descent with Applications in Non-convex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 4853, "page_last": 4862, "abstract": "Negative curvature descent (NCD) method has been utilized to design deterministic or stochastic algorithms for non-convex optimization aiming at finding second-order stationary points or local minima. In existing studies, NCD needs to approximate the smallest eigen-value of the Hessian matrix with a sufficient precision (e.g., $\\epsilon_2\\ll 1$) in order to achieve a sufficiently accurate second-order stationary solution (i.e., $\\lambda_{\\min}(\\nabla^2 f(\\mathbf{x}))\\geq -\\epsilon_2$). One issue with this approach is that the target precision $\\epsilon_2$ is usually set to be very small in order to find a high quality solution, which increases the complexity for computing a negative curvature. To address this issue, we propose an adaptive NCD to allow for an adaptive error dependent on the current gradient's magnitude in approximating the smallest eigen-value of the Hessian, and to encourage competition between a noisy NCD step and gradient descent step. 
We consider the applications of the proposed adaptive NCD for both deterministic and stochastic non-convex optimization, and demonstrate that it can help reduce the overall complexity in computing the negative curvatures during the course of optimization without sacrificing the iteration complexity.", "full_text": "Adaptive Negative Curvature Descent with Applications in Non-convex Optimization

Mingrui Liu†, Zhe Li†, Xiaoyu Wang‡, Jinfeng Yi♮, Tianbao Yang†
†Department of Computer Science, The University of Iowa, Iowa City, IA 52242, USA
‡Intellifusion  ♮JD AI Research
{mingrui-liu, tianbao-yang}@uiowa.edu

Abstract

Negative curvature descent (NCD) method has been utilized to design deterministic or stochastic algorithms for non-convex optimization aiming at finding second-order stationary points or local minima. In existing studies, NCD needs to approximate the smallest eigen-value of the Hessian matrix with a sufficient precision (e.g., ε2 ≪ 1) in order to achieve a sufficiently accurate second-order stationary solution (i.e., λmin(∇²f(x)) ≥ −ε2). One issue with this approach is that the target precision ε2 is usually set to be very small in order to find a high quality solution, which increases the complexity for computing a negative curvature. To address this issue, we propose an adaptive NCD to allow an adaptive error dependent on the current gradient's magnitude in approximating the smallest eigen-value of the Hessian, and to encourage competition between a noisy NCD step and gradient descent step. 
We consider the applications of the proposed adaptive NCD for both deterministic and stochastic non-convex optimization, and demonstrate that it can help reduce the overall complexity in computing the negative curvatures during the course of optimization without sacrificing the iteration complexity.

1 Introduction

In this paper, we consider the following optimization problem:

min_{x∈R^d} f(x),   (1)

where f(x) is a non-convex smooth function with Lipschitz continuous Hessian, which could have some special structure (e.g., an expectation structure or a finite-sum structure). A standard measure of an optimization algorithm is how fast the algorithm converges to an optimal solution. However, finding the global optimal solution to a generally non-convex problem is intractable [13] and is even an NP-hard problem [10]. Therefore, we aim to find an approximate second-order stationary point with:

‖∇f(x)‖ ≤ ε1  and  λmin(∇²f(x)) ≥ −ε2,   (2)

which nearly satisfies the second-order necessary optimality conditions, i.e., ∇f(x*) = 0, λmin(∇²f(x*)) ≥ 0, where ‖·‖ denotes the Euclidean norm and λmin(·) denotes the smallest eigen-value function. In this work, we refer to a solution that satisfies (2) as an (ε1, ε2)-second-order stationary solution. When the function is non-degenerate (i.e., strict saddle, or the Hessians at all saddle points have a strictly negative eigen-value), the solution satisfying (2) is close to a local minimum for sufficiently small 0 < ε1, ε2 ≪ 1. Please note that in this paper we do not follow the tradition of [14] that restricts ε2 = √ε1. One reason is for more generality, which allows us to compare several recent results; another reason is that having different accuracy levels for the first-order and the second-order guarantees brings more flexibility in the choice of our algorithms.

Recently, there has emerged a surge of studies interested in finding an approximate second-order stationary point that satisfies (2). An effective technique used in many algorithms is negative curvature descent (NCD), which utilizes a negative curvature direction to decrease the objective value. NCD has two additional benefits: (i) escaping from non-degenerate saddle points; (ii) searching for a region where the objective function is almost-convex, which enables accelerated gradient methods. It has been leveraged to design deterministic and stochastic non-convex optimization algorithms with state-of-the-art time complexities for finding a second-order stationary point [4, 17, 16, 2]. A common feature of these algorithms is that they need to compute a negative curvature direction that approximates the eigenvector corresponding to the smallest eigen-value to an accuracy level matching the target precision ε2 on the second-order information, i.e., finding a unit vector v such that λmin(∇²f(x)) ≥ v⊤∇²f(x)v − ε2/2. The approximation accuracy has a direct impact on the complexity of computing the negative curvature. For example, when the Lanczos method is utilized for computing the negative curvature, its complexity (or the number of Hessian-vector products) is in the order of Õ(1/√ε2). One potential issue is that the target precision ε2 is usually set to be very small in order to find a high quality solution, which increases the complexity for computing a negative curvature, e.g., the number of Hessian-vector products used in the Lanczos method.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we propose an adaptive NCD step based on the full or a sub-sampled Hessian that uses a noisy negative curvature to update the solution, with the error of approximating the smallest eigen-value adaptive to the magnitude of the (stochastic) gradient at the time of invocation. A novel result is that, for an iteration t that requires a negative curvature direction, it is enough to compute a noisy negative curvature that approximates the smallest eigen-vector with a noise level of max(ε2, ‖g(xt)‖^α), where g(xt) is the gradient or mini-batch stochastic gradient at the current solution xt and α ∈ (0, 1] is a parameter that characterizes the relationship between ε2 and ε1, i.e., ε2 = ε1^α. It implies that the Lanczos method only needs Õ(1/√max(ε2, ‖g(xt)‖^α)) Hessian-vector products for computing such a noisy negative curvature. Another feature of the proposed adaptive NCD step is that it encourages competition between a negative curvature descent step and a gradient descent step to guarantee a maximal decrease of the objective value. Building on the proposed adaptive NCD step, we design two simple algorithms that enjoy second-order convergence for deterministic and stochastic non-convex optimization. Furthermore, we demonstrate the applications of the proposed adaptive NCD steps in existing deterministic and stochastic optimization algorithms to match the state-of-the-art worst-case complexity for finding a second-order stationary point. 
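To make the adaptive accuracy concrete, here is an illustrative negative curvature search using only Hessian-vector products. This is our own sketch, not the paper's routine: it runs power iteration on the shifted operator L1·I − H instead of the Lanczos method, and it ties the iteration budget to the adaptive level max(ε2, ‖g(x)‖^α), mirroring the Õ(1/√ε) Hessian-vector-product count; all function names are ours.

```python
import numpy as np

def ncs(hvp, d, eps2, g_norm, alpha, L1, seed=0):
    """Noisy negative curvature search via Hessian-vector products only.

    The accuracy is the adaptive level eps = max(eps2, ||g(x)||^alpha);
    the iteration budget grows like 1/sqrt(eps).  Power iteration on the
    shifted operator (L1*I - H) converges to the eigenvector of H with
    the smallest eigenvalue, since L1 >= ||H||_2.
    """
    eps = max(eps2, g_norm ** alpha)
    iters = max(50, int(10 / np.sqrt(eps)))   # larger eps => cheaper search
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = L1 * v - hvp(v)                   # apply (L1*I - H)
        v = w / np.linalg.norm(w)
    return v, v @ hvp(v)                      # direction and Rayleigh quotient

# toy check: diagonal Hessian with one negative eigenvalue
H = np.diag(np.array([-0.8, 0.5, 1.0, 2.0]))
v, lam = ncs(lambda u: H @ u, d=4, eps2=0.01, g_norm=0.3, alpha=0.5, L1=2.0)
```

On this toy Hessian the returned Rayleigh quotient approaches λmin(H) = −0.8, and a large gradient norm would have permitted an even smaller iteration budget.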
However, the adaptive nature of the developed algorithms makes them perform better than their counterparts using the standard NCD step.

2 Related Work

There have been several recent studies that explicitly explore the negative curvature direction for updating the solution. Here, we emphasize the differences between the development in this paper and previous works. Curtis and Robinson [7] proposed an algorithm similar to one of our deterministic algorithms except for how to compute the negative curvature. The key difference between our work and [7] lies in that they ignored the computational costs for computing the (approximate) negative curvature. In addition, they considered a stochastic version of their algorithms but provided no second-order convergence guarantee. In contrast, we also develop a stochastic algorithm with a provable second-order convergence guarantee.

Royer and Wright [17] proposed an algorithm that utilizes the negative gradient direction, the negative curvature direction, the Newton direction and the regularized Newton direction together with line search in a unified framework, and also analyzed the time complexity of a variant with inexact calculations of the negative curvature by the Lanczos algorithm and of the (regularized) Newton directions by the conjugate gradient method. The comparison between their algorithm and ours shows that (i) we only use the gradient and the negative curvature directions; (ii) the time complexity for computing an approximate negative curvature in their work is also of the order of Õ(1/√ε2); (iii) the time complexity of one of our deterministic algorithms is at least the same as and usually better than their time complexity. Additionally, their conjugate gradient method could fail due to the inexact smallest eigen-value computed by the randomized Lanczos method, and their first-order and second-order convergence guarantees could be on different points.

Carmon et al. [4] developed an algorithm that utilizes negative curvature descent to reach a region that is almost convex and then switches to an accelerated gradient method to decrease the magnitude of the gradient. One of our algorithms is built on this development by replacing their negative curvature descent with our adaptive negative curvature descent, which has the same guarantee on the smallest eigen-value of the returned solution but uses a much smaller number of Hessian-vector products. In addition, we also show that an inexact Hessian can be used in place of the full Hessian to enjoy the same iteration complexity. Several studies revolve around solving a cubic regularization step [1, 18], which also requires a negative curvature direction.

Recently, several stochastic algorithms use the negative curvature information to derive the state-of-the-art time complexities for finding a second-order stationary point for non-convex optimization [16, 2, 19, 3]; they combine existing stochastic first-order algorithms with a NCD method, with differences lying in how to compute the negative curvature. In this work, we also demonstrate the applications of the proposed adaptive NCD for stochastic non-convex optimization, and develop several stochastic algorithms that not only match the state-of-the-art worst-case time complexity but also enjoy an adaptively smaller time complexity for computing the negative curvature. 
We emphasize that the proposed adaptive NCD could be used in future developments of non-convex optimization.

3 Preliminaries and Warm-up

In this work, we will consider two types of non-convex optimization problems: a deterministic objective, where the gradient ∇f(x) and Hessian ∇²f(x) can be computed, and a stochastic objective f(x) = E_ξ[f(x; ξ)], where only the stochastic gradient ∇f(x; ξ) and stochastic Hessian ∇²f(x; ξ) can be computed. We note that a finite-sum objective can be considered as a stochastic objective. The goal of the paper is to design algorithms that can find an (ε1, ε2)-second-order stationary point x that satisfies (2). For simplicity, we consider ε2 = ε1^α for α ∈ (0, 1].

A function f(x) is smooth if its gradient is Lipschitz continuous, i.e., there exists L1 > 0 such that ‖∇f(x) − ∇f(y)‖ ≤ L1‖x − y‖ holds for all x, y. The Hessian of a twice differentiable function f(x) is Lipschitz continuous if there exists L2 > 0 such that ‖∇²f(x) − ∇²f(y)‖2 ≤ L2‖x − y‖ for all x, y, where ‖X‖2 denotes the spectral norm of a matrix X. A function f(x) is called μ-strongly convex (μ > 0) if f(y) ≥ f(x) + ∇f(x)⊤(y − x) + (μ/2)‖y − x‖² for all x, y. If f(x) satisfies the above condition for μ < 0, it is referred to as γ-almost convex with γ = −μ.

Throughout the paper, we make the following assumptions.

Assumption 1. For the optimization problem (1), we assume:

(i) the objective function f(x) is twice differentiable;
(ii) it has an L1-Lipschitz continuous gradient and an L2-Lipschitz continuous Hessian;
(iii) given an initial solution x0, there exists Δ < ∞ such that f(x0) − f(x*) ≤ Δ, where x* denotes the global minimum of (1);
(iv) if f(x) is a stochastic objective, we assume each random function f(x; ξ) is twice differentiable and has an L1-Lipschitz continuous gradient and an L2-Lipschitz continuous Hessian, and its stochastic gradient has exponential tail behavior, i.e., E[exp(‖∇f(x; ξ) − ∇f(x)‖²/G²)] ≤ exp(1) holds for any x ∈ R^d;
(v) a Hessian-vector product can be computed in O(d) time.

In this paper, we assume there exists an algorithm that can compute a unit-length negative curvature direction v ∈ R^d of a function f(x) satisfying

λmin(∇²f(x)) ≥ v⊤∇²f(x)v − ε   (3)

with high probability 1 − δ. We refer to such an algorithm as NCS(f, x, ε, δ) and denote its time complexity by Tn(f, ε, δ, d), where NCS is short for negative curvature search. There exist algorithms, with theoretical guarantees, that implement negative curvature search (NCS) for two different cases, deterministic objectives and stochastic objectives; we provide them in the supplement. To facilitate the discussion in the following sections, we summarize the results here.

Lemma 1. For a deterministic objective, the Lanczos method finds a unit vector v satisfying (3) with a time complexity of Tn(f, ε, δ, d) = Õ(d/√ε). For a stochastic objective f(x) = E_ξ[f(x; ξ)], there exists a randomized algorithm that produces a unit vector v satisfying (3) with a time complexity of Tn(f, ε, δ, d) = Õ(d/ε²). If f(x) has a finite-sum structure with m components, then a randomized algorithm exists that produces a unit vector v satisfying (3) with a time complexity of Tn(f, ε, δ, d) = Õ(d(m + m^{3/4}√(1/ε))), where Õ suppresses a logarithmic term in δ, d, 1/ε.

Algorithm 1 AdaNCDdet(x, α, δ, ∇f(x))
1: Apply NCS(f, x, max(ε2, ‖∇f(x)‖^α), δ) to find a unit vector v satisfying (3)
2: if 2(−v⊤∇²f(x)v)³/(3L2²) > ‖∇f(x)‖²/(2L1) then
3:   x+ = x − (2|v⊤∇²f(x)v|/L2) · sign(v⊤∇f(x)) v
4: else
5:   x+ = x − (1/L1) ∇f(x)
6: end if
7: Return x+, v

Algorithm 2 AdaNCDmb(x, α, δ, S, g(x))
1: Apply NCS(fS, x, max(ε2, ‖g(x)‖^α), δ) to find a unit vector v satisfying (3) for fS
2: if 2(−v⊤HS(x)v)³/(3L2²) − ε2|v⊤HS(x)v|²/(6L2²) > ‖g(x)‖²/(4L1) − ε′²/L1 then
3:   x+ = x − (2|v⊤HS(x)v|/L2) z v, where z ∈ {1, −1} is a Rademacher random variable
4: else
5:   x+ = x − (1/L1) g(x)
6: end if
7: Return x+, v

4 Adaptive Negative Curvature Descent Step

In this section, we present several variants of the adaptive negative curvature descent (AdaNCD) step for different objectives and with different available information. 
We also present their guarantee on decreasing the objective function.

4.1 Deterministic Objective

For a deterministic objective, when a negative curvature of the Hessian matrix ∇²f(x) at a point x is required, the gradient ∇f(x) is readily available. We utilize this information to design the AdaNCD shown in Algorithm 1. First, we compute a noisy negative curvature v that approximates the smallest eigen-value of the Hessian at the current point x up to a noise level ε = max(ε2, ‖∇f(x)‖^α). Then we take either the noisy negative curvature direction or the negative gradient direction, depending on which decreases the objective value more. This is done by comparing the estimations of the objective decrease for following these two directions, as shown in the comparison step of Algorithm 1. Its guarantee on the objective decrease is stated in the following lemma, which will be useful for proving convergence to a second-order stationary point.

Lemma 2. When v⊤∇²f(x)v ≤ 0, Algorithm 1 (AdaNCDdet) provides a guarantee that

f(x) − f(x+) ≥ max( 2|v⊤∇²f(x)v|³/(3L2²), ‖∇f(x)‖²/(2L1) ).

4.2 Stochastic Objective

For a stochastic objective f(x) = E_ξ[f(x; ξ)], we assume a noisy gradient g(x) that satisfies (4) (with high probability) is available when computing the negative curvature at x:

‖g(x) − ∇f(x)‖ ≤ ε′.   (4)

This can be met by using a mini-batch stochastic gradient g(x) = (1/|S1|) Σ_{ξ∈S1} ∇f(x; ξ) with a sufficiently large batch size (see Lemma 9 in the supplement). We can use a NCS algorithm to compute a negative curvature based on a mini-batched Hessian. 
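As a concrete illustration of the competition described above, the following is a minimal numpy sketch of the deterministic AdaNCD step; it assumes the noisy direction v has already been produced by an NCS routine, and the function names are ours, not the paper's implementation.

```python
import numpy as np

def ada_ncd_det(x, g, hvp, v, L1, L2):
    """One deterministic AdaNCD step (the shape of Algorithm 1): take
    whichever of the negative curvature step and the gradient step has
    the larger guaranteed decrease,
        2|v^T H v|^3 / (3 L2^2)   versus   ||grad||^2 / (2 L1)."""
    vHv = v @ hvp(v)
    curv_gain = 2.0 * max(-vHv, 0.0) ** 3 / (3.0 * L2 ** 2)
    grad_gain = g @ g / (2.0 * L1)
    if curv_gain > grad_gain:
        s = np.sign(v @ g) or 1.0          # break ties when v is orthogonal to g
        return x - (2.0 * abs(vHv) / L2) * s * v
    return x - g / L1

# saddle-like toy: f(x) = 0.5 x^T diag(-1, 2) x
A = np.diag(np.array([-1.0, 2.0]))
f = lambda x: 0.5 * x @ (A @ x)
x = np.array([0.05, 0.1])
x_new = ada_ncd_det(x, A @ x, lambda u: A @ u, np.array([1.0, 0.0]), L1=2.0, L2=1.0)
```

Near this saddle the curvature estimate dominates the gradient estimate, so the step follows the negative curvature direction and the objective drops sharply, in line with the decrease guaranteed by Lemma 2.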
To this end, let HS(x) = (1/|S|) Σ_{ξ∈S} ∇²f(x; ξ), where S denotes a set of random samples, satisfy the following inequality (with high probability):

‖HS(x) − ∇²f(x)‖2 ≤ ε2/12.   (5)

The inequality (5) holds with high probability when S is sufficiently large, due to the exponential tail behavior of ‖HS(x) − ∇²f(x)‖2 stated in Lemma 8 in the supplement. Denote by fS = (1/|S|) Σ_{ξ∈S} f(·; ξ). A variant of AdaNCD using such a mini-batched Hessian is presented in Algorithm 2, where z is a Rademacher random variable, i.e., z = 1, −1 with equal probability. Lemma 3 provides the objective decrease guarantee of Algorithm 2.

Lemma 3. When v⊤HS(x)v ≤ 0 and (5) holds (with high probability), Algorithm 2 (AdaNCDmb) provides a guarantee (with high probability) that

f(x) − E[f(x+)] ≥ max{ 2(−v⊤HS(x)v)³/(3L2²) − ε2|v⊤HS(x)v|²/(6L2²), ‖g(x)‖²/(4L1) − ε′²/L1 }.

If v⊤HS(x)v ≤ −ε2/2, we have

f(x) − f(x+) ≥ max( ε2³/(24L2²), ‖g(x)‖²/(4L1) − ε′²/L1 ).

Remark: When the objective has a finite-sum structure, Algorithm 2 is also applicable, where the noisy gradient g(x) can be replaced with the full gradient ∇f(x). 
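The mini-batch variant differs from the deterministic step in that each guaranteed-decrease estimate is discounted by its sampling-error term, and the sign of the curvature step is randomized. A hedged sketch in the shape of Algorithm 2 (names are ours; the sub-sampled quantities are passed in precomputed):

```python
import numpy as np

def ada_ncd_mb(x, g, vHv, v, L1, L2, eps2, eps_g, rng):
    """Mini-batch AdaNCD step (the shape of Algorithm 2).  g is a
    mini-batch gradient with error at most eps_g, and vHv = v^T H_S(x) v
    for the sub-sampled Hessian H_S; each decrease estimate is discounted
    by its sampling-error term before the two steps compete."""
    curv_gain = (2.0 * max(-vHv, 0.0) ** 3 / (3.0 * L2 ** 2)
                 - eps2 * vHv ** 2 / (6.0 * L2 ** 2))
    grad_gain = g @ g / (4.0 * L1) - eps_g ** 2 / L1
    if curv_gain > grad_gain:
        # Rademacher sign: under sampling noise the useful sign of v is
        # unknown, so it is randomized and the decrease holds in expectation
        z = 1.0 if rng.random() < 0.5 else -1.0
        return x - (2.0 * abs(vHv) / L2) * z * v
    return x - g / L1

rng = np.random.default_rng(0)
x = np.zeros(2)
out = ada_ncd_mb(x, g=np.array([0.01, 0.0]), vHv=-1.0, v=np.array([1.0, 0.0]),
                 L1=2.0, L2=1.0, eps2=0.1, eps_g=0.05, rng=rng)
```

With a strongly negative vHv and a tiny noisy gradient, the curvature branch wins and the step has magnitude 2|vHv|/L2 along ±v, whichever sign the Rademacher draw picks.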
This is the variant using a sub-sampled Hessian. We can also use a different variant of Algorithm 2, AdaNCDonline, which uses an online algorithm to compute the negative curvature and is described in Algorithm 7 (in the supplement), with Lemma 7 (in the supplement) as its theoretical guarantee.

5 Simple Adaptive Algorithms with Second-order Convergence

In this section, we present simple deterministic and stochastic algorithms by employing the AdaNCD presented in the last section. These simple algorithms deserve attention for several reasons: (i) they are simpler than many previous algorithms but can enjoy a similar time complexity when ε2 = ε1; (ii) they guarantee that the objective value decreases at every iteration, which does not hold for some complicated algorithms with state-of-the-art complexity results (e.g., [4, 2]).

5.1 Deterministic Objective

We present a deterministic algorithm for a deterministic objective in Algorithm 3, which is referred to as AdaNCG (where NCG represents Negative Curvature and Gradient, and Ada represents the adaptive nature of the NCD component).

Algorithm 3 AdaNCG: (x0, ε1, α, δ)
1: x1 = x0, ε2 = ε1^α
2: δ′ = δ/(1 + max(12L2²/ε2³, 2L1/ε1²)Δ)
3: for j = 1, 2, . . . , do
4:   (xj+1, vj) = AdaNCDdet(xj, α, δ′, ∇f(xj))
5:   if vj⊤∇²f(xj)vj > −ε2/2 and ‖∇f(xj)‖ ≤ ε1 then
6:     Return xj
7:   end if
8: end for

Theorem 1. For any α ∈ (0, 1], the AdaNCG algorithm terminates at iteration j* for some

j* ≤ 1 + max(12L2²/ε1^{3α}, 2L1/ε1²)(f(x1) − f(xj*)) ≤ 1 + max(12L2²/ε1^{3α}, 2L1/ε1²)Δ,

with ‖∇f(xj*)‖ ≤ ε1 and, with probability at least 1 − δ, λmin(∇²f(xj*)) ≥ −ε1^α. Furthermore, the j-th iteration requires a time complexity of Tn(f, max(ε1^α, ‖∇f(xj)‖^α), δ′, d).

Remark: First, when ε2 = ε1 = ε, the iteration complexity of AdaNCG for achieving a point with max{‖∇f(x)‖, −λmin(∇²f(x))} ≤ ε is O(1/ε³), which matches the results in previous works (e.g., [17, 5, 6, 18]). However, the number of Hessian-vector products in AdaNCG could be much smaller than that in these existing works. For example, the number of Hessian-vector products in [17, 18] is Õ(1/√ε2) at each iteration requiring the second-order information. In contrast, when employing the Lanczos method, the number of Hessian-vector products at each iteration of AdaNCG is Õ(1/√max(ε2, ‖∇f(xj)‖^α)), which could be much smaller than Õ(1/√ε2) depending on the magnitude of the gradient. Second, the worst-case time complexity of AdaNCG is given by

Õ( d max{ ε1^{-2} ε2^{-1/2}, ε2^{-7/2} } Δ ),   (6)

using the worst-case time complexity of each iteration, which is the same as the result of Theorem 2 in [18]. One might notice that if we plug ε1 = ε, ε2 = √ε into the worst-case time complexity of AdaNCG, we end up with Õ(d/ε^{9/4}), which is worse than the best time complexity Õ(d/ε^{7/4}) found in the literature (e.g., [1, 4]). In the next section, we will use AdaNCG as a sub-routine to develop an algorithm that can match the state-of-the-art time complexity and also enjoy the adaptiveness of AdaNCG.

Before ending this subsection, we would like to point out that an inexact Hessian satisfying (5) can be used for computing the negative curvature. For example, if the objective has a finite-sum form, AdaNCDdet can be replaced by AdaNCDmb using the full gradient. Lemma 3 provides a similar guarantee to Lemma 2 and can be used to derive a convergence result similar to Theorem 1.

5.2 Stochastic Objective

We present a stochastic algorithm based on the AdaNCDmb in Algorithm 4, which is referred to as S-AdaNCG (where S represents stochastic).

Algorithm 4 S-AdaNCG: (x0, ε1, α, δ)
1: x1 = x0, ε2 = ε1^α, δ′ = δ/Õ(max(ε1^{-2}, ε2^{-3}))
2: for j = 1, 2, . . . , do
3:   Generate two random sets S1, S2
4:   let g(xj) = (1/|S1|) Σ_{ξ∈S1} ∇f(xj; ξ) satisfy (4)
5:   (xj+1, vj) = AdaNCDmb(xj, α, δ′, S2, g(xj))
6:   if vj⊤HS2(xj)vj > −ε2/2 and ‖g(xj)‖ ≤ ε1 then
7:     Return xj
8:   end if
9: end for
A similar algorithm based on the AdaNCDonline, with a similar worst-case complexity, can be developed; it is omitted.

Theorem 2. Set |S1| = (32G²/ε1²)(1 + 3 log(2/δ′)) and |S2| = (9216L1²/ε2²) log(4d/δ′). With probability 1 − δ, the S-AdaNCG algorithm terminates at some iteration j* = Õ(max(1/ε1², 1/ε2³)), and upon termination it holds that ‖∇f(xj*)‖ ≤ 2ε1 and λmin(∇²f(xj*)) ≥ −2ε2 with probability 1 − 3δ. Furthermore, the worst-case time complexity of S-AdaNCG is given by Õ( max(1/ε1², 1/ε2³)( d/ε1² + Tn(fS2, ε2, δ′, d) ) ).

Remark: We can analyze the worst-case time complexity of S-AdaNCG by using randomized algorithms as in Lemma 1 to compute the negative curvature with Tn(f, ε, δ, d) = Õ(d/ε²). Let us consider ε2 = ε1^{1/2}; it is not difficult to show that the worst-case time complexity of S-AdaNCG is then Õ(d/ε1⁴), which matches the time complexity of stochastic gradient descent for finding a first-order stationary point. It is almost linear in the problem's dimensionality, which is better than that of noisy SGD methods [9, 20]. It is also notable that the worst-case time complexity of S-AdaNCG is worse than that of a recent algorithm called Natasha2 proposed in [2], which has a state-of-the-art time complexity of Õ(d/ε1^{3.5}) for finding an (ε1, √ε1) second-order stationary point. However, S-AdaNCG is much simpler than Natasha2, which involves many parameters and switches between several procedures. 
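The batch size |S1| in Theorem 2 can be read as a recipe: average enough stochastic gradients that the noise level ε′ in (4) is met with probability 1 − δ′. A small sketch with our own naming, taking the constant directly from the theorem:

```python
import numpy as np

def minibatch_gradient(grad_one, sample, x, eps, G, delta, rng):
    """Average m = ceil((32 G^2 / eps^2)(1 + 3 log(2/delta))) stochastic
    gradients, the |S1| prescribed in Theorem 2, so that
    ||g(x) - grad f(x)|| <= eps holds with probability >= 1 - delta
    under the exponential-tail assumption on the gradient noise."""
    m = int(np.ceil((32.0 * G ** 2 / eps ** 2) * (1.0 + 3.0 * np.log(2.0 / delta))))
    g = sum(grad_one(x, sample(rng)) for _ in range(m)) / m
    return g, m

# toy model: grad f(x; xi) = x + xi with E[xi] = 0, so grad f(x) = x
rng = np.random.default_rng(0)
sample = lambda r: 0.1 * r.standard_normal(2)
grad_one = lambda x, xi: x + xi
x0 = np.array([1.0, -1.0])
g, m = minibatch_gradient(grad_one, sample, x0, eps=0.5, G=0.5, delta=0.01, rng=rng)
```

For eps = G the batch is a few hundred samples; note the 1/eps² growth, which is why (4) is requested only at the accuracy actually needed.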
In the next section, we will present an improved algorithm of S-AdaNCG, whose worst-case time complexity matches Õ(d/ε1^{3.5}) for finding an (ε1, √ε1) second-order stationary point.

6 Adaptive Algorithms with State-of-the-Art Complexities

In this section, we demonstrate the applications of the presented AdaNCD for deterministic and stochastic optimization with a state-of-the-art time complexity, aiming for better practical performance than their counterparts in the literature. We will show how the proposed AdaNCD can reduce the time complexity of these existing algorithms.
, do\n3:\n4:\n5:\n6:\n7:\n8:\nend if\n9:\n10: end for\n\nGenerate three random sets S,S1,S2\nyj = SCSG-Epoch(xj,S, b)\nlet g(yj) = \u2207fS1 (x; \u03be) satisfy (4)\n(xj+1, vj) = AdaNCDmb(yj, \u03b1, \u03b4,S2, g(yj))\nif v(cid:62)\n\nj HS2 (yj)vj > \u2212\u00012/2 and (cid:107)g(yj)(cid:107) \u2264 \u00011 then\nReturn yj\n\n2 , 3\u00012, 5L1)\n\nd\n\n(cid:17)\n\n(cid:19)\n\nthe Al-\n\n+ 1\n\u00011\u00012\n\n(cid:20)(cid:18)\n\n(cid:16) 1\n\nlet \u00012 = \u0001\u03b1\n\nAdaNCD steps in AdaNCG and (cid:101)O\n\n(cid:19)\n1 . With probability at least 1 \u2212 \u03b4,\n\n6.1 Deterministic Objective\nFor deterministic objective, we consider the accelerated method proposed in [4], which relies on\nNCD to \ufb01nd a point around which the objective function is almost convex and then switches to an\naccelerated gradient method. We present an adaptive variant of [4]\u2019s method in Algorithm 5, where\nwe use our AdaNCG in place of NCD. The procedure Almost-Convex-AGD is the same as in [4].\nFor completeness, we present it in the supplement. The convergence guarantee is presented below.\nTheorem 3. For any \u03b1 \u2208 (0, 1],\n\ngorithm AdaNCG+ returns a vector (cid:98)xk such that (cid:107)\u2207f ((cid:98)xk)(cid:107) \u2264 \u00011 and \u03bbmin(\u22072f ((cid:98)xk)) \u2265\n(cid:21)\n(cid:18)(cid:18)\n\n\u2212\u00012 with at most O\n+ \u00011/2\n2\n\u00012\n1\ngradient steps in Almost-Convex-AGD, and each step j within AdaNCG+ requires time of\nTn(f, max(\u00012,(cid:107)\u2207f (xj)(cid:107)2/3)1/2, \u03b4(cid:48), d), and the worse-case time complexity of AdaNCG+ is\n\n(cid:19)\n\u00011, the worst-case time complexity of AdaNCG+ is (cid:101)O\n\n(cid:101)O\n(cid:19)\nRemark: First, when \u00012 \u2264 \u221a\n\u00011 it reduces to (cid:101)O(d/\u00017/4), which matches the best time complexity in\nSpecially, for \u00012 =\nsame guarantee as the NCD in [4] (see Corollary 1 in the supplement), i.e., returning a solution(cid:98)xj\nprevious studies. 
Second, we note that the subroutine AdaNCG(xj, \u00013\u03b1/2\nsatisfying \u03bbmin(\u22072f ((cid:98)xj)) \u2265 \u2212\u00012 with high probability. The number of iterations within AdaNCG\nIn particular, the number of Hessian-vector products of each NCD step in [4] is (cid:101)O(1/\nbecomes (cid:101)O(1/(cid:112)max(\u00012,(cid:107)\u2207f (xj)(cid:107)2/3)) for each AdaNCD step in AdaNCG+. Finally, we note that\n\nis similar to that in NCD employed by [4], and the number of iterations within Almost-Convex-\nAGD is similar to that in [4]. The improvement of AdaNCG+ over [4]\u2019s algorithm is brought\nby reducing the number of Hessian-vector products for performing each iteration of AdaNCG.\n\u00012), which\n\nwhen using the Lanczos method for NCS.\n\n, 2/3, \u03b4(cid:48)) provides the\n\n+ d\u00011/2\n2\n\u00012\n1\n\n+ d\n\u00017/2\n2\n\n+ d\n\u00017/2\n2\n\n+ 1\n\n(cid:18)\n\n\u00011\u00013/2\n\n2\n\n\u00011\u00013/2\n\n2\n\n1\n\n\u00017/2\n2\n\n\u00011\u00013/2\n\n2\n\nd\n\n.\n\n\u00013\n2\n\n\u221a\n\nAdaNCG+ has the same worse-case time complexity as AdaNCG for \u00012 \u2208 [\u00011, \u00012/3\nover AdaNCG+ for \u00012 \u2208 [\u00012/3\n\n, \u00011/2\n\n].\n\n1\n\n1\n\n1\n\n], but improves\n\n1\n\n\u221a\n\n7\n\n\f6.2 Stochastic Objective\nNext, we present a stochastic algorithm for tackling a stochastic objective f (x) = E[f (x; \u03be)] in order\nto achieve a state-of-the-art worse-case complexity for \ufb01nding a second-order stationary point. We\nconsider combining the proposed AdaNCDmb with an existing stochastic variance reduced gradient\nmethod for a stochastic objective, namely SCSG [12].\nThe detailed steps are shown in Algorithm 6, which is referred to as AdaNCD-SCSG and can be\nconsidered as an improvement of Algorithm 4. The sub-routine SCSG-Epoch is one epoch of SCSG,\nwhich is included in the supplement. 
It is worth mentioning that Algorithm 6 is based on the design of [19], which also combined an NCD step with SCSG to prove second-order convergence. The difference from [19] is that they studied how to use a first-order method, without resorting to Hessian-vector products, to extract the negative curvature direction, while we focus on reducing the time complexity of NCS using the proposed adaptive NCD. Our result below shows that AdaNCD-SCSG has a worst-case time complexity that matches the state-of-the-art time complexity for finding an (ε1, √ε1) second-order stationary point.

Theorem 4. For any α ∈ (0, 1], let ε2 = ε1^α. Suppose |S| = Õ(max(1/ε1^2, 1/(ε2^{9/2} b^{1/2}))), |S1| = Õ(1/ε1^2) and |S2| = Õ(1/ε2^2). With high probability, the Algorithm AdaNCD-SCSG returns a vector yj such that ‖∇f(yj)‖ ≤ 2ε1 and λmin(∇²f(yj)) ≥ −2ε2 with at most Õ(b^{1/3}/ε1^{4/3} + 1/ε2^3) calls of SCSG-Epoch and AdaNCDmb.

Remark: The worst-case time complexity of AdaNCD-SCSG can be computed as Õ((b^{1/3}/ε1^{4/3} + 1/ε2^3)(|S|d + |S1|d + Tn(fS2, ε2, δ′, d))). If we consider using randomized algorithms as in Lemma 6 in the supplement to implement NCS in AdaNCDmb, the above time complexity reduces to Õ((b^{1/3}/ε1^{4/3} + 1/ε2^3)(1/ε1^2 + 1/(ε2^{9/2} b^{1/2}) + 1/ε2^2) d). Let us consider ε2 = ε1^{1/2}. By setting b = 1/ε1^{1/2}, the worst-case time complexity of AdaNCD-SCSG is Õ(d/ε1^{3.5}).

7 Empirical Studies
In this section, we report some experimental results to justify the effectiveness of AdaNCD for both deterministic and stochastic non-convex optimization. We consider three problems, namely, cubic regularization, regularized non-linear least-squares, and a one-hidden-layer neural network (NN) problem.

The cubic regularization problem is: min_w (1/2) w⊤Aw + b⊤w + (ρ/3)‖w‖2^3, where A ∈ R^{1000×1000}. For deterministic optimization, we generate a diagonal A such that 100 randomly selected diagonal entries are −1 and the rest of the diagonal entries follow the uniform distribution on [1, 2], and we set b to be a zero vector. For stochastic optimization, we let A = A′ + E[diag(ξ)] and b = E[ξ′], where A′ is generated similarly, ξ are uniform random variables from [−0.1, 0.1], and ξ′ are uniform random variables from [−1, 1]. The parameter ρ is set to 0.5 for both the deterministic and stochastic experiments. It is clear that zero is a saddle point of the problem. In order to test the capability of escaping from saddle points, we let each algorithm start from the zero vector.

The regularized non-linear least-squares problem is: min_w (1/n) Σ_{i=1}^n (yi − σ(w⊤xi))^2 + Σ_{i=1}^d λwi^2/(1 + αwi^2), where xi ∈ R^d, yi ∈ {0, 1}, σ(s) = 1/(1 + exp(−s)) is the sigmoid function, and the second term is a non-convex regularizer [15] that increases the negative curvature of the problem.
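To make the two test objectives concrete, they can be written down directly. The sketch below uses a shrunken dimension (d = 100 instead of 1000) and treats the regularizer weight alpha as a placeholder value; it also verifies that zero is indeed a saddle point of the cubic problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Cubic regularization: f(w) = 0.5 w^T A w + b^T w + (rho/3) ||w||^3.
d, rho = 100, 0.5
diag = rng.uniform(1.0, 2.0, size=d)
diag[rng.choice(d, size=10, replace=False)] = -1.0  # directions of negative curvature
A, b = np.diag(diag), np.zeros(d)

def f_cubic(w):
    return 0.5 * w @ A @ w + b @ w + (rho / 3.0) * np.linalg.norm(w) ** 3

def grad_cubic(w):
    return A @ w + b + rho * np.linalg.norm(w) * w

# Zero is a saddle point: the gradient vanishes, but the Hessian at 0 (= A)
# has negative eigen-values, so an algorithm must escape via negative curvature.
assert np.linalg.norm(grad_cubic(np.zeros(d))) == 0.0
assert np.min(np.diag(A)) < 0.0

# Regularized non-linear least squares with the non-convex regularizer [15];
# lam = 1 as in the text, alpha = 1.0 is a placeholder.
def f_nlls(w, X, y, lam=1.0, alpha=1.0):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # sigmoid predictions
    return np.mean((y - p) ** 2) + np.sum(lam * w ** 2 / (1.0 + alpha * w ** 2))
```

Each regularizer term λw_i²/(1 + αw_i²) is bounded and concave for large |w_i|, which is what injects negative curvature into the otherwise benign least-squares loss.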
We use the w1a data (n = 2477, d = 300) from the libsvm website [8], and set λ = 1.

Learning a NN with one hidden layer is posed as: min (1/n) Σ_{i=1}^n ℓ(W2 σ(W1 xi + b1) + b2, yi), where xi ∈ R^d, yi ∈ {1, −1} are input data, W1, W2, b1, b2 are the parameters of the NN with appropriate dimensions, and ℓ(z, y) is the cross-entropy loss. We use 12,665 examples from the MNIST dataset [11] that belong to the two categories 0 and 1 as input data, where the input feature dimensionality is 784. The number of neurons in the hidden layer is set to 10, so that the total number of parameters including bias terms is 7872.

Figure 1: Comparison of different deterministic algorithms (upper) and stochastic algorithms (lower) for solving cubic regularization, regularized nonlinear least-squares, and the neural network (from left to right).

For the deterministic experiments, we compare AdaNCG and AdaNCG+ with their non-adaptive counterparts. In particular, the non-adaptive counterpart of AdaNCG, named NCG, uses NCS(f, x, ε2/2, δ). The non-adaptive counterpart of AdaNCG+ is the algorithm proposed in [4], which is referred to as NCD-AG. For the stochastic experiments, we compare S-AdaNCG, AdaNCD-SCSG, and the non-adaptive version of AdaNCD-SCSG, named NCD-SCSG. For all experiments, we choose α = 1/2, i.e., ε2 = √ε1. The parameters L1 and L2 are tuned for the non-adaptive algorithm NCG, and the same values are used in the other algorithms. The searching range for L1 and L2 is 10^{−5:1:5}. The mini-batch size used in S-AdaNCG and AdaNCD-SCSG is set to 50 for cubic regularization and 128 for the other two tasks. We use the Lanczos method for NCS.
For the non-adaptive algorithms, the number of iterations in each call of the Lanczos method is set to min(C log(d)/√ε2, d); for the adaptive algorithms, the number of iterations in each call of the Lanczos method is set to min(C log(d)/√max(ε2, ‖g(x)‖^{1/2}), d), where g(x) is either a full gradient or a mini-batch stochastic gradient. The value of C is set to √L1. We set ε1 = 10^{−2} for cubic regularization, and ε1 = 10^{−4} for the other two tasks. We report the objective value versus the number of oracle calls (including gradient evaluations and Hessian-vector products) in Figure 1. From the deterministic optimization results, we can see that AdaNCD can greatly improve the convergence of AdaNCG and AdaNCG+ compared to their non-adaptive counterparts. In addition, AdaNCG performs better than AdaNCG+ on the tested tasks. The reason is that AdaNCG can guarantee a decrease of the objective value at every iteration, while AdaNCG+, which uses the AG method to optimize an almost convex function, does not have such a guarantee. From the stochastic optimization results, AdaNCD also makes AdaNCD-SCSG converge faster than its non-adaptive counterpart NCD-SCSG. In addition, AdaNCD-SCSG is faster than S-AdaNCG. Finally, we note that the final solutions found by the proposed algorithms satisfy the prescribed optimality condition. For example, at the solution found by AdaNCG on the cubic regularization problem, the gradient norm is 0.0085 and the minimum eigen-value of the Hessian is −0.0043.

8 Conclusion
In this paper, we have developed several variants of an adaptive negative curvature descent step that employ a noisy negative curvature direction for non-convex optimization.
The novelty of the proposed algorithms lies in the fact that the noise level in approximating the negative curvature is adaptive to the magnitude of the current gradient instead of being a prescribed small noise level, which can dramatically reduce the number of Hessian-vector products. Building on the adaptive negative curvature descent step, we have developed several deterministic and stochastic algorithms and established their complexities. The effectiveness of adaptive negative curvature descent is also demonstrated by empirical studies.

Acknowledgments
We thank the anonymous reviewers for their helpful comments. M. Liu and T. Yang are partially supported by the National Science Foundation (IIS-1545995).

References
[1] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 1195–1199, 2017.

[2] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. CoRR, abs/1708.08694, 2017.

[3] Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. CoRR, abs/1711.06673, 2017.

[4] Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex optimization.
CoRR, abs/1611.00756, 2016.

[5] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.

[6] Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319, 2011.

[7] Frank E. Curtis and Daniel P. Robinson. Exploiting negative curvature in deterministic and stochastic optimization. CoRR, abs/1703.00412, 2017.

[8] Rong-En Fan and Chih-Jen Lin. LIBSVM data: Classification, regression and multi-label. URL: http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets, 2011.

[9] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points — online stochastic gradient for tensor decomposition. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory (COLT), volume 40, pages 797–842. PMLR, 2015.

[10] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. J. ACM, 60(6):45:1–45:39, 2013.

[11] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[12] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I. Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345–2355, 2017.

[13] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, Wiley, 1983.

[14] Yurii Nesterov and Boris T. Polyak.
Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

[15] Sashank J. Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for smooth nonconvex optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1971–1977. IEEE, 2016.

[16] Sashank J. Reddi, Manzil Zaheer, Suvrit Sra, Barnabás Póczos, Francis R. Bach, Ruslan Salakhutdinov, and Alexander J. Smola. A generic approach for escaping saddle points. CoRR, abs/1709.01434, 2017.

[17] Clément W. Royer and Stephen J. Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. CoRR, abs/1706.03131, 2017.

[18] Peng Xu, Farbod Roosta-Khorasani, and Michael W. Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. CoRR, abs/1708.07164, 2017.

[19] Yi Xu, Rong Jin, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. CoRR, abs/1711.01944, 2017.

[20] Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevin dynamics. In Proceedings of the 30th Conference on Learning Theory (COLT), pages 1980–2022, 2017.