{"title": "A Stochastic Composite Gradient Method with Incremental Variance Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 9078, "page_last": 9088, "abstract": "We consider the problem of minimizing the composition of a smooth (nonconvex) function and a smooth vector mapping, where the inner mapping is in the form of an expectation over some random variable or a finite sum. We propose a stochastic composite gradient method that employs incremental variance-reduced estimators for both the inner vector mapping and its Jacobian. We show that this method achieves the same orders of complexity as the best known first-order methods for minimizing expected-value and finite-sum nonconvex functions, despite the additional outer composition which renders the composite gradient estimator biased. This finding enables a much broader range of applications in machine learning to benefit from the low complexity of incremental variance-reduction methods.", "full_text":

A Stochastic Composite Gradient Method with Incremental Variance Reduction

Junyu Zhang, University of Minnesota, Minneapolis, Minnesota 55455, zhan4393@umn.edu
Lin Xiao, Microsoft Research, Redmond, Washington 98052, lin.xiao@microsoft.com

Abstract

We consider the problem of minimizing the composition of a smooth (nonconvex) function and a smooth vector mapping, where the inner mapping is in the form of an expectation over some random variable or a finite sum. We propose a stochastic composite gradient method that employs incremental variance-reduced estimators for both the inner vector mapping and its Jacobian.
We show that this method achieves the same orders of complexity as the best known first-order methods for minimizing expected-value and finite-sum nonconvex functions, despite the additional outer composition which renders the composite gradient estimator biased. This finding enables a much broader range of applications in machine learning to benefit from the low complexity of incremental variance-reduction methods.

1 Introduction

In this paper, we consider stochastic composite optimization problems

    minimize_{x ∈ R^d}  f( E_ξ[g_ξ(x)] ) + r(x),                                         (1)

where f : R^p → R is a smooth and possibly nonconvex function, ξ is a random variable, g_ξ : R^d → R^p is a smooth vector mapping for a.e. ξ, and r is convex and lower-semicontinuous. A special case we will consider separately is when ξ is a discrete random variable with uniform distribution over {1, 2, . . . , n}. In this case the problem is equivalent to a deterministic optimization problem

    minimize_{x ∈ R^d}  f( (1/n) Σ_{i=1}^n g_i(x) ) + r(x).                              (2)

The formulations in (1) and (2) cover a broader range of applications than classical stochastic optimization and empirical risk minimization (ERM) problems, where each g_ξ is a scalar function (p = 1) and f is the scalar identity map. Interesting examples include policy evaluation in reinforcement learning (RL) [e.g., 30], risk-averse mean-variance optimization ([e.g., 28, 29], through a reformulation by [35]), the stochastic variational inequality ([e.g., 12, 15], through a reformulation in [10]), the two-level composite risk minimization problem [7], etc.

For ease of notation, we define

    g(x) := E_ξ[g_ξ(x)],    F(x) := f(g(x)),    Φ(x) := F(x) + r(x).                     (3)

In addition, let f' and F' denote the gradients of f and F respectively, and let g'_ξ(x) ∈ R^{p×d} denote the Jacobian matrix of g_ξ at x.
Then we have

    F'(x) = ∇( f( E_ξ[g_ξ(x)] ) ) = ( E_ξ[g'_ξ(x)] )^T f'( E_ξ[g_ξ(x)] ).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: Sample complexities of CIVR (Composite Incremental Variance Reduction).
Assumptions (common: f and g_ξ Lipschitz and smooth, thus F smooth):

    Problem | F nonconvex, r convex            | F ν-gradient dominant, r ≡ 0   | Φ μ-optimally strongly convex, r convex
    (1)     | O(ε^{-3/2})                      | O((ν ε^{-1}) log ε^{-1})       | O((μ^{-1} ε^{-1}) log ε^{-1})
    (2)     | O(min{ε^{-3/2}, n^{1/2} ε^{-1}}) | O((n + ν n^{1/2}) log ε^{-1})  | O((n + μ^{-1} n^{1/2}) log ε^{-1})

In practice, computing F'(x) exactly can be very costly if not impossible. Due to the nonlinearity of the outer composition, simply multiplying the unbiased estimators E[g̃(x)] = g(x) and E[g̃'(x)] = g'(x) results in a biased estimator of F'(x): namely, E[ (g̃'(x))^T f'(g̃(x)) ] ≠ F'(x); see [e.g., 35]. This is in great contrast to the classical stochastic optimization problem

    minimize_{x ∈ R^d}  E_ξ[g_ξ(x)] + r(x),                                              (4)

where one can always obtain an unbiased gradient estimator for the smooth part. This fact makes such composition structures of independent interest for research on stochastic and randomized algorithms. In this paper, we develop an efficient stochastic composite gradient method called CIVR (Composite Incremental Variance Reduction) for solving problems of the forms (1) and (2).
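To see this bias concretely, consider a hypothetical one-dimensional example (ours, not from the paper): take f(y) = y³ and g_ξ(x) = x + ξ with ξ uniform on {−1, +1}, so that g(x) = x, F(x) = x³ and F'(x) = 3x². Even averaging the plug-in estimator exactly over ξ does not recover F'(x):

```python
def f_prime(y):
    # derivative of the outer function f(y) = y^3
    return 3.0 * y ** 2

x = 2.0
xis = [-1.0, 1.0]                  # xi uniform on {-1, +1}, so E[xi] = 0
true_grad = 3.0 * x ** 2           # F'(x) = (g'(x))^T f'(g(x)) = 12
# Plug-in estimator g'_xi(x)^T f'(g_xi(x)) = 1 * 3(x + xi)^2,
# averaged exactly over the two equally likely values of xi:
est = sum(1.0 * f_prime(x + xi) for xi in xis) / len(xis)
bias = est - true_grad             # equals 3 * Var(xi) = 3, not zero

assert abs(true_grad - 12.0) < 1e-12
assert abs(est - 15.0) < 1e-12
assert abs(bias - 3.0) < 1e-12
```

The nonlinearity of f' turns the zero-mean noise in g_ξ into a systematic offset, which is exactly why the analysis must control bias and variance together.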
We measure efficiency by the sample complexity of the individual functions g_ξ and their Jacobians g'_ξ, i.e., the total number of times they need to be evaluated at some point, in order to find an ε-approximate solution. For nonconvex functions, an ε-approximate solution is some random output x̄ ∈ R^d of the algorithm that satisfies E[||G(x̄)||²] ≤ ε, where G(x̄) is the proximal gradient mapping of the objective function Φ at x̄ (see details in Section 2). If r ≡ 0, then G(x̄) = F'(x̄) and the criterion for ε-approximation becomes E[||F'(x̄)||²] ≤ ε. If the objective Φ is convex, we require E[Φ(x̄) − Φ*] ≤ ε, where Φ* = inf_x Φ(x). For smooth and convex functions, these two notions are compatible, meaning that the dependence of the sample complexity on ε is of the same order under either notion.

Table 1 summarizes the sample complexities of the CIVR method under different assumptions obtained in this paper. If we define a condition number κ = O(ν) for ν-gradient dominant functions and κ = O(1/μ) for μ-optimally strongly convex functions, then the complexities become O((κ ε^{-1}) log ε^{-1}) and O((n + κ n^{1/2}) log ε^{-1}) for (1) and (2) respectively. In order to better position our contributions, we next discuss related work and then put these results into context.

1.1 Related Work

We first discuss the nonconvex stochastic optimization problem (4), which is a special case of (1). When r ≡ 0 and g(x) = E_ξ[g_ξ(x)] is smooth, Ghadimi and Lan [9] developed a randomized stochastic gradient method with iteration complexity O(ε^{-2}). Allen-Zhu [2] obtained O(ε^{-1.625}) with an additional second-order guarantee. There are also many recent works on solving its finite-sum version

    minimize_{x ∈ R^d}  (1/n) Σ_{i=1}^n g_i(x) + r(x),                                   (5)

which is also a special case of (2). By extending the variance reduction techniques SVRG [13, 34] and SAGA [6] to nonconvex optimization, Allen-Zhu and Hazan [3] and Reddi et al. [24, 25, 26] developed randomized algorithms with sample complexity O(n + n^{2/3} ε^{-1}). Under additional assumptions of gradient dominance or strong convexity, they obtained sample complexity O((n + κ n^{2/3}) log ε^{-1}), where κ is a suitable condition number. Allen-Zhu [1] and Lei et al. [17] obtained O(min{ε^{-5/3}, n^{2/3} ε^{-1}}). Based on a new variance reduction technique called SARAH [21], Nguyen et al. [22] and Pham et al. [23] developed nonconvex extensions that obtain sample complexities O(ε^{-3/2}) and O(n + n^{1/2} ε^{-1}) for the expectation and finite-sum cases respectively. Fang et al. [8] introduced another variance reduction technique called Spider, which can be viewed as a more general variant of SARAH. They obtained sample complexities O(ε^{-3/2}) and O(min{ε^{-3/2}, n^{1/2} ε^{-1}}) for the two cases respectively, but require small step sizes that are proportional to ε. Wang et al. [33] extended Spider to obtain the same complexities with constant step sizes, and O((n + κ²) log ε^{-1}) under the gradient-dominant condition. In addition, Zhou et al. [36] obtained similar results using a nested SVRG approach.
In addition to the above works on solving special cases of (1) and (2), there is also considerable recent work on a more general, two-layer stochastic composite optimization problem

    minimize_{x ∈ R^d}  E_ν[ f_ν( E_ξ[g_ξ(x)] ) ] + r(x),                                (6)

where f_ν is parametrized by another random variable ν, which is independent of ξ. For the case r ≡ 0, Wang et al. [31] derived algorithms that find an ε-approximate solution with sample complexities O(ε^{-4}), O(ε^{-3.5}) and O(ε^{-1.25}) for the smooth nonconvex case, the smooth convex case and the smooth strongly convex case respectively. For nontrivial convex r, Wang et al. [32] obtained improved sample complexities of O(ε^{-2.25}), O(ε^{-2}) and O(ε^{-1}) for the three cases mentioned above respectively.

As a special case of (6), the following finite-sum problem has also received significant attention:

    minimize_{x ∈ R^d}  (1/m) Σ_{j=1}^m f_j( (1/n) Σ_{i=1}^n g_i(x) ) + r(x).            (7)

When r ≡ 0 and the overall objective function is strongly convex, Lian et al. [19] derived two algorithms based on the SVRG scheme that attain sample complexities O((m + n + κ³) log ε^{-1}) and O((m + n + κ⁴) log ε^{-1}) respectively, where κ is some suitably defined condition number. Huo et al. [11] also used the SVRG scheme to obtain an O(m + n + (m + n)^{2/3} ε^{-1}) complexity for the smooth nonconvex case and O((m + n + κ³) log ε^{-1}) for strongly convex problems with nonsmooth r. More recently, Zhang and Xiao [35] proposed a composite randomized incremental gradient method based on the SAGA estimator [6], which matches the best known O(m + n + (m + n)^{2/3} ε^{-1}) complexity when F is smooth and nonconvex, and obtained an improved complexity O((m + n + κ(m + n)^{2/3}) log ε^{-1}) under either gradient-dominant or strongly convex assumptions. When applied to the special cases (1) and (2) we focus on in this paper (m = 1), these results are strictly worse than ours in Table 1.

1.2 Contributions and Outline

We develop the CIVR method by extending the variance reduction technique of SARAH [21-23] and Spider [8, 33] to solve the composite optimization problems (1) and (2). The complexities of CIVR in Table 1 match the best results for solving the non-composite problems (4) and (5), despite the additional outer composition and the composite gradient estimator always being biased. In addition:

• By setting f and the g_ξ's to be the identity map and scalar mappings respectively, problem (2) includes problem (5) as a special case. Therefore, the lower bounds in [8] for the non-composite finite-sum optimization problem (5) indicate that our O(min{ε^{-3/2}, n^{1/2} ε^{-1}}) complexity for solving the more general composite finite-sum problem (2) is near-optimal.

• Under the assumptions of gradient dominance or strong convexity, the O((n + κ n^{1/2}) log ε^{-1}) complexity previously appeared only for the special case (5), in the recent work [18].

Our results indicate that the additional smooth composition in (1) and (2) does not incur higher complexity compared with (4) and (5), despite the difficulty of dealing with biased estimators. We believe these results can also be extended to the two-layer problems (6) and (7), by replacing n with m + n in Table 1. But the extensions require quite different techniques and we will address them in a separate paper.

The rest of this paper is organized as follows. In Section 2, we introduce the CIVR method.
In Section 3, we present convergence results of CIVR for solving the composite optimization problems (1) and (2), together with the required parameter settings. Better complexities of CIVR under the gradient-dominant and optimally strongly convex conditions are given in Section 4. In Section 5, we present numerical experiments for solving a risk-averse portfolio optimization problem on real-world datasets.

2 The composite incremental variance reduction (CIVR) method

With the notations in (3), we can write the composite stochastic optimization problem (1) as

    minimize_{x ∈ R^d}  { Φ(x) = F(x) + r(x) },                                          (11)

where F is smooth and r is convex. The proximal operator of r with parameter η is defined as

    prox_r^η(x) := argmin_y { r(y) + (1/(2η)) ||y − x||² }.                              (12)

Algorithm 1: Composite Incremental Variance Reduction (CIVR)

input: initial point x^1_0, step size η > 0, number of epochs T ≥ 1, and a set of triples {τ_t, B_t, S_t} for t = 1, . . .
, T, where τ_t is the epoch length and B_t and S_t are sample sizes in epoch t.

for t = 1, ..., T do
    Sample a set B_t of size B_t from the distribution of ξ, and construct the estimates
        y^t_0 = (1/B_t) Σ_{ξ ∈ B_t} g_ξ(x^t_0),    z^t_0 = (1/B_t) Σ_{ξ ∈ B_t} g'_ξ(x^t_0).            (8)
    Compute ∇̃F(x^t_0) = (z^t_0)^T f'(y^t_0) and update x^t_1 = prox_r^η( x^t_0 − η ∇̃F(x^t_0) ).
    for i = 1, ..., τ_t − 1 do
        Sample a set S^t_i of size S_t from the distribution of ξ, and construct the estimates
            y^t_i = y^t_{i−1} + (1/S_t) Σ_{ξ ∈ S^t_i} ( g_ξ(x^t_i) − g_ξ(x^t_{i−1}) ),                 (9)
            z^t_i = z^t_{i−1} + (1/S_t) Σ_{ξ ∈ S^t_i} ( g'_ξ(x^t_i) − g'_ξ(x^t_{i−1}) ).               (10)
        Compute ∇̃F(x^t_i) = (z^t_i)^T f'(y^t_i) and update x^t_{i+1} = prox_r^η( x^t_i − η ∇̃F(x^t_i) ).
    end
    Set x^{t+1}_0 = x^t_{τ_t}.
end
output: x̄ randomly chosen from { x^t_i } for t = 1, ..., T and i = 0, ..., τ_t − 1.

We assume that r is relatively simple, meaning that its proximal operator has a closed-form solution or can be computed efficiently. The proximal gradient method [e.g., 20, 4] for solving problem (11) is

    x_{t+1} = prox_r^η( x_t − η F'(x_t) ),                                               (13)

where η is the step size.
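As a concrete instance of a "relatively simple" r, the soft ℓ1 regularizer r(x) = λ||x||_1 used in the experiments of Section 5 has a closed-form proximal operator, coordinate-wise soft-thresholding; a minimal sketch (ours, not from the paper):

```python
def prox_l1(x, eta, lam):
    # prox of r(x) = lam * ||x||_1 with parameter eta, i.e.,
    # argmin_y { lam*||y||_1 + ||y - x||^2 / (2*eta) }, solved
    # coordinate-wise by shrinking each entry toward zero by eta*lam
    t = eta * lam
    return [(abs(v) - t) * (1.0 if v > 0.0 else -1.0) if abs(v) > t else 0.0
            for v in x]

out = prox_l1([3.0, -0.5, 0.05], eta=1.0, lam=0.1)
assert abs(out[0] - 2.9) < 1e-12    # shrunk toward zero by 0.1
assert abs(out[1] + 0.4) < 1e-12    # sign is preserved
assert out[2] == 0.0                # entries below the threshold vanish
```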
The proximal gradient mapping of Φ is defined as

    G_η(x) := (1/η) ( x − prox_r^η( x − η F'(x) ) ).                                     (14)

As a result, the proximal gradient method (13) can be written as x_{t+1} = x_t − η G_η(x_t). Notice that when r ≡ 0, prox_r^η(·) becomes the identity mapping and we have G_η(x) ≡ F'(x) for any η > 0.

Suppose x̄ is generated by a randomized algorithm. We call x̄ an ε-stationary point in expectation if

    E[ ||G_η(x̄)||² ] ≤ ε.                                                               (15)

(We assume that η is a constant that does not depend on ε.) As mentioned in the introduction, we measure the efficiency of an algorithm by its sample complexity of the g_ξ and their Jacobians g'_ξ, i.e., the total number of times they need to be evaluated, in order to find a point x̄ that satisfies (15). Our goal is to develop a randomized algorithm with low sample complexity.

We present in Algorithm 1 the Composite Incremental Variance Reduction (CIVR) method. This method employs a two-time-scale variance-reduced estimator for both the inner function value g(·) = E_ξ[g_ξ(·)] and its Jacobian g'(·). At the beginning of each outer iteration t (each called an epoch), we construct relatively accurate estimates y^t_0 for g(x^t_0) and z^t_0 for g'(x^t_0) respectively, using a relatively large sample size B_t. During each inner iteration i of the t-th epoch, we construct estimates y^t_i for g(x^t_i) and z^t_i for g'(x^t_i) respectively, using a smaller sample size S_t and incremental corrections from the previous iterations. Note that the epoch length τ_t and the sample sizes B_t and S_t are all adjustable for each epoch t.
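As a rough illustration (our own toy code with illustrative names, not the authors' implementation), the following sketch runs Algorithm 1 on a one-dimensional finite-sum instance with r ≡ 0, so the proximal step reduces to a plain gradient step; the epoch-start estimates use the full batch, as in the finite-sum variant of Section 3.2:

```python
import random

def civr(x0, f_prime, g_i, gp_i, n, eta, T, tau, S, seed=0):
    # CIVR sketch for the finite-sum case with r = 0 and scalar variables
    # (p = d = 1): y tracks the inner value g(x), z tracks the Jacobian
    # g'(x); both get SARAH-style incremental corrections in each epoch.
    rng = random.Random(seed)
    x = x0
    for _ in range(T):
        y = sum(g_i(x, i) for i in range(n)) / n     # y_0^t = g(x_0^t)
        z = sum(gp_i(x, i) for i in range(n)) / n    # z_0^t = g'(x_0^t)
        x_prev, x = x, x - eta * z * f_prime(y)      # step with z^T f'(y)
        for _ in range(tau - 1):
            idx = [rng.randrange(n) for _ in range(S)]
            y += sum(g_i(x, i) - g_i(x_prev, i) for i in idx) / S
            z += sum(gp_i(x, i) - gp_i(x_prev, i) for i in idx) / S
            x_prev, x = x, x - eta * z * f_prime(y)
    return x

# Toy instance: f(y) = y^2 and g_i(x) = x - c_i, so F(x) = (x - mean(c))^2
# with minimizer mean(c) = 1.5.
c = [0.0, 1.0, 2.0, 3.0]
x_out = civr(0.0, lambda y: 2.0 * y,
             lambda x, i: x - c[i], lambda x, i: 1.0,
             n=4, eta=0.2, T=30, tau=4, S=2)
assert abs(x_out - 1.5) < 1e-2
```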
Therefore, besides setting a constant set of parameters, we can also adjust them gradually in order to obtain better theoretical properties and practical performance.

This variance-reduction technique was first proposed as part of SARAH [21], where it is called recursive variance reduction. It was also proposed in [8] in the form of a Stochastic Path-Integrated Differential EstimatoR (Spider). Here we simply call it incremental variance reduction. A distinct feature of this incremental estimator is that the inner-loop estimates y^t_i and z^t_i are biased, i.e.,

    E[ y^t_i | x^t_i ] = g(x^t_i) − g(x^t_{i−1}) + y^t_{i−1} ≠ g(x^t_i),
    E[ z^t_i | x^t_i ] = g'(x^t_i) − g'(x^t_{i−1}) + z^t_{i−1} ≠ g'(x^t_i).              (16)

This is in contrast to two other popular variance-reduction techniques, namely SVRG [13] and SAGA [6], whose gradient estimators are always unbiased. Note that unbiased estimators for g(x^t_i) and g'(x^t_i) are not essential here, because the composite estimator ∇̃F(x^t_i) = (z^t_i)^T f'(y^t_i) is always biased. Therefore our main task is to control the variance and the bias altogether for the proposed estimator.

3 Convergence Analysis

In this section, we present theoretical results on the convergence properties of CIVR (Algorithm 1) when the composite function F is smooth. More specifically, we make the following assumptions.

Assumption 1. The following conditions hold concerning problems (1) and (2):
• f : R^p → R is a C¹-smooth and ℓ_f-Lipschitz function, and its gradient f' is L_f-Lipschitz.
• Each g_ξ : R^d → R^p is a C¹-smooth and ℓ_g-Lipschitz vector mapping, and its Jacobian g'_ξ is L_g-Lipschitz. Consequently, g in (3) is ℓ_g-Lipschitz and its Jacobian g' is L_g-Lipschitz.
• r : R^d → R ∪ {∞} is a convex and lower-semicontinuous function.
• The overall objective function Φ is bounded below, i.e., Φ* = inf_x Φ(x) > −∞.

Assumption 2. For problem (1), we further assume that there exist constants σ_g and σ_{g'} such that

    E_ξ[ ||g_ξ(x) − g(x)||² ] ≤ σ_g²,    E_ξ[ ||g'_ξ(x) − g'(x)||² ] ≤ σ_{g'}².          (17)

As a result of Assumption 1, F(x) = f(g(x)) is smooth and F' is L_F-Lipschitz continuous with

    L_F = ℓ_g² L_f + ℓ_f L_g

(see proof in the supplementary materials). For convenience, we also define two constants

    G_0 := 2( ℓ_g⁴ L_f² + ℓ_f² L_g² )    and    σ_0² := 2( ℓ_g² L_f² σ_g² + ℓ_f² σ_{g'}² ).    (18)

It is important to notice that G_0 = O(L_F²); hence the step size used later is η = Θ(1/√G_0) = Θ(1/L_F). We are allowed to use this constant step size mainly due to the assumption that each g_ξ(·) is smooth, instead of the weaker assumption that E_ξ[g_ξ(x)] is smooth as in classical stochastic optimization. In the next two subsections, we present complexity analyses of CIVR for solving problems (1) and (2) respectively. Due to the space limitation, all proofs are provided in the supplementary materials.

3.1 The composite expectation case

The following results for solving problem (1) are presented with the notations defined in (3), (14) and (18).

Theorem 1. Suppose Assumptions 1 and 2 hold. Given any ε > 0, we set T = ⌈1/√ε⌉ and
, T .\n\n\u03c4t = \u03c4 = (cid:100)1/\u221a\n\nThen as long as \u03b7 \u2264\n\n0/\u0001(cid:101),\n2\n\n\u0001(cid:101), Bt = B = (cid:100)\u03c3\n\u221a\n\nE(cid:2)(cid:107)G\u03b7( \u00afx)(cid:107)2(cid:3) \u2264(cid:16)8(cid:0)\u03a6(x1\n\n+12G0\n\n4\n2\nL\nF\n\nfor\n\n\u0001(cid:101),\n\nSt = S = (cid:100)1/\u221a\n\u22121 + 6(cid:17) \u00b7 \u0001 = O(\u0001).\n0) \u2212 \u03a6\u2217(cid:1)\u03b7\n\n, the output \u00afx of Algorithm 1 satis\ufb01es\n\nLF +\n\nAs a result, the sample complexity of obtaining an \u0001-approximate solution is T B + 2T \u03c4S = O(cid:0)\u0001\u22123/2(cid:1).\n\n(19)\n\nNote that in the above scheme, the epoch lengths \u03c4t and all the batch sizes Bt and St are set to be\nconstant (depending on a pre-\ufb01xed \u0001) without regard of t. Intuitively, we do not need as many samples\nin the early stage of the algorithm as in the later stage. In addition, it will be useful in practice to have\na variant of the algorithm that can adaptively choose \u03c4t, Bt and St throughout the epochs without\ndependence on a pre-\ufb01xed precision. This is done in the following theorem.\n\n5\n\n\fTheorem 2. Suppose Assumptions 1 and 2 hold. We set \u03c4t = St = (cid:100)at + b(cid:101) and Bt = (cid:100)\u03c3\nwhere a > 0 and b \u2265 0. 
Then as long as η ≤ 4/( L_F + √(L_F² + 12 G_0) ), we have for any T ≥ 1,

    E[ ||G_η(x̄)||² ] ≤ ( 2 / (aT² + (a + 2b)T) ) · ( 8( Φ(x^1_0) − Φ* )/η + (6/a) ln( (aT + b)/(a + b) ) + 6/(a + b) ) = O( ln T / T² ).    (20)

As a result, obtaining an ε-approximate solution requires T = Õ(1/√ε) epochs and a total sample complexity of Õ(ε^{-3/2}), where the Õ(·) notation hides logarithmic factors.

3.2 The composite finite-sum case

In this section, we consider the composite finite-sum optimization problem (2). In this case, the random variable ξ has a uniform distribution over the finite index set {1, ..., n}. At the beginning of each epoch in Algorithm 1, we use the full sample B_t = {1, . . . , n} to compute y^t_0 and z^t_0. Therefore B_t = n for all t and equation (8) in Algorithm 1 becomes

    y^t_0 = g(x^t_0) = (1/n) Σ_{j=1}^n g_j(x^t_0),    z^t_0 = g'(x^t_0) = (1/n) Σ_{j=1}^n g'_j(x^t_0).    (21)

Also in this case, we no longer need Assumption 2.

Theorem 3. Suppose Assumption 1 holds. Let the parameters in Algorithm 1 be set as B_t = {1, . . . , n} and τ_t = S_t = ⌈√n⌉ for all t.
Then as long as η ≤ 4/( L_F + √(L_F² + 12 G_0) ), we have for any T ≥ 1,

    E[ ||G_η(x̄)||² ] ≤ 8( Φ(x^1_0) − Φ* ) / ( √n T η ) = O( 1/(√n T) ).                  (22)

As a result, obtaining an ε-approximate solution requires T = O(1/(√n ε)) epochs and a total sample complexity of T B + 2TτS = O( n + √n ε^{-1} ).

Similar to the previous section, we can also choose the epoch lengths and sample sizes adaptively to save sampling cost in the early stage of the algorithm. However, due to the finite-sum structure of the problem, once the batch size B_t reaches n, we start to take the full batch at the beginning of each epoch to get the exact g(x^t_0) and g'(x^t_0). This leads to the following theorem.

Theorem 4. Suppose Assumption 1 holds. For some constants a > 0 and 0 ≤ b < √n, denote T_0 := ⌈(√n − b)/a⌉ = O(√n). When t ≤ T_0, we set the parameters to be τ_t = S_t = B_t = ⌈at + b⌉; when t > T_0, we set B_t = {1, . . . , n} and τ_t = S_t = ⌈√n⌉.
Then as long as η ≤ 4/( L_F + √(L_F² + 12 G_0) ),

    E[ ||G_η(x̄)||² ] ≤ O( ln T / T² )                if T ≤ T_0 ,
    E[ ||G_η(x̄)||² ] ≤ O( 1/( √n (T − T_0 + 1) ) )   if T > T_0 .                        (23)

As a result, the total sample complexity of Algorithm 1 for obtaining an ε-approximate solution is Õ( min{ √n ε^{-1}, ε^{-3/2} } ), where Õ(·) hides logarithmic factors.

4 Fast convergence rates under stronger conditions

In this section we consider two cases where fast linear convergence can be guaranteed for CIVR.

4.1 Gradient-dominant function

The first case is when r ≡ 0 and F is ν-gradient dominant, i.e., there is some ν > 0 such that

    F(x) − inf_y F(y) ≤ (ν/2) ||F'(x)||²,    ∀ x ∈ R^d.                                  (24)

Note that a μ-strongly convex function is (1/μ)-gradient dominant by this definition. Hence strong convexity is a special case of the gradient-dominant condition, which in turn is a special case of the Polyak-Łojasiewicz condition with the Łojasiewicz exponent equal to 2 [see, e.g., 14].

In order to solve (1) with a pre-fixed precision ε, we use a periodic restart strategy as depicted in Algorithm 2. For this restarted version of CIVR, we have the following results.

Algorithm 2: Restarted CIVR

input: initial point x̄_0, step size η > 0, number of restarts K, number of epochs T ≥ 1, and a set of triples {τ_t, B_t, S_t} for t = 1, . . . , T.
for k = 0, ..., K − 1 do
    Generate x̄_{k+1} by Algorithm 1, with parameters T, η, {τ_t, B_t, S_t} and initial point x̄_k.
end
output: x̄_K.

Theorem 5. Consider problem (1) with r ≡ 0. Suppose Assumptions 1 and 2 hold and F is ν-gradient dominant.
For Algorithm 2, given any ε > 0, let τ_t = S_t = ⌈1/√ε⌉, B_t = ⌈12 ν σ_0² / ε⌉ and T = ⌈16 ν / (η √ε)⌉. Then as long as η ≤ 4/( L_F + √(L_F² + 12 G_0) ),

    E[ F(x̄_{k+1}) − F* ] ≤ (1/2) ( F(x̄_k) − F* ) + (1/2) ε.                              (25)

Consequently, E[ F(x̄_k) − F* ] converges linearly to ε with a factor of 1/2 per period. The sample complexity for finding an ε-solution is O( (ν ε^{-1}) ln ε^{-1} ).

The restart strategy also applies to the finite-sum case.

Theorem 6. Consider problem (2) with r ≡ 0. Suppose Assumption 1 holds and F is ν-gradient dominant. In Algorithm 2, if we set τ_t = S_t = B_t = ⌈√n⌉ and T = ⌈16 ν / (√n η)⌉, then as long as η ≤ 4/( L_F + √(L_F² + 12 G_0) ),

    E[ F(x̄_{k+1}) − F* | x̄_k ] ≤ (1/2) ( F(x̄_k) − F* ).                                  (26)

As a result, the sample complexity for finding an ε-solution is O( (n + ν n^{1/2}) ln(1/ε) ).

It is worth noting that for both cases, the number of epochs T is proportional to η^{-1}. Taking more conservative values of η therefore directly results in worse complexity bounds.
This comment also applies to the optimally strongly convex case in the next section.

4.2 Optimally strongly convex function

In this part, we assume a μ-optimally strongly convex condition on the function Φ(x) = F(x) + r(x), i.e., there exists a μ > 0 such that

    Φ(x) − Φ(x*) ≥ (μ/2) ||x − x*||²,    ∀ x ∈ R^d.                                      (27)

We have the following two results for solving problems (1) and (2) respectively.

Theorem 7. Consider problem (1). Suppose Assumptions 1 and 2 hold and Φ is μ-optimally strongly convex. In Algorithm 2, let us set τ_t = S_t = ⌈1/√ε⌉, B_t = ⌈9 σ_0² / (2με)⌉ and T = ⌈5 / (μη √ε)⌉. Then if we choose η < 2/( L_F + √(L_F² + 36 G_0) ),

    E[ Φ(x̄_{k+1}) − Φ* ] ≤ (1/2) ( Φ(x̄_k) − Φ* ) + (1/2) ε.                              (28)

Consequently, E[ Φ(x̄_k) − Φ* ] converges linearly to ε. The total sample complexity for finding an ε-solution is O( μ^{-1} ε^{-1} ln ε^{-1} ).

Theorem 8. Consider the finite-sum problem (2). Suppose Assumption 1 holds and Φ is μ-optimally strongly convex. In Algorithm 2, let us set τ_t = S_t = B_t = ⌈√n⌉ and T = ⌈5 / (√n μη)⌉.
Then if we choose η < 2/( L_F + √(L_F² + 36 G_0) ),

    E[ Φ(x̄_{k+1}) − Φ* ] ≤ (1/2) ( Φ(x̄_k) − Φ* ).                                        (29)

The sample complexity of finding an ε-solution is O( (n + √n/(μη)) ln(1/ε) ).

Figure 1: Experiments on the risk-averse portfolio optimization problem.

If we define a condition number κ = L_F/μ, then since η = Θ(1/L_F), we have 1/(μη) = O(κ), and the above complexities become O( (κ ε^{-1}) ln ε^{-1} ) and O( (n + κ n^{1/2}) ln ε^{-1} ).

For Algorithm 2 in both the gradient-dominant and strongly convex cases, we have the following remarks.

Remark 1. In Algorithm 2, each run of Algorithm 1 includes a random selection of the output. On average this wastes half of the iterates. The waste can be prevented by pre-generating the "stop times" or output indices: we can stop Algorithm 1 and output the last iterate whenever the method hits this time.

Remark 2. In Algorithm 2, linear convergence is achieved by restarting. This strategy is proposed partly due to the epoch structure of Algorithm 1. Therefore, if we break this epoch structure using the loopless variance-reduction techniques introduced in [16], the restarts may be avoided.

5 Numerical Experiments

In this section, we present numerical experiments on a risk-averse portfolio optimization problem. Suppose there are d assets that one can invest in during n time periods labeled {1, ..., n}. Let R_{i,j} be the return or payoff per unit of asset j at time i, and let R_i be the vector consisting of R_{i,1}, . . . , R_{i,d}. Let x ∈ R^d be the decision variable, where each component x_j represents the amount of investment, or the percentage of the total investment, allocated to asset j, for j = 1, . . . , d. The same allocations or percentages of allocations are repeated over the n time periods.
We would like to maximize the average return over the n periods, but with a penalty on the variance of the returns across the n periods (in other words, we would like different periods to have similar returns). This problem is formulated as a mean-variance trade-off:

    maximize_{x∈R^d}  { E[h_ξ(x)] − λ Var(h_ξ(x)) + r(x) ≡ E[h_ξ(x)] − λ(E[h_ξ(x)²] − E[h_ξ(x)]²) + r(x) },   (5)

where the random variable ξ ∈ {1, . . . , n} takes discrete values uniformly at random and hence makes the problem a finite sum. The functions h_i(x) = ⟨R_i, x⟩ for i = 1, . . . , n are the rewards. The function r can be chosen as the indicator function of an ℓ1 ball, or a soft ℓ1 regularization term. We choose the latter in our experiments to obtain a sparse asset allocation. By using the mappings

    g_ξ(x) : R^d → R²,  g_ξ(x) = [h_ξ(x)  h_ξ(x)²]^T,    f(y, z) : R² → R,  f(y, z) = −y − λy² + λz,

it can be further transformed into the composite finite-sum problem (2), and hence readily solved by the CIVR method. Here the intermediate dimension is very low (p = 2), which leads to very little computational overhead compared with stochastic optimization without composition.

For comparison, we implement the C-SAGA algorithm [35] as a benchmark. As another benchmark, this problem can also be formulated as a two-layer composite finite-sum problem (7), which was done in [11] and [19]. We solve the two-layer formulation by ASC-PG [32] and VRSC-PG [11]. Finally, we also implemented CIVR-adp, the adaptive-sampling variant described in Theorem 4.

We test these algorithms on three real-world portfolio datasets, which contain 30, 38 and 49 industrial portfolios respectively, from the Kenneth R. French Data Library¹. For the three datasets, the daily data of the most recent 24452, 10000 and 24400 days are extracted respectively to conduct the experiments. We set the parameter λ = 0.2 in (5) and use an ℓ1 regularization r(x) = 0.01‖x‖₁. The experiment results are shown in Figure 1; for each dataset (Industrial-30/38/49), its panels plot a stationarity measure ‖∇F(x_k) + ∂r(x_k)‖ and the optimality gap (F(x_k) + r(x_k)) − (F(x∗) + r(x∗)) for C-SAGA, VRSC-PG, ASC-PG, CIVR and CIVR-adp. The curves are averaged over 20 runs and are plotted against the number of samples of the component functions (the horizontal axis).

Throughout the experiments, the VRSC-PG and C-SAGA algorithms use the batch size S = ⌈n^{2/3}⌉ while CIVR uses the batch size S = ⌈√n⌉, all dictated by their complexity theory. CIVR-adp employs the adaptive batch size S_t = ⌈min{10t + 1, √n}⌉ for t = 1, . . . , T. For the Industrial-30 dataset, all of VRSC-PG, C-SAGA, CIVR and CIVR-adp use the same step size η = 0.1, chosen from the set η ∈ {1, 0.1, 0.01, 0.001, 0.0001} by experiments; η = 0.1 works best for all four tested methods simultaneously. Similarly, η = 0.001 is chosen for the Industrial-38 dataset and η = 0.0001 for the Industrial-49 dataset. For ASC-PG, we set its step size parameters α_k = 0.001/k and β_k = 1/k [see details in 32]; they are hand-tuned to ensure that ASC-PG converges fast among a range of tested parameters. Overall, CIVR and CIVR-adp outperform the other methods.

References

[1] Zeyuan Allen-Zhu. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 89–97, Sydney, Australia, 2017.

[2] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems 31, pages 2675–2686. Curran Associates, Inc., 2018.

[3] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 699–707, 2016.

[4] Amir Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization. SIAM, 2017.

[5] Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research, 15(1):809–883, 2014.

[6] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives.
In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014.

[7] Darinka Dentcheva, Spiridon Penev, and Andrzej Ruszczyński. Statistical estimation of composite risk functionals and risk optimization problems. Annals of the Institute of Statistical Mathematics, 69(4):737–760, 2017.

[8] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems 31, pages 689–699. Curran Associates, Inc., 2018.

[9] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[10] Saeed Ghadimi, Andrzej Ruszczyński, and Mengdi Wang. A single time-scale stochastic approximation method for nested stochastic optimization. Preprint, arXiv:1812.01094, 2018.

[11] Zhouyuan Huo, Bin Gu, Ji Liu, and Heng Huang. Accelerated method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 3287–3294, 2018.

¹http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

[12] A. N. Iusem, A. Jofré, R. I. Oliveira, and P. Thompson. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 27(2):686–724, 2017.

[13] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.

[14] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition.
In Machine Learning and Knowledge Discovery in Databases - European Conference, Proceedings, pages 795–811, 2016.

[15] J. Koshal, A. Nedić, and U. V. Shanbhag. Regularized iterative stochastic approximation methods for stochastic variational inequality problems. IEEE Transactions on Automatic Control, 58(3):594–609, 2013.

[16] Dmitry Kovalev, Samuel Horváth, and Peter Richtárik. Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. arXiv preprint, arXiv:1901.08689, 2019.

[17] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I. Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30, pages 2348–2358. Curran Associates, Inc., 2017.

[18] Zhize Li and Jian Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems 31, pages 5564–5574. Curran Associates, Inc., 2018.

[19] Xiangru Lian, Mengdi Wang, and Ji Liu. Finite-sum composition optimization via variance reduced gradient descent. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1159–1167, 2017.

[20] Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

[21] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research (PMLR), pages 2613–2621, Sydney, Australia, 2017.

[22] Lam M. Nguyen, Marten van Dijk, Dzung T. Phan, Phuong Ha Nguyen, Tsui-Wei Weng, and Jayant R. Kalagnanam. Finite-sum smooth optimization with SARAH.
arXiv preprint, arXiv:1901.07648, 2019.

[23] Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, and Quoc Tran-Dinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization. arXiv preprint, arXiv:1902.05679, 2019.

[24] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 314–323, New York, New York, USA, 2016.

[25] Sashank J. Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for smooth nonconvex optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1971–1977. IEEE, 2016.

[26] Sashank J. Reddi, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems 29, pages 1145–1153, 2016.

[27] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

[28] R. Tyrrell Rockafellar. Coherent approaches to risk in optimization under uncertainty. INFORMS TutORials in Operations Research, 2007.

[29] Andrzej Ruszczyński. Advances in risk-averse optimization. INFORMS TutORials in Operations Research, 2013.

[30] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[31] Mengdi Wang, Ethan X. Fang, and Han Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017.

[32] Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stochastic composition optimization. Journal of Machine Learning Research, 18(105):1–23, 2017.

[33] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh.
SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint, arXiv:1810.10690, 2018.

[34] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[35] Junyu Zhang and Lin Xiao. A composite randomized incremental gradient method. In Proceedings of the 36th International Conference on Machine Learning (ICML), number 97 in Proceedings of Machine Learning Research (PMLR), Long Beach, California, 2019.

[36] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems 31, pages 3921–3932. Curran Associates, Inc., 2018.