{"title": "Efficient Smooth Non-Convex Stochastic Compositional Optimization via Stochastic Recursive Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 6929, "page_last": 6937, "abstract": "Stochastic compositional optimization arises in many important machine learning tasks such as reinforcement learning and portfolio management. The objective function is the composition of two expectations of stochastic functions, and is more challenging to optimize than vanilla stochastic optimization problems. In this paper, we investigate the stochastic compositional optimization in the general smooth non-convex setting. We employ a recently developed idea of \\textit{Stochastic Recursive Gradient Descent} to design a novel algorithm named SARAH-Compositional, and prove a sharp Incremental First-order Oracle (IFO) complexity upper bound for stochastic compositional optimization: $\\mathcal{O}((n+m)^{1/2} \\varepsilon^{-2})$ in the finite-sum case and $\\mathcal{O}(\\varepsilon^{-3})$ in the online case. Such a complexity is known to be the best one among IFO complexity results for non-convex stochastic compositional optimization. Numerical experiments validate the superior performance of our algorithm and theory.", "full_text": "Efficient Smooth Non-Convex Stochastic Compositional Optimization via Stochastic Recursive Gradient Descent\n\nWenqing Hu\nMissouri University of Science and Technology\nhuwen@mst.edu\n\nChris Junchi Li\nTencent AI Lab\njunchi.li.duke@gmail.com\n\nXiangru Lian\nUniversity of Rochester\nadmin@mail.xrlian.com\n\nJi Liu\nUniversity of Rochester & Kwai Inc.\nji.liu.uwisc@gmail.com\n\nHuizhuo Yuan\u2217\nPeking University\nhzyuan@pku.edu.cn\n\nAbstract\n\nStochastic compositional optimization arises in many important machine learning applications. 
The objective function is the composition of two expectations of stochastic functions, and is more challenging to optimize than vanilla stochastic optimization problems. In this paper, we investigate the stochastic compositional optimization in the general smooth non-convex setting. We employ a recently developed idea of Stochastic Recursive Gradient Descent to design a novel algorithm named SARAH-Compositional, and prove a sharp Incremental First-order Oracle (IFO) complexity upper bound for stochastic compositional optimization: $\\mathcal{O}((n+m)^{1/2}\\varepsilon^{-2})$ in the finite-sum case and $\\mathcal{O}(\\varepsilon^{-3})$ in the online case. Such a complexity is known to be the best one among IFO complexity results for non-convex stochastic compositional optimization. Numerical experiments on risk-averse portfolio management validate the superiority of SARAH-Compositional over a few rival algorithms.\n\n1 Introduction\n\nWe consider the general smooth, non-convex compositional optimization problem of minimizing the composition of two expectations of stochastic functions:\n\n$$\\min_{x \\in \\mathbb{R}^d} \\{\\Phi(x) \\equiv (f \\circ g)(x)\\}, \\tag{1}$$\n\nwhere the outer and inner functions $f : \\mathbb{R}^l \\to \\mathbb{R}$ and $g : \\mathbb{R}^d \\to \\mathbb{R}^l$ are defined as $f(y) := \\mathbb{E}_v[f_v(y)]$ and $g(x) := \\mathbb{E}_w[g_w(x)]$, $v$ and $w$ are index random variables, and each component $f_v$, $g_w$ is smooth but not necessarily convex. Compositional optimization can be used to formulate many important machine learning problems, e.g. reinforcement learning (Sutton and Barto, 1998), risk management (Dentcheva et al., 2017), multi-stage stochastic programming (Shapiro et al., 2009), deep neural networks (Yang et al., 2019), etc. 
We list a specific application instance that can be written in the stochastic compositional form of (1), namely the risk-averse portfolio management problem, formulated as\n\n$$\\min_{x \\in \\mathbb{R}^N} \\; -\\frac{1}{T}\\sum_{t=1}^{T} \\langle r_t, x \\rangle + \\frac{1}{T}\\sum_{t=1}^{T} \\left( \\langle r_t, x \\rangle - \\frac{1}{T}\\sum_{s=1}^{T} \\langle r_s, x \\rangle \\right)^2, \\tag{2}$$\n\nwhere $r_t \\in \\mathbb{R}^N$ denotes the returns of $N$ assets at time $t$, and $x \\in \\mathbb{R}^N$ denotes the investment quantity corresponding to $N$ assets. The goal is to maximize the return while controlling the variance of the portfolio. (2) can be written as a compositional optimization problem with two functions\n\n$$g(x) = \\left[ x_1, x_2, \\ldots, x_N, \\frac{1}{T}\\sum_{s=1}^{T} \\langle r_s, x \\rangle \\right]^\\top, \\tag{3}$$\n\n$$f(w) = -\\frac{1}{T}\\sum_{t=1}^{T} \\langle r_t, w_{\\backslash (N+1)} \\rangle + \\frac{1}{T}\\sum_{t=1}^{T} \\left( \\langle r_t, w_{\\backslash (N+1)} \\rangle - w_{N+1} \\right)^2, \\tag{4}$$\n\nwhere $w_{\\backslash (N+1)}$ denotes the (column) subvector consisting of the first $N$ coordinates of $w$, and $w_{N+1}$ denotes the $(N+1)$-th coordinate of $w$.\n\n\u2217Partial work was performed when the author was an intern at Tencent AI Lab. Full version of this paper is available at: http://arxiv.org/abs/1912.13515\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAlgorithm | Finite-sum | Online\nSCGD (Wang et al., 2017a) | unknown | $\\varepsilon^{-8}$\nAcc-SCGD (Wang et al., 2017a) | unknown | $\\varepsilon^{-7}$\nSCGD (Wang et al., 2017b) | unknown | $\\varepsilon^{-4.5}$\nSCVR / SC-SCSG (Liu et al., 2017) | $(n+m)^{4/5}\\varepsilon^{-2}$ | $\\varepsilon^{-3.6}$\nVRSC-PG (Huo et al., 2018) | $(n+m)^{2/3}\\varepsilon^{-2}$ | unknown\nSARAH-Compositional (this work) | $(n+m)^{1/2}\\varepsilon^{-2}$ | $\\varepsilon^{-3}$\n\nTable 1: Comparison of IFO complexities of different algorithms for the general non-convex problem.\n\n
Compared with the vanilla stochastic optimization problem, where the optimizer is allowed to access the stochastic gradients, the stochastic compositional problem (1) is more difficult to solve. Classical algorithms for solving (1) are often more computationally challenging. This is mainly due to the nonlinear structure of the composition function with respect to the random index pair $(v, w)$. Treating the objective function as an expectation $\\mathbb{E}_v f_v(g(x))$, computing each iterate of the gradient estimation involves recalculating $g(x) = \\mathbb{E}_w g_w(x)$, which is either time-consuming or impractical. To tackle such weakness in practice, Wang et al. (2017a) first introduced a two-time-scale algorithm called Stochastic Compositional Gradient Descent (SCGD) along with its accelerated (in Nesterov\u2019s sense) variant Acc-SCGD, and provided a first convergence rate analysis for that problem. Subsequently, Wang et al. (2017b) proposed the accelerated stochastic compositional proximal gradient algorithm (ASC-PG), which improves over the upper bound complexities in Wang et al. (2017a). Furthermore, variance reduced gradient methods designed specifically for compositional optimization in non-convex settings arose from Liu et al. (2017) and were later generalized to the nonsmooth setting (Huo et al., 2018). These approaches aim at getting variance reduced estimators of $g$, $\\partial g$ and $[\\partial g(x)]^\\top \\nabla f(g(x))$, respectively. Such success signals the necessity and possibility of designing a special algorithm for non-convex objectives with better convergence rates.\n\nIn this paper, we propose an efficient algorithm called SARAH-Compositional for the stochastic compositional optimization problem (1). 
For notational simplicity, we let $n, m \\ge 1$ and the index pair $(v, w)$ be uniformly distributed over the product set $[1, n] \\times [1, m]$, i.e.\n\n$$\\Phi(x) = \\frac{1}{n}\\sum_{i=1}^{n} f_i\\left( \\frac{1}{m}\\sum_{j=1}^{m} g_j(x) \\right). \\tag{5}$$\n\nWe use the same notation for the online case, in which case either $n$ or $m$ can be infinite.\n\nA fundamental theoretical question for stochastic compositional optimization is the Incremental First-order Oracle (IFO) complexity bound, where the IFO complexity counts the number of individual gradient and function evaluations (see Definition 1 in \u00a72 for a precise definition). Our new SARAH-Compositional algorithm is developed by integrating the iteration of Stochastic Recursive Gradient Descent (Nguyen et al., 2017), shortened as SARAH,$^2$ with the stochastic compositional optimization formulation (Wang et al., 2017a).\n\n$^2$This is also referred to as the stochastic recursive variance reduction method, the incremental variance reduction method, or SPIDER-BOOST in the recent literature. We name the algorithm after SARAH to respect what is, to our best knowledge, the earliest discovery of that algorithm.\n\nThe motivation of this approach is that SARAH with a specific choice of stepsizes is known to be optimal in stochastic optimization and is regarded as a cutting-edge variance reduction technique, with significantly lower oracle access complexities than earlier variance reduction methods (Fang et al., 2018). We prove that SARAH-Compositional can reach an IFO computational complexity of $\\mathcal{O}(\\min((n+m)^{1/2}\\varepsilon^{-2}, \\varepsilon^{-3}))$, improving the best known result of $\\mathcal{O}(\\min((n+m)^{2/3}\\varepsilon^{-2}, \\varepsilon^{-3.6}))$ in non-convex compositional optimization. 
See Table 1 for detailed comparison.\n\nRelated Works Classical first-order methods such as gradient descent (GD), accelerated gradient descent (AGD) and stochastic gradient descent (SGD) have received intensive attention in both convex and non-convex optimization (Nesterov, 2004; Ghadimi and Lan, 2016; Li and Lin, 2015). When the objective can be written in a finite-sum or online/expectation structure, variance-reduced gradient (a.k.a. variance reduction) techniques including SAG (Schmidt et al., 2017), SVRG (Xiao and Zhang, 2014; Allen-Zhu and Hazan, 2016; Reddi et al., 2016), SDCA (Shalev-Shwartz and Zhang, 2013, 2014), SAGA (Defazio et al., 2014), SCSG (Lei et al., 2017), SNVRG (Zhou et al., 2018), SARAH/SPIDER (Nguyen et al., 2017; Fang et al., 2018; Wang et al., 2019; Nguyen et al., 2019), etc., can be employed to improve the theoretical convergence properties of classical first-order algorithms. Notably in the smooth nonconvex setting, Fang et al. (2018) recently proposed the SPIDER-SFO algorithm, which non-trivially hybridizes the iteration of stochastic recursive gradient descent (SARAH) (Nguyen et al., 2017) with normalized gradient descent. In the representative case of batch-size 1, SPIDER-SFO adopts a small step-length that is proportional to $\\varepsilon^2 \\wedge \\varepsilon n^{-1/2}$ where $\\varepsilon$ is the squared targeted accuracy, and (by rebooting the SPIDER tracking iteration once every $n \\wedge \\mathcal{O}(\\varepsilon^{-2})$ iterates) the variance of the stochastic estimator can be constantly controlled by $\\mathcal{O}(\\varepsilon^2)$. For the purpose of finding an $\\varepsilon$-accurate solution, recent works Wang et al. (2019); Nguyen et al. 
(2019) discovered two variants of the SARAH algorithm that achieve the same complexity as SPIDER-SFO (Fang et al., 2018) and SNVRG (Zhou et al., 2018).$^3$ The theoretical convergence property of SARAH/SPIDER methods in the smooth non-convex case outperforms that of SVRG, and is provably optimal under a set of mild assumptions (Arjevani et al., 2019; Fang et al., 2018; Nguyen et al., 2019; Wang et al., 2019).\n\nIt turns out that when solving the compositional optimization problem (1), classical first-order methods for optimizing a single objective function are either not applicable or require at least $\\mathcal{O}(m)$ queries to calculate the inner function $g$. To remedy this issue, Wang et al. (2017a,b) considered the stochastic setting and proposed the SCGD algorithm to calculate or estimate the inner finite-sum more efficiently, achieving a polynomial rate that is independent of $m$. Later on, Lian et al. (2017); Liu et al. (2017); Huo et al. (2018) and Lin et al. (2018) merged the SVRG method into the compositional optimization framework to do variance reduction on all three steps of the estimation. In stark contrast, our work adopts the SARAH/SPIDER method, which is theoretically more efficient than the SVRG method in the non-convex compositional optimization setting.\n\nContributions This work makes two contributions as follows. First, we propose a new algorithm for stochastic compositional optimization called SARAH-Compositional, which operates SARAH/SPIDER-type recursive variance reduction to estimate relevant quantities. Second, we conduct theoretical analysis for both online and finite-sum cases, which verifies the superiority of SARAH-Compositional over the best known previous results. In the finite-sum case, we obtain a complexity of $(n+m)^{1/2}\\varepsilon^{-2}$ which improves over the best known complexity $(n+m)^{2/3}\\varepsilon^{-2}$ achieved by Huo et al. (2018). 
In the online case we obtain a complexity of $\\varepsilon^{-3}$ which improves the best known complexity $\\varepsilon^{-3.6}$ obtained in Liu et al. (2017).\n\nNotational Conventions Throughout the paper, we treat the parameters $L_g, L_f, L_\\Phi, M_g, M_f, \\Delta$ and $\\sigma$ as global constants. Let $\\|\\cdot\\|$ denote the Euclidean norm of a vector or the operator norm of a matrix induced by the Euclidean norm, and let $\\|\\cdot\\|_F$ denote the Frobenius norm. For fixed $T \\ge t \\ge 0$ let $x_{t:T}$ denote the sequence $\\{x_t, \\ldots, x_T\\}$. Let $\\mathbb{E}_t[\\cdot]$ denote the conditional expectation $\\mathbb{E}[\\cdot \\mid x_0, x_1, \\ldots, x_t]$. Let $[1, n] = \\{1, \\ldots, n\\}$ and let $S$ denote the cardinality of a multi-set $\\mathcal{S} \\subseteq [1, n]$ of samples (a generic set that permits repeated instances). The averaged sub-sampled stochastic estimator is denoted as $A_{\\mathcal{S}} = (1/S)\\sum_{i \\in \\mathcal{S}} A_i$ where the summation counts repeated instances. We denote $p_n = \\mathcal{O}(q_n)$ if there exist some constants $0 < c < C < \\infty$ such that $c q_n \\le p_n \\le C q_n$ as $n$ becomes large. Other notations are explained at their first appearances.\n\n$^3$Wang et al. (2019) name their algorithm SPIDER-BOOST since it can be seen as the SPIDER-SFO algorithm with relaxed step-length restrictions.\n\nOrganization The rest of our paper is organized as follows. \u00a72 formally poses our algorithm and assumptions. \u00a73 presents the convergence rate theorem and \u00a74 presents numerical experiments that apply our algorithm to the task of portfolio management. We conclude our paper in \u00a75. Proofs of convergence results for finite-sum and online cases and auxiliary lemmas are deferred to \u00a7A and \u00a7B in the supplementary material.\n\n2 SARAH for Stochastic Compositional Optimization\n\nRecall our goal is to solve the compositional optimization problem (1), i.e. 
to minimize $\\Phi(x) = f(g(x))$ where\n\n$$f(y) := \\frac{1}{n}\\sum_{i=1}^{n} f_i(y), \\qquad g(x) := \\frac{1}{m}\\sum_{j=1}^{m} g_j(x).$$\n\nHere for each $j \\in [1, m]$ and $i \\in [1, n]$ the functions $g_j : \\mathbb{R}^d \\to \\mathbb{R}^l$ and $f_i : \\mathbb{R}^l \\to \\mathbb{R}$. We can formally take the derivative of the function $\\Phi(x)$ and obtain (via the chain rule) the gradient descent iteration\n\n$$x_{t+1} = x_t - \\eta [\\partial g(x_t)]^\\top \\nabla f(g(x_t)), \\tag{6}$$\n\nwhere the $\\partial$ operator computes the Jacobian matrix of the smooth mapping, and the gradient operator $\\nabla$ is only taken with respect to the first-level variable. As discussed in \u00a71, it can be either impossible (online case) or time-consuming (finite-sum case) to estimate the terms $\\partial g(x_t) = \\frac{1}{m}\\sum_{j=1}^{m} \\partial g_j(x_t)$ and $g(x_t) = \\frac{1}{m}\\sum_{j=1}^{m} g_j(x_t)$ in the iteration scheme (6). In this paper, we design a novel algorithm (SARAH-Compositional) based on the Stochastic Compositional Variance Reduced Gradient method (see Lin et al. (2018)) yet hybridized with the stochastic recursive gradient method (Nguyen et al., 2017). As the readers will see later, our SARAH-Compositional achieves a lower IFO complexity than the existing algorithms for non-convex compositional optimization listed in Table 1.\n\nWe introduce some definitions and assumptions. First, we assume the algorithm has access to an Incremental First-order Oracle (IFO) in our black-box environment (Lin et al., 2018); also see (Agarwal and Bottou, 2015; Woodworth and Srebro, 2016) for the vanilla optimization case:\n\nDefinition 1 (IFO). (Lin et al., 2018) The Incremental First-order Oracle (IFO) returns, when some $x \\in \\mathbb{R}^d$ and $j \\in [1, m]$ are inputted, the vector-matrix pair $[g_j(x), \\partial g_j(x)]$ or when some $y \\in \\mathbb{R}^l$ and $i \\in [1, n]$ are inputted, the scalar-vector pair $[f_i(y), \\nabla f_i(y)]$.\n\nSecond, our goal in this work is to find an $\\varepsilon$-accurate solution, defined as\n\nDefinition 2 ($\\varepsilon$-accurate solution). 
We call $x \\in \\mathbb{R}^d$ an $\\varepsilon$-accurate solution to problem (1), if\n\n$$\\|\\nabla \\Phi(x)\\| \\le \\varepsilon. \\tag{7}$$\n\nIt is worth remarking here that the inequality (7) can be modified to $\\|\\nabla \\Phi(x)\\| \\le C\\varepsilon$ for some global constant $C > 0$ without hurting the magnitude of IFO complexity bounds.\n\nLet us first make some assumptions regarding each component of the (compositional) objective function. Analogous to Assumption 1(i) in Fang et al. (2018), we make the following finite gap assumption:\n\nAssumption 1 (Finite gap). We assume that the algorithm is initialized at $x_0 \\in \\mathbb{R}^d$ with\n\n$$\\Delta := \\Phi(x_0) - \\Phi^* < \\infty, \\tag{8}$$\n\nwhere $\\Phi^*$ denotes the global minimum value of $\\Phi(x)$.\n\nWe make the following smoothness and boundedness assumptions, which are standard in the recent compositional optimization literature (e.g. Lian et al. (2017); Huo et al. (2018); Lin et al. (2018)).\n\nAssumption 2 (Smoothness). There exist Lipschitz constants $L_g, L_f, L_\\Phi > 0$ such that for $i \\in [1, n]$, $j \\in [1, m]$ we have\n\n$$\\begin{aligned} \\|\\partial g_j(x) - \\partial g_j(x')\\|_F &\\le L_g \\|x - x'\\| && \\text{for } x, x' \\in \\mathbb{R}^d, \\\\ \\|\\nabla f_i(y) - \\nabla f_i(y')\\| &\\le L_f \\|y - y'\\| && \\text{for } y, y' \\in \\mathbb{R}^l, \\\\ \\left\\| [\\partial g_j(x)]^\\top \\nabla f_i(g(x)) - [\\partial g_j(x')]^\\top \\nabla f_i(g(x')) \\right\\| &\\le L_\\Phi \\|x - x'\\| && \\text{for } x, x' \\in \\mathbb{R}^d. \\end{aligned} \\tag{9}$$\n\nAlgorithm 1 SARAH-Compositional, Online Case (resp. Finite-Sum Case)\n\nInput: $T, q, x_0, \\eta, S_1^L, S_2^L, S_3^L$\nfor $t = 0$ to $T - 1$ do\n  if $\\mathrm{mod}(t, q) = 0$ then\n    Draw $S_1^L$ indices with replacement $\\mathcal{S}_{1,t}^L \\subseteq [1, m]$ and let $g_t = \\frac{1}{S_1^L} \\sum_{j \\in \\mathcal{S}_{1,t}^L} g_j(x_t)$ (resp. $g_t = g(x_t)$ in finite-sum case)\n    Draw $S_2^L$ indices with replacement $\\mathcal{S}_{2,t}^L \\subseteq [1, m]$ and let $G_t = \\frac{1}{S_2^L} \\sum_{j \\in \\mathcal{S}_{2,t}^L} \\partial g_j(x_t)$ (resp. $G_t = \\partial g(x_t)$ in finite-sum case)\n    Draw $S_3^L$ indices with replacement $\\mathcal{S}_{3,t}^L \\subseteq [1, n]$ and let $F_t = (G_t)^\\top \\left[ \\frac{1}{S_3^L} \\sum_{i \\in \\mathcal{S}_{3,t}^L} \\nabla f_i(g_t) \\right]$ (resp. $F_t = (G_t)^\\top \\nabla f(g_t)$ in finite-sum case)\n  else\n    Draw one index $j_t \\in [1, m]$ and let $g_t = g_{j_t}(x_t) - g_{j_t}(x_{t-1}) + g_{t-1}$ and $G_t = \\partial g_{j_t}(x_t) - \\partial g_{j_t}(x_{t-1}) + G_{t-1}$\n    Draw one index $i_t \\in [1, n]$ and let $F_t = (G_t)^\\top \\nabla f_{i_t}(g_t) - (G_{t-1})^\\top \\nabla f_{i_t}(g_{t-1}) + F_{t-1}$\n  end if\n  Update $x_{t+1} = x_t - \\eta F_t$\nend for\nreturn Output $\\tilde{x}$ chosen uniformly at random from $\\{x_t\\}_{t=0}^{T-1}$\n\n
Here for the purpose of using stochastic recursive estimation of $\\partial g(x)$, we slightly strengthen the smoothness assumption by adopting the Frobenius norm on the left-hand side of the first line of (9).\n\nAssumption 3 (Boundedness). There exist boundedness constants $M_g, M_f > 0$ such that for $i \\in [1, n]$, $j \\in [1, m]$ we have\n\n$$\\begin{aligned} \\|\\partial g_j(x)\\| &\\le M_g && \\text{for } x \\in \\mathbb{R}^d, \\\\ \\|\\nabla f_i(y)\\| &\\le M_f && \\text{for } y \\in \\mathbb{R}^l. \\end{aligned} \\tag{10}$$\n\nNotice that applying the mean-value theorem for vector-valued functions to (10) gives another Lipschitz condition\n\n$$\\|g_j(x) - g_j(x')\\| \\le M_g \\|x - x'\\| \\quad \\text{for } x, x' \\in \\mathbb{R}^d, \\tag{11}$$\n\nand analogously for $f_i(y)$. It turns out that under the above two assumptions, a choice of $L_\\Phi$ in (9) can be expressed as a polynomial of $L_f, L_g, M_f, M_g$. For clarity purposes in the rest of this paper, we adopt the following typical choice of $L_\\Phi$\n\n$$L_\\Phi \\equiv M_f L_g + M_g^2 L_f, \\tag{12}$$\n\nwhose applicability can be verified via a simple application of the chain rule. 
We integrate both finite-sum and online cases into one algorithm SARAH-Compositional and write it in Algorithm 1.\n\n3 Convergence Rate Analysis\n\nIn this section, we aim to justify that our proposed SARAH-Compositional algorithm provides IFO complexities of $\\mathcal{O}((n+m)^{1/2}\\varepsilon^{-2})$ in the finite-sum case and $\\mathcal{O}(\\varepsilon^{-3})$ in the online case, which improves on the concurrent and comparable algorithms (see more in Table 1).\n\nLet us first analyze the convergence in the finite-sum case. In this case we have $\\mathcal{S}_1^L = [1, m]$, $\\mathcal{S}_2^L = [1, m]$, $\\mathcal{S}_3^L = [1, n]$. Involved analysis leads us to conclude\n\nTheorem 1 (Finite-sum case). Suppose Assumptions 1, 2 and 3 in \u00a72 hold, let $\\mathcal{S}_1^L = \\mathcal{S}_2^L = [1, m]$, $\\mathcal{S}_3^L = [1, n]$, $q = (2m + n)/3$, and set the stepsize\n\n$$\\eta = \\frac{1}{\\sqrt{6(2m+n)\\left(M_g^4 L_f^2 + M_f^2 L_g^2\\right)}}. \\tag{13}$$\n\nThen for the finite-sum case, SARAH-Compositional Algorithm 1 outputs an $\\tilde{x}$ satisfying $\\mathbb{E}\\|\\nabla \\Phi(\\tilde{x})\\|^2 \\le \\varepsilon^2$ in\n\n$$\\sqrt{2m+n} \\cdot \\sqrt{M_g^4 L_f^2 + M_f^2 L_g^2} \\cdot \\frac{\\sqrt{24}\\,[\\Phi(x_0) - \\Phi^*]}{\\varepsilon^2} \\tag{14}$$\n\niterates. The IFO complexity to achieve an $\\varepsilon$-accurate solution is bounded by\n\n$$2m + n + \\sqrt{2m+n} \\cdot \\sqrt{M_g^4 L_f^2 + M_f^2 L_g^2} \\cdot \\frac{\\sqrt{1944}\\,[\\Phi(x_0) - \\Phi^*]}{\\varepsilon^2}. \\tag{15}$$\n\nTheorem 1 allows us to achieve an $\\varepsilon$-accurate solution, and a simple application of Markov\u2019s inequality allows us to derive high-probability results for achieving $\\varepsilon$-accurate solutions. Compared with Fang et al. 
(2018), one observes that Theorem 1 indicates an IFO complexity upper bound of $\\mathcal{O}(m + n + (m+n)^{1/2}\\varepsilon^{-2})$ to achieve an $\\varepsilon$-accurate solution, sharing a similar form with that of SARAH/SPIDER for non-convex stochastic optimization when $m + n$ is regarded as the number of individual functions.$^4$ SPIDER-SFO (as a SARAH variant) is optimal in both finite-sum and online cases, in the sense that it matches the theoretical lower bound (Fang et al., 2018; Arjevani et al., 2019), which makes it tempting to claim that our proposed SARAH-Compositional as its extension is also optimal. We emphasize that the set of assumptions for compositional optimization is different from vanilla optimization, and claiming optimality of the IFO complexity requires a corresponding lower bound result, left as a future direction to explore.\n\nLet us then analyze the convergence in the online case, where we sample minibatches $\\mathcal{S}_1^L, \\mathcal{S}_2^L, \\mathcal{S}_3^L$ of relevant quantities instead of the ground truth once every $q$ iterates. To characterize the estimation error, we put in one additional finite variance assumption:\n\nAssumption 4 (Finite Variance). We assume that there exist $H_1$, $H_2$ and $H_3$ as the upper bounds on the variance of $g(x)$, $\\partial g(x)$, and $\\nabla f(y)$, respectively, such that\n\n$$\\begin{aligned} \\mathbb{E}\\|g_j(x) - g(x)\\|^2 &\\le H_1 && \\text{for } x \\in \\mathbb{R}^d, \\\\ \\mathbb{E}\\|\\partial g_j(x) - \\partial g(x)\\|^2 &\\le H_2 && \\text{for } x \\in \\mathbb{R}^d, \\\\ \\mathbb{E}\\|\\nabla f_i(y) - \\nabla f(y)\\|^2 &\\le H_3 && \\text{for } y \\in \\mathbb{R}^l. \\end{aligned} \\tag{16}$$\n\nFrom Assumptions 2 and 3 we can easily verify, via the triangle inequality and convexity of the norm, that $H_2$ can be chosen as $4M_g^2$ and $H_3$ can be chosen as $4M_f^2$. On the contrary, $H_1$ cannot be represented as a function of boundedness and smoothness constants. We conclude the following theorem for the online case:\n\nTheorem 2 (Online case). 
Suppose Assumptions 1, 2, 3 and 4 in \u00a72 hold, let $S_1^L = \\frac{3H_1 M_g^2 L_f^2}{\\varepsilon^2}$, $S_2^L = \\frac{3H_2 M_f^2}{\\varepsilon^2}$, $S_3^L = \\frac{3H_3 M_g^2}{\\varepsilon^2}$, let $q = \\frac{D_0}{3\\varepsilon^2}$ where we denote the noise-relevant parameter\n\n$$D_0 := 3\\left(H_1 M_g^2 L_f^2 + H_2 M_f^2 + H_3 M_g^2\\right), \\tag{17}$$\n\nand set the stepsize\n\n$$\\eta = \\frac{\\varepsilon}{\\sqrt{6D_0\\left(M_g^4 L_f^2 + M_f^2 L_g^2\\right)}}. \\tag{18}$$\n\nThen for the online case, SARAH-Compositional Algorithm 1 outputs an $\\tilde{x}$ satisfying $\\mathbb{E}\\|\\nabla \\Phi(\\tilde{x})\\|^2 \\le 2\\varepsilon^2$ in\n\n$$\\sqrt{D_0} \\cdot \\sqrt{M_g^4 L_f^2 + M_f^2 L_g^2} \\cdot \\frac{\\sqrt{24}\\,[\\Phi(x_0) - \\Phi^*]}{\\varepsilon^3} \\tag{19}$$\n\niterates. The IFO complexity to achieve an $\\varepsilon$-accurate solution is bounded by\n\n$$\\frac{D_0}{\\varepsilon^2} + \\sqrt{D_0} \\cdot \\sqrt{M_g^4 L_f^2 + M_f^2 L_g^2} \\cdot \\frac{\\sqrt{1944}\\,[\\Phi(x_0) - \\Phi^*]}{\\varepsilon^3}. \\tag{20}$$\n\nWe see that in the online case, the IFO complexity to achieve an $\\varepsilon$-accurate solution is upper bounded by $\\mathcal{O}(\\varepsilon^{-3})$. Due to space limits, the detailed proofs of Theorems 1 and 2 are deferred to the supplementary material.\n\n$^4$Here and below, the smoothness and boundedness parameters and $\\Phi(x_0) - \\Phi^*$ are treated as constants.\n\nFigure 1: Experiment on the portfolio management. The x-axis is the number of gradients calculations divided by the number of samples, the y-axis is the function value gap.\n\n4 Experiments\n\nIn this section, we study the performance of our algorithm on the risk-averse portfolio management problem and conduct numerical experiments to support our theory.$^5$ We follow the setups in Huo et al. (2018); Liu et al. (2017) and compare with existing algorithms for compositional optimization. Readers are referred to Wang et al. 
(2017a) for more tasks to which our algorithm can potentially be applied.\n\nRecall that in \u00a71, we formulate our portfolio management problem as a mean-variance optimization problem (2), which can be formulated as a compositional optimization problem (1). As it satisfies Assumptions 1\u20134 in a bounded domain of optimization, it serves as a good example to validate our theory. For convenience we repeat the display here:\n\n$$\\min_{x \\in \\mathbb{R}^N} \\; -\\frac{1}{T}\\sum_{t=1}^{T} \\langle r_t, x \\rangle + \\frac{1}{T}\\sum_{t=1}^{T} \\left( \\langle r_t, x \\rangle - \\frac{1}{T}\\sum_{s=1}^{T} \\langle r_s, x \\rangle \\right)^2, \\tag{2}$$\n\nwhere $x = \\{x_1, x_2, \\ldots, x_N\\} \\in \\mathbb{R}^N$ denotes the quantities invested in each asset $i = 1, \\ldots, N$.\n\nWhen applying SARAH-Compositional we adopt the online case where we pick $S_1^L, S_2^L, S_3^L$ as the mini-batch sizes once every $q$ steps. Datasets include different portfolio data formed on Size and Operating Profitability.$^6$ We choose to use 6 different 25-portfolio datasets where $N = 25$ and $T = 7240$, same as the ones adopted by Lin et al. (2018). Specifically, we choose $S_1^L = S_2^L = S_3^L = 2000$ (roughly optimized to improve the numerical performance). The results are shown in Figure 1.\n\n$^5$The source code can be found at http://github.com/angeoz/SCGD. Due to space limits, we refer the readers to the full version of this paper for the experiment studies of other applications including reinforcement learning and stochastic neighborhood embedding.\n\n$^6$http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html\n\nWe demonstrate the comparison among our algorithm SARAH-Compositional, SCGD (Wang et al., 2017a), ASC-PG (Wang et al., 2017b) and VRSC-PG (Huo et al., 2018) (serving as a baseline for variance-reduced stochastic compositional optimization methods). 
We plot the objective function value gap and gradient norm against IFO complexity (measured by gradients calculation) for all four algorithms in two covariance settings and six real-world datasets. We observe that SARAH-Compositional outperforms the other three algorithms in these experiments. Our range of stepsize is $\\{1 \\times 10^{-5}, 1 \\times 10^{-4}, 2 \\times 10^{-4}, 5 \\times 10^{-4}, 1 \\times 10^{-3}, 1 \\times 10^{-2}\\}$, and we plot the learning curve for each algorithm corresponding to its individually optimized stepsize. For the SCGD and ASC-PG algorithms, we fix the extrapolation parameter $\\beta$ as 0.9. The $q$-parameters in both the SARAH-Compositional and VRSC-PG algorithms are set as 50.\n\nThe toy experiment provides evidence that our proposed SARAH-Compositional algorithm applied to the risk-averse portfolio management problem achieves state-of-the-art performance. Moreover, we note that due to the small mini-batch sizes, basic SCGD achieves a less satisfactory result, a phenomenon also shown by Huo et al. (2018); Lian et al. (2017).\n\n5 Conclusion\n\nIn this paper, we propose a novel algorithm called SARAH-Compositional for solving stochastic compositional optimization problems using the idea of a recently proposed variance reduced gradient method. Our algorithm achieves both strong theoretical and experimental results. Theoretically, we show that the SARAH-Compositional algorithm can achieve improved convergence rates and IFO complexities for finding an $\\varepsilon$-accurate solution to non-convex compositional problems in both finite-sum and online cases. 
Experimentally, we compare our new compositional optimization method with a few rival algorithms for the task of portfolio management and demonstrate its superior performance. Future directions include handling the non-smooth case and the theory of lower bounds for stochastic compositional optimization. We hope this work can provide new perspectives to both optimization and machine learning communities interested in compositional optimization.\n\nReferences\n\nAgarwal, A. and Bottou, L. (2015). A lower bound for the optimization of finite sums. In International Conference on Machine Learning, pages 78\u201386.\n\nAllen-Zhu, Z. and Hazan, E. (2016). Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699\u2013707.\n\nArjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. (2019). Lower bounds for non-convex stochastic optimization. arXiv preprint arXiv:1912.02365.\n\nDefazio, A., Bach, F., and Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646\u20131654.\n\nDentcheva, D., Penev, S., and Ruszczy\u0144ski, A. (2017). Statistical estimation of composite risk functionals and risk optimization problems. Annals of the Institute of Statistical Mathematics, 69(4):737\u2013760.\n\nFang, C., Li, C. J., Lin, Z., and Zhang, T. (2018). SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 689\u2013699.\n\nGhadimi, S. and Lan, G. (2016). Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59\u201399.\n\nHuo, Z., Gu, B., Liu, J., and Huang, H. (2018). Accelerated method for stochastic composition optimization with nonsmooth regularization. 
In Thirty-Second AAAI Conference on Artificial Intelligence.\n\nLei, L., Ju, C., Chen, J., and Jordan, M. I. (2017). Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345\u20132355.\n\nLi, H. and Lin, Z. (2015). Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems, pages 379\u2013387.\n\nLian, X., Wang, M., and Liu, J. (2017). Finite-sum composition optimization via variance reduced gradient descent. In International Conference on Artificial Intelligence and Statistics, pages 1159\u20131167.\n\nLin, T., Fan, C., Wang, M., and Jordan, M. I. (2018). Improved oracle complexity for stochastic compositional variance reduced gradient. arXiv preprint arXiv:1806.00458.\n\nLiu, L., Liu, J., and Tao, D. (2017). Variance reduced methods for non-convex composition optimization. arXiv preprint arXiv:1711.04416.\n\nNesterov, Y. (2004). Introductory lectures on convex optimization: A basic course, volume 87. Springer.\n\nNguyen, L. M., Liu, J., Scheinberg, K., and Tak\u00e1\u010d, M. (2017). SARAH: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613\u20132621.\n\nNguyen, L. M., van Dijk, M., Phan, D. T., Nguyen, P. H., Weng, T.-W., and Kalagnanam, J. R. (2019). Optimal finite-sum smooth non-convex optimization with SARAH. arXiv preprint arXiv:1901.07648.\n\nReddi, S. J., Hefny, A., Sra, S., Poczos, B., and Smola, A. (2016). Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314\u2013323.\n\nSchmidt, M., Le Roux, N., and Bach, F. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83\u2013112.\n\nShalev-Shwartz, S. and Zhang, T. (2013). 
Stochastic dual coordinate ascent methods for regularized\nloss minimization. Journal of Machine Learning Research, 14(Feb):567\u2013599.\n\nShalev-Shwartz, S. and Zhang, T. (2014). Accelerated proximal stochastic dual coordinate ascent for\nregularized loss minimization. In International Conference on Machine Learning, pages 64\u201372.\n\nShapiro, A., Dentcheva, D., and Ruszczy\u0144ski, A. (2009). Lectures on stochastic programming:\nmodeling and theory. SIAM.\n\nSutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction.\n\nWang, M., Fang, E. X., and Liu, H. (2017a). Stochastic compositional gradient descent: Algorithms\nfor minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419\u2013449.\n\nWang, M., Liu, J., and Fang, E. X. (2017b). Accelerating stochastic composition optimization.\nJournal of Machine Learning Research, 18:1\u201323.\n\nWang, Z., Ji, K., Zhou, Y., Liang, Y., and Tarokh, V. (2019). Spiderboost and momentum: Faster\nvariance reduction algorithms. In Advances in Neural Information Processing Systems, pages\n2403\u20132413.\n\nWoodworth, B. E. and Srebro, N. (2016). Tight complexity bounds for optimizing composite\nobjectives. In Advances in Neural Information Processing Systems, pages 3639\u20133647.\n\nXiao, L. and Zhang, T. (2014). A proximal stochastic gradient method with progressive variance\nreduction. SIAM Journal on Optimization, 24(4):2057\u20132075.\n\nYang, S., Wang, M., and Fang, E. X. (2019). Multilevel stochastic gradient methods for nested\ncomposition optimization. SIAM Journal on Optimization, 29(1):616\u2013659.\n\nZhou, D., Xu, P., and Gu, Q. (2018). Stochastic nested variance reduced gradient descent for\nnonconvex optimization. 
In Advances in Neural Information Processing Systems, pages 3921\u20133932.", "award": [], "sourceid": 3751, "authors": [{"given_name": "Wenqing", "family_name": "Hu", "institution": "Missouri S&T"}, {"given_name": "Chris Junchi", "family_name": "Li", "institution": "Tencent AI Lab"}, {"given_name": "Xiangru", "family_name": "Lian", "institution": "University of Rochester"}, {"given_name": "Ji", "family_name": "Liu", "institution": "University of Rochester, Tencent AI lab"}, {"given_name": "Huizhuo", "family_name": "Yuan", "institution": "Peking University"}]}