{"title": "Accelerating Stochastic Composition Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1714, "page_last": 1722, "abstract": "Consider the stochastic composition optimization problem where the objective is a composition of two expected-value functions. We propose a new stochastic first-order method, namely the accelerated stochastic compositional proximal gradient (ASC-PG) method, which updates based on queries to the sampling oracle using two different timescales. The ASC-PG is the first proximal gradient method for the stochastic composition problem that can deal with nonsmooth regularization penalty. We show that the ASC-PG exhibits faster convergence than the best known algorithms, and that it achieves the optimal sample-error complexity in several important special cases. We further demonstrate the application of ASC-PG to reinforcement learning and conduct numerical experiments.", "full_text": "Accelerating Stochastic Composition Optimization\n\nMengdi Wang*, Ji Liu*, and Ethan X. Fang\n\nPrinceton University, University of Rochester, Pennsylvania State University\nmengdiw@princeton.edu, ji.liu.uwisc@gmail.com, xxf13@psu.edu\n\nAbstract\n\nConsider the stochastic composition optimization problem where the objective is a composition of two expected-value functions. We propose a new stochastic first-order method, namely the accelerated stochastic compositional proximal gradient (ASC-PG) method, which updates based on queries to the sampling oracle using two different timescales. The ASC-PG is the first proximal gradient method for the stochastic composition problem that can deal with nonsmooth regularization penalty. We show that the ASC-PG exhibits faster convergence than the best known algorithms, and that it achieves the optimal sample-error complexity in several important special cases. 
We further demonstrate the application of ASC-PG to reinforcement learning and conduct numerical experiments.\n\n1 Introduction\n\nThe popular stochastic gradient methods are well suited for minimizing expected-value objective functions or the sum of a large number of loss functions. Stochastic gradient methods find wide applications in estimation, online learning, and training of deep neural networks. Despite their popularity, they do not apply to the minimization of a nonlinear function involving expected values or a composition between two expected-value functions.\n\nIn this paper, we consider the stochastic composition problem, given by\n\nmin_{x ∈ ℝ^n} H(x) := E_v(f_v(E_w(g_w(x)))) + R(x),   (1)\n\nwhere f_v and g_w are functions parameterized by the random variables v and w, and R(·) is a convex (possibly nonsmooth) penalty.\n\n1.1 Motivating Examples\n\nIn on-policy reinforcement learning, evaluating a fixed policy amounts to solving the Bellman equation E[A]x = E[b], which can be cast as the Bellman error minimization problem\n\nmin_x || E[A]x − E[b] ||^2,   (2)\n\nwhere A and b are a random matrix and a random vector generated by observed state transitions; here the inner function is the linear map x ↦ E[A]x − E[b] and the outer function is the squared norm. Another example is mean-variance risk-averse learning, min_x E[h(x; a, b)] + λ Var(h(x; a, b)), where λ > 0 is a regularization parameter. Its batch version takes the form\n\nmin_x (1/N) Σ_{i=1}^N h(x; a_i, b_i) + (λ/N) Σ_{i=1}^N ( h(x; a_i, b_i) − (1/N) Σ_{j=1}^N h(x; a_j, b_j) )^2.\n\nHere the variance term is the composition of the mean square function and an expected loss function. Although the stochastic composition problem (1) has barely been studied, it actually finds a broad spectrum of emerging applications in estimation and machine learning (see Wang et al. [2016] for a list of applications). Fast optimization algorithms with theoretical guarantees will lead to new computation tools and online learning methods for a broader problem class, no longer limited to the expectation minimization problem.\n\n1.2 Related Works and Contributions\n\nContrary to the expectation minimization problem, “unbiased” gradient samples are no longer available for the stochastic composition problem (1). The objective is nonlinear in the joint probability distribution of (w, v), which substantially complicates the problem. In a recent work by Dentcheva et al. [2015], a special case of the stochastic composition problem, i.e., risk-averse optimization, has been studied. 
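The absence of unbiased gradient samples can be checked numerically: for a nonlinear outer function, expectation and composition do not commute, i.e., E[f(g_w(x))] ≠ f(E[g_w(x)]). A minimal sketch of this gap (the choices f(y) = y^2 and g_w(x) = x + w are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner randomness: g_w(x) = x + w with E[w] = 0, so g(x) = x.
# Nonlinear outer function: f(y) = y**2.
x = 1.0
w = rng.normal(0.0, 1.0, size=100_000)

mean_of_f = np.mean((x + w) ** 2)    # E[f(g_w(x))], approximately x**2 + Var(w)
f_of_mean = (x + w.mean()) ** 2      # f(E[g_w(x)]), approximately x**2

# The gap equals Var(w), so a gradient computed by plugging a single
# sample of g_w(x) into f is a biased estimate of the gradient of f(g(x)).
print(mean_of_f - f_of_mean)
```

With Var(w) = 1 the gap is close to 1. This bias is precisely why the method below maintains an auxiliary running estimate of the inner expectation rather than feeding a single inner sample into the outer gradient.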
A central limit theorem has been established, showing that the K-sample batch problem converges to the true problem at the rate of O(1/√K) in a proper sense. For the case where R(x) = 0, Wang et al. [2016] proposed and analyzed a class of stochastic compositional gradient/subgradient methods (SCGD). The SCGD involves two iterations of different time scales, one for estimating x* by a stochastic quasi-gradient iteration, the other for maintaining a running estimate of g(x*). Wang and Liu [2016] study the SCGD in the setting where samples are corrupted with Markov noises (instead of i.i.d. zero-mean noises). Both works establish almost sure convergence of the algorithm and several convergence rate results, which were the best-known convergence rates prior to the current paper.\n\nThe idea of using two-timescale quasi-gradients traces back to the earlier work of Ermoliev [1976]. The incremental treatment of proximal gradient iteration has been studied extensively for the expectation minimization problem; see for example Beck and Teboulle [2009], Bertsekas [2011], Ghadimi and Lan [2015], Gurbuzbalaban et al. [2015], Nedić [2011], Nedić and Bertsekas [2001], Nemirovski et al. [2009], Rakhlin et al. [2012], Shamir and Zhang [2013], Wang and Bertsekas [2016], Wang et al. [2015]. However, except for Wang et al. [2016] and Wang and Liu [2016], all of these works focus on variants of the expectation minimization problem and do not apply to the stochastic composition problem (1). Another work partially related to this paper is by Dai et al. [2016]. They consider a special case of problem (1) arising in kernel estimation, where they assume that all functions f_v are convex and their conjugate functions f_v* can be easily obtained/sampled. 
Under these additional assumptions, they essentially rewrite the problem into a saddle point optimization involving functional variables.\n\nIn this paper, we propose a new accelerated stochastic compositional proximal gradient (ASC-PG) method that applies to the penalized problem (1), which is a more general problem than the one considered in Wang et al. [2016]. We use a coupled martingale stochastic analysis to show that ASC-PG achieves significantly better sample-error complexity in various cases. We also show that ASC-PG exhibits optimal sample-error complexity in two important special cases: the case where the outer function is linear and the case where the inner function is linear.\n\nOur contributions are summarized as follows:\n\n1. We propose the first stochastic proximal-gradient method for the stochastic composition problem. This is the first algorithm that is able to address the nonsmooth regularization penalty R(·) without deteriorating the convergence rate.\n\n2. We obtain a convergence rate O(K^{−4/9}) for smooth optimization problems that are not necessarily convex, where K is the number of queries to the stochastic first-order oracle. This improves the best known convergence rate and provides a new benchmark for the stochastic composition problem.\n\n3. We provide a comprehensive analysis and results that apply to various special cases. In particular, our results contain as special cases the known optimal rate results for the expectation minimization problem, i.e., O(1/√K) for general objectives and O(1/K) for strongly convex objectives.\n\n4. In the special case where the inner function g(·) is a linear mapping, we show that it is sufficient to use one timescale to guarantee convergence. Our result achieves the non-improvable rate of convergence O(1/K) for optimally strongly convex optimization and O(1/√K) for nonconvex smooth optimization. 
It implies that the inner linearity does not bring fundamental difficulty to the stochastic composition problem.\n\n5. We show that the proposed method leads to a new on-policy reinforcement learning algorithm. The new learning algorithm achieves the optimal convergence rate O(1/√K) for solving Bellman equations (or O(1/K) for solving the least-squares problem) based on K observations of state-to-state transitions.\n\nIn comparison with Wang et al. [2016], our analysis is more succinct and leads to stronger results. To the best of our knowledge, Theorems 1 and 2 in this paper provide the best-known rates for the stochastic composition problem.\n\nPaper Organization. Section 2 states the sampling oracle and the accelerated stochastic compositional proximal gradient algorithm (ASC-PG). Section 3 states the convergence rate results in the case of general nonconvex objective and in the case of strongly convex objective, respectively. Section 4 describes an application of ASC-PG to reinforcement learning and gives numerical experiments.\n\nNotations and Definitions. For x ∈ ℝ^n, ||x|| denotes its Euclidean norm. For two sequences {y_k} and {z_k}, we write y_k = O(z_k) if there exists c > 0 such that ||y_k|| ≤ c||z_k|| for each k. We denote by I^{value}_{condition} the indicator function, which returns “value” if the “condition” is satisfied, and 0 otherwise. We denote by H* the optimal objective function value of problem (1), by X* the set of optimal solutions, and by P_S(x) the Euclidean projection of x onto S for any convex set S. We also denote for short f(y) = E_v[f_v(y)] and g(x) = E_w[g_w(x)].\n\n2 Algorithm\n\nWe focus on the black-box sampling environment. Suppose that we have access to a stochastic first-order oracle, which returns random realizations of first-order information upon queries. This is a typical simulation oracle that is available in both online and batch learning. 
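Such a black-box oracle can be pictured as an object with two sampling routines: one queried at a point x for the inner function, one queried at a point y for the outer gradient. A minimal sketch under assumed toy dynamics (g_w(x) = x + w and f_v(y) = ||y − v||^2 / 2; the class name, noise model, and all concrete choices are illustrative, not part of the paper):

```python
import numpy as np

class SamplingOracle:
    """Toy stochastic first-order oracle: g_w(x) = x + w and f_v(y) = ||y - v||^2 / 2,
    so that g(x) = x and f(y) = ||y - mu_v||^2 / 2 (up to a constant) in expectation."""

    def __init__(self, mu_v, noise=0.1, seed=0):
        self.mu_v = np.asarray(mu_v, dtype=float)
        self.noise = noise
        self.rng = np.random.default_rng(seed)

    def sample_inner(self, x):
        """Return a sample value g_w(x) and a sample Jacobian of g_w at x."""
        w = self.rng.normal(0.0, self.noise, size=len(x))
        return x + w, np.eye(len(x))

    def sample_outer_grad(self, y):
        """Return a sample gradient of f_v at y, with v drawn around mu_v."""
        v = self.mu_v + self.rng.normal(0.0, self.noise, size=len(y))
        return y - v
```

An optimization routine that talks only to such an interface is oblivious to how the samples are produced, which is exactly what the black-box sampling setting formalizes.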
More specifically, assume that we are given a Sampling Oracle (SO) such that\n\n• Given some x ∈ ℝ^n, the SO returns a sample value g_w(x) and a sample Jacobian ∇g_w(x).\n• Given some y ∈ ℝ^m, the SO returns a sample gradient ∇f_v(y).\n\nThe proposed method is stated in Algorithm 1.\n\nAlgorithm 1 Accelerated Stochastic Compositional Proximal Gradient (ASC-PG)\nRequire: x_1 ∈ ℝ^n, y_0 ∈ ℝ^m, stepsizes {α_k} and {β_k}, the SO, and the number of iterations K.\n1: for k = 1 to K do\n2: Query the SO and obtain the sample Jacobian ∇g_{w_k}(x_k).\n3: Query the SO and obtain the sample gradient ∇f_{v_k}(y_k).\n4: Update the main iterate by\n\nx_{k+1} = prox_{α_k R(·)} ( x_k − α_k ∇g_{w_k}^T(x_k) ∇f_{v_k}(y_k) ).\n\n5: Update auxiliary iterates by an extrapolation-smoothing scheme:\n\nz_{k+1} = (1 − 1/β_k) x_k + (1/β_k) x_{k+1},\ny_{k+1} = (1 − β_k) y_k + β_k g_{w_{k+1}}(z_{k+1}),\n\nwhere the sample g_{w_{k+1}}(z_{k+1}) is obtained via querying the SO.\n6: end for\n\nIn Algorithm 1, y_k serves as a running estimate of g(x_k) := E_w[g_w(x_k)]. At iteration k, the running estimate y_k of g(x_k) is obtained using a weighted smoothing scheme, corresponding to the y-step; while the new query point z_{k+1} is obtained through extrapolation, corresponding to the z-step. The updates are constructed in a way such that y_k is a nearly unbiased estimate of g(x_k). To see how the extrapolation-smoothing scheme works, we let the weights be\n\nξ_t^{(k)} = β_t ∏_{i=t+1}^k (1 − β_i) if k > t ≥ 0, and ξ_t^{(k)} = β_k if k = t ≥ 0.   (3)\n\nThen we can verify the following important relations:\n\nx_{k+1} = Σ_{t=0}^k ξ_t^{(k)} z_{t+1},   y_{k+1} = Σ_{t=0}^k ξ_t^{(k)} g_{w_{t+1}}(z_{t+1}),\n\nwhich essentially say that x_{k+1} is a damped weighted average of {z_{t+1}}_{t=0}^k and y_{k+1} is a damped weighted average of {g_{w_{t+1}}(z_{t+1})}_{t=0}^k.\n\nAn Analytical Example of the Extrapolation-Smoothing Scheme. Now consider the special case where g_w(·) is always a linear mapping g_w(z) = A_w z + b_w and β_k = 1/(k + 1). We can verify that ξ_t^{(k)} = 1/(k + 1) for all t. Then we have\n\ng(x_{k+1}) = (1/(k + 1)) Σ_{t=0}^k E[A_w] z_{t+1} + E[b_w],   y_{k+1} = (1/(k + 1)) Σ_{t=0}^k A_{w_{t+1}} z_{t+1} + (1/(k + 1)) Σ_{t=0}^k b_{w_{t+1}}.\n\nIn this way, we can see that the scaled error\n\n(k + 1)(y_{k+1} − g(x_{k+1})) = Σ_{t=0}^k (A_{w_{t+1}} − E[A_w]) z_{t+1} + Σ_{t=0}^k (b_{w_{t+1}} − E[b_w])\n\nis a zero-mean and zero-drift martingale. Under additional technical assumptions, we have\n\nE[||y_{k+1} − g(x_{k+1})||^2] ≤ O(1/k).\n\nNote that the zero-drift property of the error martingale is the key to the fast convergence rate. 
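The x-step, the extrapolation z-step, and the smoothing y-step can be sketched end-to-end on a toy instance. Below, g_w(x) = x + w and f_v(y) = ||y − v||^2 / 2 with E[v] = mu, so the objective is ||x − mu||^2 / 2 up to a constant and its minimizer is mu; the instance, the noise level, and the step-size constants are all assumptions for illustration, and R = 0 so the proximal step reduces to the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])           # E[v]; the minimizer on this toy instance

def sample_inner(x):                  # g_w(x) = x + w; the Jacobian of g_w is the identity
    return x + rng.normal(0.0, 0.1, size=2), np.eye(2)

def sample_outer_grad(y):             # gradient sample of f_v(y) = ||y - v||^2 / 2
    return y - (mu + rng.normal(0.0, 0.1, size=2))

x = np.zeros(2)
y, _ = sample_inner(x)                # initialize the running estimate of g(x)
for k in range(1, 2001):
    alpha = 1.0 / k                   # alpha_k ~ k**-1 (illustrative constants)
    beta = min(1.0, 2.0 / k)          # beta_k ~ 2 * k**-1, clipped into (0, 1]
    _, g_jac = sample_inner(x)        # only the Jacobian sample feeds the x-step here
    x_new = x - alpha * g_jac.T @ sample_outer_grad(y)   # x-step (prox is identity, R = 0)
    z = (1.0 - 1.0 / beta) * x + (1.0 / beta) * x_new    # z-step: extrapolation
    g_z, _ = sample_inner(z)
    y = (1.0 - beta) * y + beta * g_z                    # y-step: smoothing
    x = x_new
print(x)
```

The iterates should approach mu; handling a nonsmooth penalty R would only require replacing the x-step with the corresponding proximal operator.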
The zero-drift property comes from the near-unbiasedness of y_k, which is due to the special construction of the extrapolation-smoothing scheme. In the more general case where g_w is not necessarily linear, we can use a similar argument to show that y_k is a nearly unbiased estimate of g(x_k). As a result, the extrapolation-smoothing (y, z)-step ensures that y_k tracks the unknown quantity g(x_k) efficiently.\n\n3 Main Results\n\nWe present our main theoretical results in this section. Let us begin by stating our assumptions. Note that all assumptions involving random realizations of v, w hold with probability 1.\n\nAssumption 1. The samples generated by the SO are unbiased in the following sense:\n\n1. E_{w_k, v_k}(∇g_{w_k}^T(x) ∇f_{v_k}(y)) = ∇g^T(x) ∇f(y) ∀k = 1, 2, ..., K, ∀x, ∀y.\n2. E_{w_k}(g_{w_k}(x)) = g(x) ∀x.\n\nNote that w_k and v_k are not necessarily independent.\n\nAssumption 2. The sample gradients and values generated by the SO satisfy\n\nE_w(||g_w(x) − g(x)||^2) ≤ σ^2 ∀x.\n\nAssumption 3. The sample gradients generated by the SO are uniformly bounded, and the penalty function R has bounded gradients:\n\n||∇g_w(x)|| ≤ Θ(1), ||∇f_v(x)|| ≤ Θ(1), ||∂R(x)|| ≤ Θ(1) ∀x, ∀w, ∀v.\n\nAssumption 4. There exist L_F, L_f, L_g > 0 such that\n\n1. F(z) − F(x) ≤ ⟨∇F(x), z − x⟩ + (L_F/2)||z − x||^2 ∀x, ∀z.\n2. ||∇f_v(y) − ∇f_v(w)|| ≤ L_f ||y − w|| ∀y, ∀w, ∀v.\n3. ||g(x) − g(z) − ∇g(z)^T(x − z)|| ≤ (L_g/2)||x − z||^2 ∀x, ∀z.\n\nOur first main result concerns general optimization problems which are not necessarily convex.\n\nTheorem 1 (Smooth (Nonconvex) Optimization). Let Assumptions 1, 2, 3, and 4 hold. Denote by F(x) := (E_v(f_v) ∘ E_w(g_w))(x) for short and suppose that R(x) = 0 in (1) and E(F(x_k)) is bounded from above. Choose α_k = k^{−a} and β_k = 2k^{−b} where a ∈ (0, 1) and b ∈ (0, 1) in Algorithm 1. 
Then we have\n\n(1/K) Σ_{k=1}^K E(||∇F(x_k)||^2) ≤ O( K^{a−1} + L_f^2 L_g K^{4b−4a} I^{log K}_{4a−4b=1} + L_f^2 K^{−b} + K^{−a} ).   (4)\n\nIf L_g ≠ 0 and L_f ≠ 0, choose a = 5/9 and b = 4/9, yielding\n\n(1/K) Σ_{k=1}^K E(||∇F(x_k)||^2) ≤ O(K^{−4/9}).   (5)\n\nIf L_g = 0 or L_f = 0, choose a = b = 1/2, yielding\n\n(1/K) Σ_{k=1}^K E(||∇F(x_k)||^2) ≤ O(K^{−1/2}).   (6)\n\nThe result of Theorem 1 strictly improves the best-known results given by Wang et al. [2016]. First, the result of (5) improves the finite-sample error bound from O(K^{−2/7}) to O(K^{−4/9}) for general convex and nonconvex optimization. This improves the best known convergence rate and provides a new benchmark for the stochastic composition problem. Note that it is possible to relax the condition “E(F(x_k)) is bounded from above” in Theorem 1. However, it would make the analysis more cumbersome and yield an additional log K term in the error bound.\n\nOur second main result concerns strongly convex objective functions. We say that the objective function H is optimally strongly convex with parameter λ > 0 if\n\nH(x) − H(P_{X*}(x)) ≥ λ ||x − P_{X*}(x)||^2 ∀x   (7)\n\n(see Liu and Wright [2015]). Note that any strongly convex function is optimally strongly convex, but the reverse does not hold. For example, the objective function (2) in on-policy reinforcement learning is always optimally strongly convex (even if E(A) is a rank-deficient matrix), but not necessarily strongly convex.\n\nTheorem 2 (Strongly Convex Optimization). Suppose that the objective function H(x) in (1) is optimally strongly convex with parameter λ > 0 defined in (7). Set α_k = C_a k^{−a} and β_k = C_b k^{−b}, where C_a > 4, C_b > 2, a ∈ (0, 1], and b ∈ (0, 1] in Algorithm 1. 
Under Assumptions 1, 2, 3, and 4, we have\n\nE(||x_K − P_{X*}(x_K)||^2) ≤ O( K^{−a} + L_f^2 L_g K^{−4a+4b} + L_f^2 K^{−b} ).   (8)\n\nIf L_g ≠ 0 and L_f ≠ 0, choose a = 1 and b = 4/5, yielding\n\nE(||x_K − P_{X*}(x_K)||^2) ≤ O(K^{−4/5}).   (9)\n\nIf L_g = 0 or L_f = 0, choose a = 1 and b = 1, yielding\n\nE(||x_K − P_{X*}(x_K)||^2) ≤ O(K^{−1}).   (10)\n\nLet us discuss the results of Theorem 2. In the general case where L_f ≠ 0 and L_g ≠ 0, the convergence rate in (9) is consistent with the result of Wang et al. [2016]. Now consider the special case where L_g = 0, i.e., the inner mapping is linear. This result finds an immediate application to the Bellman error minimization problem (2), which arises from the reinforcement learning problem (and with ℓ1-norm regularization). The proposed ASC-PG algorithm is able to achieve the optimal rate O(1/K) without any extra assumption on the outer function f_v. To the best of our knowledge, this is the best (also optimal) sample-error complexity for on-policy reinforcement learning.\n\nRemarks. Theorems 1 and 2 give important implications about the special cases where L_f = 0 or L_g = 0. In these cases, we argue that our convergence rate (10) is “optimal” with respect to the sample size K. To see this, it is worth pointing out that the O(1/K) rate of convergence is optimal for the strongly convex expectation minimization problem. Because the expectation minimization problem is a special case of problem (1), the O(1/K) convergence rate must be optimal for the stochastic composition problem too.\n\n• Consider the case where L_f = 0, which means that the outer function f_v(·) is linear with probability 1. Then the stochastic composition problem (1) reduces to an expectation minimization problem, since (E_v f_v ∘ E_w g_w)(x) = E_v(f_v(E_w(g_w(x)))) = E_v E_w (f_v ∘ g_w)(x). 
Therefore, it makes perfect sense to obtain the optimal convergence rate.\n\n• Consider the case where L_g = 0, which means that the inner function g(·) is a linear mapping. The result is quite surprising. Note that even if g(·) is a linear mapping, this does not reduce problem (1) to an expectation minimization problem. However, the ASC-PG still achieves the optimal convergence rate. This suggests that, when inner linearity holds, the stochastic composition problem (1) is not fundamentally more difficult than the expectation minimization problem.\n\nThe convergence rate results unveiled in Theorems 1 and 2 are the best known results for the composition problem. We believe that they provide important new results and insights into the complexity of the stochastic composition problem.\n\n4 Application to Reinforcement Learning\n\nIn this section, we apply the proposed ASC-PG algorithm to conduct policy value evaluation in reinforcement learning through attacking Bellman equations. Suppose that there are in total S states. Let the policy of interest be π. Denote the value function of states by V^π ∈