{"title": "Stochastic Variance Reduced Primal Dual Algorithms for Empirical Composition Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 9882, "page_last": 9892, "abstract": "We consider a generic empirical composition optimization problem, where there are empirical averages present both outside and inside nonlinear loss functions. Such a problem is of interest in various machine learning applications, and cannot be directly solved by standard methods such as stochastic gradient descent (SGD). We take a novel approach to solving this problem by reformulating the original minimization objective into an equivalent min-max objective, which brings out all the empirical averages that are originally inside the nonlinear loss functions. We exploit the rich structures of the reformulated problem and develop a stochastic primal-dual algorithms, SVRPDA-I, to solve the problem efficiently. We carry out extensive theoretical analysis of the proposed algorithm, obtaining the convergence rate, the total computation complexity and the storage complexity. In particular, the algorithm is shown to converge at a linear rate when the problem is strongly convex. Moreover, we also develop an approximate version of the algorithm, named SVRPDA-II, which further reduces the memory requirement. Finally, we evaluate the performance of our algorithms on several real-world benchmarks and experimental results show that they significantly outperform existing techniques.", "full_text": "Stochastic Variance Reduced Primal Dual Algorithms\n\nfor Empirical Composition Optimization\n\nAdithya M. Devraj\u2217 and\n\nJianshu Chen\u2020\n\nAbstract\n\nWe consider a generic empirical composition optimization problem, where there are\nempirical averages present both outside and inside nonlinear loss functions. 
Such a problem is of interest in various machine learning applications, and cannot be directly solved by standard methods such as stochastic gradient descent. We take a novel approach to solving this problem by reformulating the original minimization objective into an equivalent min-max objective, which brings out all the empirical averages that are originally inside the nonlinear loss functions. We exploit the rich structures of the reformulated problem and develop a stochastic primal-dual algorithm, SVRPDA-I, to solve the problem efficiently. We carry out extensive theoretical analysis of the proposed algorithm, obtaining the convergence rate, the computation complexity and the storage complexity. In particular, the algorithm is shown to converge at a linear rate when the problem is strongly convex. Moreover, we also develop an approximate version of the algorithm, named SVRPDA-II, which further reduces the memory requirement. Finally, we evaluate our proposed algorithms on several real-world benchmarks, and experimental results show that the proposed algorithms significantly outperform existing techniques.

1 Introduction

In this paper, we consider the following regularized empirical composition optimization problem:

    \min_{\theta} \; \frac{1}{n_X} \sum_{i=0}^{n_X-1} \phi_i\Big( \frac{1}{n_{Y_i}} \sum_{j=0}^{n_{Y_i}-1} f_\theta(x_i, y_{ij}) \Big) + g(\theta),    (1)

where (x_i, y_{ij}) ∈ R^{m_x} × R^{m_y} is the (i, j)-th data sample, f_θ : R^{m_x} × R^{m_y} → R^ℓ is a function parameterized by θ ∈ R^d, φ_i : R^ℓ → R_+ is a convex merit function, which measures a certain loss of the parametric function f_θ, and g(θ) is a μ-strongly convex regularization term.

Problems of the form (1) appear widely in machine learning applications such as reinforcement learning [5, 3, 2, 13], unsupervised sequence classification [12, 21] and risk-averse learning [15,
18, 9, 10, 19] — see our detailed discussion in Section 2. Note that the cost function (1) has an empirical average (over x_i) outside the (nonlinear) merit function φ_i(·) and an empirical average (over y_ij) inside the merit function, which makes it different from the empirical risk minimization problems that are common in machine learning [17]. Problem (1) can be understood as a generalized version of the one considered in [9, 10].³ In these prior works, y_ij and n_{Y_i} are assumed to be independent of i, and f_θ is only a function of y_j, so that problem (1) reduces to the following special case:

    \min_{\theta} \; \frac{1}{n_X} \sum_{i=0}^{n_X-1} \phi_i\Big( \frac{1}{n_Y} \sum_{j=0}^{n_Y-1} f_\theta(y_j) \Big).    (2)

* Department of Electrical and Computer Engineering, University of Florida, Gainesville, USA. Email: adithyamdevraj@ufl.edu. The work was done during an internship at Tencent AI Lab, Bellevue, WA.
† Tencent AI Lab, Bellevue, WA, USA. Email: jianshuchen@tencent.com.
³ In addition to the term in (2), the cost function in [10] also has another convex regularization term.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our more general problem formulation (1) encompasses wider applications (see Section 2). Furthermore, different from [2, 19, 18], we focus on the finite-sample setting, where we have empirical averages (instead of expectations) in (1). As we shall see below, the finite-sum structure allows us to develop efficient stochastic gradient methods that converge at a linear rate.

While problem (1) is important in many machine learning applications, there are several key challenges in solving it efficiently. First, the number of samples (i.e., n_X and n_{Y_i}) could be extremely large: they could be larger than one million or even one billion.
Therefore, it is unrealistic to use a batch gradient descent algorithm to solve the problem, since it requires going over all the data samples at each gradient update step. Moreover, since there is an empirical average inside the nonlinear merit function φ_i(·), it is not possible to directly apply the classical stochastic gradient descent (SGD) algorithm: sampling from the empirical averages outside and inside φ_i(·) simultaneously makes the stochastic gradients intrinsically biased (see Appendix A for a discussion).

To address these challenges, in this paper, we first reformulate the original problem (1) into an equivalent saddle point problem (i.e., min-max problem), which brings out all the empirical averages inside φ_i(·) and exhibits useful dual decomposition and finite-sum structures (Section 3.1). To fully exploit these properties, we develop a stochastic primal-dual algorithm that alternates between a dual step of stochastic variance reduced coordinate ascent and a primal step of stochastic variance reduced gradient descent (Section 3.2). In particular, we develop a novel variance reduced stochastic gradient estimator for the primal step, which achieves better variance reduction with low complexity (Section 3.3). We derive the convergence rate, the finite-time complexity bound, and the storage complexity of our proposed algorithm (Section 4). In particular, it is shown that the proposed algorithms converge at a linear rate when the problem is strongly convex. Moreover, we also develop an approximate version of the algorithm that further reduces the storage complexity without much performance degradation in experiments.
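To see concretely why plugging a single sample into both the inner empirical average and the outer nonlinearity biases the gradient, consider a toy scalar instance of (1) with n_X = 1, φ(u) = u² and f_θ(y) = θy. This is an illustrative sketch with invented data, not the paper's code:

```python
import numpy as np

# Toy objective: F(theta) = ( (1/n) * sum_j theta * y_j )^2, i.e. phi(u) = u^2
# applied to the inner empirical average of f_theta(y_j) = theta * y_j.
rng = np.random.default_rng(0)
y = rng.normal(size=10)
theta = 1.5

y_bar = y.mean()
true_grad = 2.0 * (theta * y_bar) * y_bar  # exact dF/dtheta

# "Naive SGD": substitute one sample y_j for the inner average, then differentiate.
# Averaging this estimator over all j gives its expectation, 2*theta*mean(y_j^2),
# which differs from the true gradient by 2*theta*Var(y): the estimator is biased.
naive_expected = np.mean([2.0 * (theta * yj) * yj for yj in y])

print(true_grad, naive_expected)  # the two disagree whenever the y_j's are not all equal
```

The gap, 2θ·Var(y), does not shrink with the step size, which is why the saddle-point reformulation of Section 3.1 is used instead of naive SGD.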
We evaluate the performance of our algorithms on several real-world benchmarks, where the experimental results show that they significantly outperform existing methods (Section 5). Finally, we discuss related works in Section 6 and conclude our paper in Section 7.

2 Motivation and Applications

To motivate our composition optimization problem (1), we discuss several important machine learning applications where cost functions of the form (1) arise naturally.

Unsupervised sequence classification: Developing algorithms that can learn classifiers from unlabeled data could benefit many machine learning systems and save a huge amount of human labeling cost. In [12, 21], the authors proposed such unsupervised learning algorithms by exploiting the sequential output structures. The developed algorithms are applied to optical character recognition (OCR) problems and automatic speech recognition (ASR) problems. In these works, the learning algorithms seek to learn a sequence classifier by optimizing the empirical output distribution match (Empirical-ODM) cost, which is of the following form (written in our notation):

    \min_{\theta} \; \Big\{ -\sum_{i=0}^{n_X-1} p_{LM}(x_i) \log\Big( \frac{1}{n_Y} \sum_{j=0}^{n_Y-1} f_\theta(x_i, y_j) \Big) \Big\},    (3)

where p_LM is a known language model (LM) that describes the distribution of the output sequence (e.g., x_i represents different n-grams), and f_θ is a functional of the sequence classifier to be learned, with θ being its model parameter vector. The key idea is to learn the classifier so that its predicted output n-gram distribution is close to the prior n-gram distribution p_LM (see [12, 21] for more details). The cost function (3) can be viewed as a special case of (1) by setting n_{Y_i} = n_Y, y_ij = y_j and φ_i(u) = −p_LM(x_i) log(u).
Note that the formulation (2) cannot be directly used here, because the function f_θ depends on both x_i and y_j.

Risk-averse learning: Another application where (1) arises naturally is the risk-averse learning problem, which is common in finance [15, 18, 9, 10, 19, 20]. Let x_i ∈ R^d be a vector consisting of the rewards from d assets at the i-th instance, where 0 ≤ i ≤ n − 1. The objective in risk-averse learning is to find the optimal weights of the d assets so that the average return is maximized while the risk is minimized. It can be formulated as the following optimization problem:

    \min_{\theta} \; -\frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, \theta \rangle + \frac{1}{n} \sum_{i=0}^{n-1} \Big( \langle x_i, \theta \rangle - \frac{1}{n} \sum_{j=0}^{n-1} \langle x_j, \theta \rangle \Big)^2,    (4)

where θ ∈ R^d denotes the weight vector. The objective function in (4) seeks a tradeoff between the mean (the first term) and the variance (the second term). It can be understood as a special case of (2) (which is a further special case of (1)) by making the following identifications:

    n_X = n_Y = n,  y_i ≡ x_i,  f_θ(y_j) = [θ^T, −⟨y_j, θ⟩]^T,  φ_i(u) = (⟨x_i, u_{0:d−1}⟩ + u_d)² − ⟨x_i, u_{0:d−1}⟩,    (5)

where u_{0:d−1} denotes the subvector constructed from the first d elements of u, and u_d denotes the d-th element. An alternative yet simpler way of dealing with (4) is to treat the second term in (4) as a special case of (1) by setting

    n_X = n_{Y_i} = n,  y_ij ≡ x_j,  f_θ(x_i, y_ij) = ⟨x_i − y_ij, θ⟩,  φ_i(u) = u²,  u ∈ R.    (6)

In addition, we observe that the first term in (4) is in standard empirical risk minimization form, which can be dealt with in a straightforward manner.
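The identification (6) can be checked numerically. The sketch below (synthetic data, not the paper's code) evaluates the variance term of (4) both directly and through the composition form with φ_i(u) = u² and f_θ(x_i, y_ij) = ⟨x_i − y_ij, θ⟩:

```python
import numpy as np

# Variance term of the risk-averse objective (4), computed two ways.
rng = np.random.default_rng(1)
n, d = 50, 5
x = rng.normal(size=(n, d))   # reward vectors x_i
theta = rng.normal(size=d)    # candidate asset weights

r = x @ theta                            # returns <x_i, theta>
direct = np.mean((r - r.mean()) ** 2)    # second term of (4), written directly

# Composition form (1) under the identification (6):
# inner average (1/n) * sum_j <x_i - x_j, theta>, then phi_i(u) = u^2, then the outer mean.
inner = np.array([np.mean((x[i] - x) @ theta) for i in range(n)])
composed = np.mean(inner ** 2)

assert np.isclose(direct, composed)  # the two forms agree
```

The inner average equals ⟨x_i, θ⟩ minus the mean return, so squaring and averaging recovers exactly the variance term; this is what makes (6) a valid (and lower-dimensional, ℓ = 1) rewriting.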
This second formulation leads to algorithms with lower complexity due to the lower dimension of the functions: ℓ = 1 instead of ℓ = d + 1 in the first formulation. Therefore, we adopt this formulation in our experiment section (Section 5).

Other applications: Cost functions of the form (1) also appear in reinforcement learning [5, 2, 3] and other applications [18]. In Appendix D, we demonstrate its application to policy evaluation.

3 Algorithms

3.1 Saddle point formulation

Recall from (1) that there is an empirical average inside each (nonlinear) merit function φ_i(·), which prevents the direct application of stochastic gradient descent to (1) due to the inherent bias (see Appendix A for more discussion). Nevertheless, we will show that minimizing the original cost function (1) can be transformed into an equivalent saddle point problem, which brings out all the empirical averages inside φ_i(·). In what follows, we use the machinery of convex conjugate functions [14]. For a function ψ : R^ℓ → R, its convex conjugate function ψ* : R^ℓ → R is defined as ψ*(y) = sup_{x∈R^ℓ} (⟨x, y⟩ − ψ(x)). Under certain mild conditions on ψ(x) [14], one can also express ψ(x) as a functional of its conjugate function: ψ(x) = sup_{y∈R^ℓ} (⟨x, y⟩ − ψ*(y)). Let φ*_i(w_i) denote the conjugate function of φ_i(u). Then, we can express φ_i(u) as

    φ_i(u) = sup_{w_i ∈ R^ℓ} ( ⟨u, w_i⟩ − φ*_i(w_i) ),    (7)

where w_i is the corresponding dual variable.
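As a concrete instance of (7): for the scalar merit function φ(u) = u² used in (6), the conjugate is φ*(w) = w²/4, and the supremum in (7) is attained at w = 2u. A small numerical check (illustrative sketch, not from the paper):

```python
import numpy as np

# phi(u) = u^2 has convex conjugate phi*(w) = sup_u (u*w - u^2) = w^2 / 4.
# Equation (7) then recovers phi(u) = sup_w (u*w - w^2/4), attained at w = 2u.
u = 1.7
w_grid = np.linspace(-10.0, 10.0, 200001)            # dense grid standing in for sup_w
values = u * w_grid - w_grid ** 2 / 4.0
recovered = np.max(values)

assert abs(recovered - u ** 2) < 1e-6                # matches phi(u) = u^2
assert abs(w_grid[np.argmax(values)] - 2 * u) < 1e-3 # maximizing dual variable is w = 2u
```

When φ*_i has such a simple closed form, the dual update (13) below Section 3.2 can be solved in closed form as well, which is exactly what the algorithm assumes.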
Substituting (7) into the original minimization problem (1), we obtain its equivalent min-max problem:

    \min_{\theta} \max_{w} \Big\{ L(\theta, w) + g(\theta) \triangleq \frac{1}{n_X} \sum_{i=0}^{n_X-1} \Big[ \Big\langle \frac{1}{n_{Y_i}} \sum_{j=0}^{n_{Y_i}-1} f_\theta(x_i, y_{ij}), \, w_i \Big\rangle - \phi^*_i(w_i) \Big] + g(\theta) \Big\},    (8)

where w ≜ {w_0, ..., w_{n_X−1}} is the collection of all dual variables. We note that the transformation of the original problem (1) into (8) brings out all the empirical averages that were inside φ_i(·). This new formulation allows us to develop the stochastic variance reduced algorithms below.

3.2 Stochastic variance reduced primal-dual algorithm

One common solution for the min-max problem (8) is to alternate between the step of minimization (with respect to the primal variable θ) and the step of maximization (with respect to the dual variable w). However, such an approach generally suffers from high computation complexity, because each minimization/maximization step requires a summation over many components and hence a full pass over all the data samples. The complexity of such a batch algorithm would be prohibitively high when the number of data samples (i.e., n_X and n_{Y_i}) is large (e.g., larger than one million or even one billion in applications like unsupervised speech recognition [21]). On the other hand, problem (8) has rich structures that we can exploit to develop more efficient solutions. To this end, we make the following observations. First, expression (8) implies that when θ is fixed, the maximization over the dual variable w decouples into n_X individual maximizations over the different w_i's. Second, the objective function in each individual maximization (with respect to w_i) contains a finite-sum structure over j.
Third, by (8), for a fixed w, the minimization with respect to the primal variable θ is also performed over an objective function with a finite-sum structure. Based on these observations, we develop an efficient stochastic variance reduced primal-dual algorithm (named SVRPDA-I). It alternates between (i) a dual step of stochastic variance reduced coordinate ascent and (ii) a primal step of stochastic variance reduced gradient descent. The full algorithm is summarized in Algorithm 1, with its key ideas explained below.

Dual step: stochastic variance reduced coordinate ascent. To exploit the decoupled dual maximization over w in (8), we can randomly sample an index i and update w_i according to:

    w_i^{(k)} = \arg\min_{w_i} \Big\{ -\Big\langle \frac{1}{n_{Y_i}} \sum_{j=0}^{n_{Y_i}-1} f_{\theta^{(k-1)}}(x_i, y_{ij}), \, w_i \Big\rangle + \phi^*_i(w_i) + \frac{1}{2\alpha_w} \|w_i - w_i^{(k-1)}\|^2 \Big\},    (9)

while keeping all other w_j's (j ≠ i) unchanged, where α_w denotes a step-size. Note that each step of recursion (9) still requires a summation over n_{Y_i} components. To further reduce the complexity, we approximate the sum over j by a variance reduced stochastic estimator defined in (12) (to be discussed in Section 3.3). The dual step in our algorithm is summarized in (13), where we assume that the function φ*_i(w_i) is in a simple form so that the argmin can be solved in closed form. Note that we flip the sign of the objective function to change maximization into minimization and apply coordinate descent; we will still refer to the dual step as "coordinate ascent" (instead of descent).

Primal step: stochastic variance reduced gradient descent. We now consider the minimization in (8) with respect to θ when w is fixed. The gradient descent step for minimizing L(θ, w) is given by

    \theta^{(k)} = \arg\min_{\theta} \Big\{ \Big\langle \frac{1}{n_X} \sum_{i=0}^{n_X-1} \frac{1}{n_{Y_i}} \sum_{j=0}^{n_{Y_i}-1} f'_{\theta^{(k-1)}}(x_i, y_{ij}) \, w_i^{(k)}, \, \theta \Big\rangle + \frac{1}{2\alpha_\theta} \|\theta - \theta^{(k-1)}\|^2 \Big\},    (10)

where α_θ denotes a step-size. The update equation (10) has high complexity: it requires evaluating and averaging the gradient f'_θ(·,·) at every data sample. To reduce the complexity, we use a variance reduced gradient estimator, defined in (15), to approximate the sums in (10) (to be discussed in Section 3.3). The primal step in our algorithm is summarized in (16) in Algorithm 1.

3.3 Low-complexity stochastic variance reduced estimators

We now explain the design of the variance reduced gradient estimators in both the dual and the primal updates. The main idea is inspired by the stochastic variance reduced gradient (SVRG) algorithm [7]. Specifically, for a vector-valued function h(θ) = (1/n) Σ_{i=0}^{n−1} h_i(θ), we can construct its SVRG estimator δ_k at each iteration step k by using the following expression:

    δ_k = h_{i_k}(θ) − h_{i_k}(θ̃) + h(θ̃),    (17)

where i_k is a randomly sampled index from {0, ..., n − 1}, and θ̃ is a reference variable that is updated periodically (to be explained below). The first term h_{i_k}(θ) in (17) is an unbiased estimator of h(θ) and is generally known as the stochastic gradient when h(θ) is the gradient of a certain cost function.
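To preview how the estimator (17) behaves, consider a toy finite sum (hypothetical scalar example, not the paper's code): both the plain stochastic term and the SVRG estimator are unbiased for h(θ), but the latter's variance shrinks as the reference θ̃ approaches θ:

```python
import numpy as np

# Finite sum h(theta) = (1/n) * sum_i h_i(theta), with toy components h_i(theta) = a_i * theta.
rng = np.random.default_rng(2)
a = rng.normal(size=100)
theta, theta_ref = 0.50, 0.48        # theta_ref plays the role of the delayed reference

h_bar = lambda t: np.mean(a * t)     # batch quantity, recomputed only when the reference moves

plain = a * theta                                      # h_i(theta), listed for every index i
svrg = a * theta - a * theta_ref + h_bar(theta_ref)    # estimator (17), for every index i

assert np.isclose(plain.mean(), h_bar(theta))   # plain stochastic term: unbiased
assert np.isclose(svrg.mean(), h_bar(theta))    # SVRG estimator: still unbiased
assert svrg.var() < 0.01 * plain.var()          # but with far smaller variance here
```

For this linear toy case the SVRG estimator's spread scales with |θ − θ̃| rather than with |θ|, at the price of periodically recomputing the batch term h(θ̃).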
The last two terms in (17) construct a control variate that has zero mean and is negatively correlated with h_{i_k}(θ), which keeps δ_k unbiased while significantly reducing its variance. The reference variable θ̃ is usually set to be a delayed version of θ: for example, after every M updates of θ, it can be reset to the most recent iterate of θ. Note that there is a trade-off in the choice of M: a smaller M further reduces the variance of δ_k, since θ̃ will be closer to θ and the first two terms in (17) cancel more with each other; on the other hand, it also requires more frequent evaluations of the costly batch term h(θ̃), which has a complexity of O(n).

Algorithm 1 SVRPDA-I
1: Inputs: data {(x_i, y_ij) : 0 ≤ i < n_X, 0 ≤ j < n_{Y_i}}; step-sizes α_θ and α_w; # inner iterations M.
2: Initialization: θ̃_0 ∈ R^d and w̃_0 ∈ R^{ℓ n_X}.
3: for s = 1, 2, ... do
4:   Set θ̃ = θ̃_{s−1}, θ^{(0)} = θ̃, w̃ = w̃_{s−1}, w^{(0)} = w̃_{s−1}, and compute the batch quantities (for each 0 ≤ i < n_X):

         U_0 = \frac{1}{n_X} \sum_{i=0}^{n_X-1} \frac{1}{n_{Y_i}} \sum_{j=0}^{n_{Y_i}-1} f'_{θ̃}(x_i, y_{ij}) \, w_i^{(0)},   \bar{f}_i(θ̃) ≜ \frac{1}{n_{Y_i}} \sum_{j=0}^{n_{Y_i}-1} f_{θ̃}(x_i, y_{ij}),   \bar{f}'_i(θ̃) = \frac{1}{n_{Y_i}} \sum_{j=0}^{n_{Y_i}-1} f'_{θ̃}(x_i, y_{ij}).    (11)

5:   for k = 1 to M do
6:     Randomly sample i_k ∈ {0, ..., n_X − 1} and then j_k ∈ {0, ..., n_{Y_{i_k}} − 1} uniformly.
7:     Compute the stochastic variance reduced gradient for the dual update:

         δ^w_k = f_{θ^{(k−1)}}(x_{i_k}, y_{i_k j_k}) − f_{θ̃}(x_{i_k}, y_{i_k j_k}) + \bar{f}_{i_k}(θ̃).    (12)

8:     Update the dual variables:

         w_i^{(k)} = \arg\min_{w_i} \big[ −⟨δ^w_k, w_i⟩ + φ^*_i(w_i) + \frac{1}{2α_w} ‖w_i − w_i^{(k−1)}‖² \big]  if i = i_k;   w_i^{(k)} = w_i^{(k−1)}  if i ≠ i_k.    (13)

9:     Update U_k (the primal batch gradient at θ̃ and w^{(k)}) according to the recursion:

         U_k = U_{k−1} + \frac{1}{n_X} \bar{f}'_{i_k}(θ̃) \big( w_{i_k}^{(k)} − w_{i_k}^{(k−1)} \big).    (14)

10:    Randomly sample i'_k ∈ {0, ..., n_X − 1} and then j'_k ∈ {0, ..., n_{Y_{i'_k}} − 1}, independent of i_k and j_k, and compute the stochastic variance reduced gradient for the primal update:

         δ^θ_k = f'_{θ^{(k−1)}}(x_{i'_k}, y_{i'_k j'_k}) \, w_{i'_k}^{(k)} − f'_{θ̃}(x_{i'_k}, y_{i'_k j'_k}) \, w_{i'_k}^{(k)} + U_k.    (15)

11:    Update the primal variable:

         θ^{(k)} = \arg\min_{θ} \big[ ⟨δ^θ_k, θ⟩ + g(θ) + \frac{1}{2α_θ} ‖θ − θ^{(k−1)}‖² \big].    (16)

12:   end for
13:   Option I: Set w̃_s = w^{(M)} and θ̃_s = θ^{(M)}.
14:   Option II: Set w̃_s = w^{(M)} and θ̃_s = θ^{(t)} for a randomly sampled t ∈ {0, ..., M − 1}.
15: end for
16: Output: θ̃_s at the last outer-loop iteration.

Based on (17), we develop two stochastic variance reduced estimators, (12) and (15), to approximate the finite sums in (9) and (10), respectively. The dual gradient estimator δ^w_k in (12) is constructed in the standard manner of (17), where the reference variable θ̃ is a delayed version of θ^{(k)}.⁴ On the other hand, the primal gradient estimator δ^θ_k in (15) is constructed using the reference variables (θ̃, w^{(k)}); that is, we use the most recent w^{(k)} as the dual reference variable, without any delay. As discussed earlier, such a choice leads to a smaller variance in the stochastic estimator δ^θ_k at a potentially higher computation cost (from more frequent evaluation of the batch term). Nevertheless, we are able to show that, with the dual coordinate ascent structure in our algorithm, the batch term U_k in (15), which is the summation in (10) evaluated at (θ̃, w^{(k)}), can be computed efficiently. To see this, note that, after each dual update step in (13), only one term inside this summation in (10) has changed, namely the one associated with i = i_k. Therefore, we can correct U_k for this term by using recursion (14), which only requires an extra O(dℓ) complexity per step (the same complexity as (15)).

Note that SVRPDA-I (Algorithm 1) requires computing and storing all the \bar{f}'_i(θ̃) in (11), which is O(n_X dℓ) in storage and could be expensive in some applications. To avoid this cost, we develop a variant of Algorithm 1, named SVRPDA-II (see Algorithm 1 in the supplementary material), by approximating \bar{f}'_{i_k}(θ̃) in (14) with f'_{θ̃}(x_{i_k}, y_{i_k j''_k}), where j''_k is another randomly sampled index from {0, ..., n_{Y_{i_k}} − 1}, independent of all other indexes. By doing this, we can significantly reduce the memory requirement from O(n_X dℓ) in SVRPDA-I to O(d + n_X ℓ) in SVRPDA-II (see Section 4.2). In addition, experimental results in Section 5 show that this approximation causes only a slight performance loss compared to SVRPDA-I.

⁴ As in [7], we also consider Option II, wherein θ̃ is randomly chosen from the previous M iterates θ^{(k)}.

Table 1: The total complexities of different stochastic composition optimization algorithms. For C-SAGA, α = 2/3 in the minibatch setting and α = 1 when batch-size = 1. In the bound for ASCVRG, the dependency on κ has been dropped since it was not reported in [10].

Methods              | General: problem (1)        | Special: problem (2)                     | Special: (2) & n_X = 1
SVRPDA-I (Ours)      | (n_X n_Y + n_X κ) ln(1/ε)   | (n_X + n_Y + n_X κ) ln(1/ε)              | (n_Y + κ) ln(1/ε)
Comp-SVRG [9]        | ✗                           | (n_X + n_Y + κ³) ln(1/ε)                 | (n_Y + κ³) ln(1/ε)
MSPBE-SVRG/SAGA [5]  | ✗                           | ✗                                        | (n_Y + κ²) ln(1/ε)
ASCVRG [10]          | ✗                           | (n_X + n_Y) ln(1/ε) + 1/ε³               | n_Y ln(1/ε) + 1/ε³
C-SAGA [22]          | ✗                           | (n_X + n_Y + (n_X + n_Y)^α κ) ln(1/ε)    | (n_Y + n_Y^α κ) ln(1/ε)

4 Theoretical Analysis

4.1 Computation complexity

We now perform convergence analysis for the SVRPDA-I algorithm and derive its complexity in computation and storage. To begin with, we first introduce the following assumptions.

Assumption 4.1. The function g(θ) is μ-strongly convex in θ, and each φ_i is 1/γ-smooth.

Assumption 4.2.
The merit functions φ_i(u) are Lipschitz with a uniform constant B_w:

    |φ_i(u) − φ_i(u')| ≤ B_w ‖u − u'‖,   ∀u, u'; ∀i = 0, ..., n_X − 1.

Assumption 4.3. f_θ(x_i, y_ij) is B_θ-smooth in θ and has bounded gradients with constant B_f:

    ‖f'_{θ_1}(x_i, y_ij) − f'_{θ_2}(x_i, y_ij)‖ ≤ B_θ ‖θ_1 − θ_2‖,   ‖f'_θ(x_i, y_ij)‖ ≤ B_f,   ∀θ, θ_1, θ_2, ∀i, j.

Assumption 4.4. For each given w in its domain, the function L(θ, w) defined in (8) is convex in θ:

    L(θ_1, w) − L(θ_2, w) ≥ ⟨L'_θ(θ_2, w), θ_1 − θ_2⟩,   ∀θ_1, θ_2.

The above assumptions are commonly used in existing compositional optimization works [9, 10, 18, 19, 22]. Based on these assumptions, we establish non-asymptotic error bounds for SVRPDA-I (using either Option I or Option II in Algorithm 1). The main results are summarized in the following theorems, and their proofs can be found in Appendix E.

Theorem 4.5. Suppose Assumptions 4.1–4.4 hold. If in Algorithm 1 (with Option I) we choose

    α_θ = \frac{1}{n_X μ (64κ + 1)},   α_w = \frac{γ}{n_X μ (64κ + 3)},   M = ⌈78.8 n_X κ + 1.3 n_X + 1.3⌉,

where ⌈x⌉ denotes the roundup operation and κ = B_f²/(γμ) + B_w² B_θ²/μ², then the Lyapunov function

    P_s := E‖θ̃_s − θ*‖² + \frac{γ}{μ} · \frac{1}{64 n_X κ + n_X + 1} E‖w̃_s − w*‖²

satisfies P_s ≤ (3/4)^s P_0. Furthermore, the overall computational cost (in number of oracle calls⁵) for reaching P_s ≤ ε is upper bounded by

    O((n_X n_Y + n_X κ + n_X) ln(1/ε)),    (18)

where, with a slight abuse of notation, n_Y is defined as n_Y = (n_{Y_0} + ··· + n_{Y_{n_X−1}})/n_X.

Theorem 4.6. Suppose Assumptions 4.1–4.4 hold. If in Algorithm 1 (with Option II) we choose

    α_θ = \Big( \frac{25 B_f²}{γ} + 10 B_θ B_w + \frac{80 B_w² B_θ²}{μ} \Big)^{−1},   α_w = \frac{μ}{40 B_f²},   M = \max\Big( \frac{10}{α_θ μ}, \frac{2 n_X}{α_w γ}, 4 n_X \Big),

then P_s := E‖θ̃_s − θ*‖² + \frac{γ}{n_X μ} E‖w̃_s − w*‖² satisfies P_s ≤ (5/8)^s P_0. Furthermore, let κ = B_f²/(γμ) + B_w² B_θ²/μ² + B_θ B_w/μ. Then, the overall computational cost (in number of oracle calls) for reaching P_s ≤ ε is upper bounded by

    O((n_X n_Y + n_X κ + n_X) ln(1/ε)).    (19)

The above theorems show that the Lyapunov function P_s for SVRPDA-I converges to zero at a linear rate when either Option I or II is used.
Since E‖θ̃_s − θ*‖² ≤ P_s, they imply that the computational cost (in number of oracle calls) for reaching E‖θ̃_s − θ*‖² ≤ ε is also upper bounded by (18) and (19).

⁵ One oracle call is defined as querying f_θ, f'_θ, or φ_i(u) for any 0 ≤ i < n and u ∈ R^ℓ.

Table 2: The storage complexity of SVRPDA-I and SVRPDA-II.

Methods     | θ^{(k)} | {w_i^{(k)}} | {\bar{f}'_i} | U_k  | δ^θ_k | {\bar{f}_i} | θ̃    | δ^w_k | Total
SVRPDA-I    | O(d)    | O(n_X ℓ)    | O(n_X dℓ)    | O(d) | O(d)  | O(n_X ℓ)    | O(d)  | O(ℓ)  | O(n_X dℓ)
SVRPDA-II   | O(d)    | O(n_X ℓ)    | ✗            | O(d) | O(d)  | O(n_X ℓ)    | O(d)  | O(ℓ)  | O(d + n_X ℓ)

Comparison with existing composition optimization algorithms. Table 1 summarizes the complexity bounds for our SVRPDA-I algorithm and compares them with existing stochastic composition optimization algorithms. First, to the best of our knowledge, none of the existing methods consider the general objective function (1) as we do. Instead, they consider its special case (2), and even in this special case, our algorithm still has a better (or comparable) complexity bound than other methods. For example, our bound is better than that of [9] since κ² > n_X generally holds, and it is better than that of ASCVRG, which does not achieve a linear convergence rate (as no strong convexity is assumed). In addition, our method has better complexity than the C-SAGA algorithm when n_X = 1 (regardless of the mini-batch size in C-SAGA), and it is better than C-SAGA for (2) when the mini-batch size is 1.⁶ However, since we have not derived our bound for the mini-batch setting, it is unclear which one is better in that case; this is an interesting topic for future work.
One notable fact from Table 1 is that in the special case (2), the complexity of SVRPDA-I is reduced from O((n_X n_Y + n_X κ) ln(1/ε)) to O((n_X + n_Y + n_X κ) ln(1/ε)). This is because the complexity of evaluating the batch quantities in (11) (Algorithm 1) drops from O(n_X n_Y) in the general case (1) to O(n_X + n_Y) in the special case (2). To see this, note that f_θ and n_{Y_i} = n_Y become independent of i in (2) and (11), meaning that we can factor U_0 in (11) as

    U_0 = \frac{1}{n_X n_Y} \sum_{j=0}^{n_Y-1} f'_{θ̃}(y_j) \sum_{i=0}^{n_X-1} w_i^{(0)},

where the two sums can be evaluated independently with complexity O(n_Y) and O(n_X), respectively. The other two quantities in (11) need only O(n_Y) due to their independence of i. Second, we consider the further special case of (2) with n_X = 1, which simplifies the objective function (1) so that there is no empirical average outside φ_i(·). This takes the form of the unsupervised learning objective that appears in [12]. Note that our result O((n_Y + κ) ln(1/ε)) enjoys a linear convergence rate (i.e., log-dependency on ε) due to the variance reduction technique. In contrast, the stochastic primal-dual gradient (SPDG) method in [12], which does not use variance reduction, can only achieve a sublinear convergence rate (i.e., O(1/ε)).

Relation to SPDC [23]. Lastly, we consider the case where n_{Y_i} = 1 for all 1 ≤ i ≤ n_X and f_θ is a linear function of θ. This simplifies (1) to the problem considered in [23], known as regularized empirical risk minimization of linear predictors. It has applications in support vector machines, regularized logistic regression, and more, depending on how the merit function φ_i is defined. In this special case, the overall complexity of SVRPDA-I becomes (see Appendix F):

    O((n_X + κ) ln(1/ε)),    (20)

where the condition number κ = B_f²/(μγ). In comparison, the authors in [23] propose a stochastic primal dual coordinate (SPDC) algorithm for this special case and prove an overall complexity of O((n_X + √(n_X κ)) ln(1/ε)) to achieve an ε-error solution. It is interesting to note that the complexity result in (20) and the complexity result in [23] differ only in their dependency on κ. This difference is most likely due to the acceleration technique employed in the primal update of the SPDC algorithm. We conjecture that the dependency of SVRPDA-I on the condition number can be further improved using a similar acceleration technique.

4.2 Storage complexity

We now briefly discuss and compare the storage complexities of SVRPDA-I and SVRPDA-II. In Table 2, we report the itemized and total storage complexities of both algorithms, which shows that SVRPDA-II significantly reduces the memory footprint. We also observe that the batch quantities in (11), especially \bar{f}'_i(θ̃), dominate the storage complexity of SVRPDA-I. On the other hand, the memory usage in SVRPDA-II is more uniformly distributed over the different quantities. Furthermore, although the total complexity of SVRPDA-II, O(d + n_X ℓ), grows with the number of samples n_X, the n_X ℓ term is relatively small because the dimension ℓ is small in many practical problems (e.g., ℓ = 1 in (3) and (4)). This is similar to the storage requirement in SPDC [23] and SAGA [4].

⁶ In Appendix D, we also show that our algorithms outperform C-SAGA in experiments.

Figure 1: Performance of different algorithms on the risk-averse learning for portfolio management optimization problem.
The performance is measured in terms of the number of oracle calls required to achieve a certain objective gap.

5 Experiments

In this section, we consider the problem of risk-averse learning for portfolio management optimization [9, 10], introduced in Section 2.⁷ Specifically, we want to solve the optimization problem (4) for a given set of reward vectors {xi ∈ ℝd : 0 ≤ i ≤ n − 1}. As discussed in Section 2, we adopt the alternative formulation (6) for the second term so that it becomes a special case of our general problem (1). Then, we rewrite the cost function as a min-max problem by following the argument in Section 3.1 and apply our SVRPDA-I and SVRPDA-II algorithms (see Appendix C.1 for the details).

We evaluate our algorithms on 18 real-world US Research Returns datasets obtained from the Center for Research in Security Prices (CRSP) website⁸, with the same setup as in [10]. In each of these datasets, we have d = 25 and n = 7240. We compare the performance of our proposed SVRPDA-I and SVRPDA-II algorithms⁹ with the following state-of-the-art algorithms designed to solve composition optimization problems: (i) Compositional-SVRG-1 (Algorithm 2 of [9]), (ii) Compositional-SVRG-2 (Algorithm 3 of [9]), (iii) full-batch gradient descent, and (iv) the ASCVRG algorithm [10]. For the compositional-SVRG algorithms, we follow [9] and formulate the problem as a special case of the form (2) by using the identification (5). Note that we cannot use the identification (6) for the compositional-SVRG algorithms, because it would lead to the more general formulation (1) with fθ depending on both xi and yij ≡ xj. For further details, the reader is referred to [9].

As in previous works, we compare the different algorithms based on the number of oracle calls required to achieve a certain objective gap (the difference between the objective function evaluated at the current iterate and at the optimal parameters).
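This evaluation protocol can be sketched as follows. The quadratic objective and its known minimizer below are hypothetical stand-ins (the actual objective is the portfolio cost (4)); the point is only to show how oracle calls are counted while the objective gap is measured:

```python
import numpy as np

# Count every objective evaluation as oracle accesses, and report the
# objective gap F(theta) - F(theta*). The toy cost and its minimizer are
# hypothetical stand-ins for the portfolio objective (4).
calls = {"n": 0}

def objective(theta):
    calls["n"] += 1                           # one oracle access per evaluation
    return float(np.sum((theta - 3.0) ** 2))  # toy strongly convex cost

theta_star = np.full(4, 3.0)                  # known minimizer of the toy cost
theta_k = np.zeros(4)                         # a current iterate
gap = objective(theta_k) - objective(theta_star)   # objective gap (>= 0)
```

In the real experiments, `calls["n"]` would accumulate all accesses to fθ, f′θ, and φi made by each algorithm, and the gap is plotted against that count.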
One oracle call is defined as accessing the function fθ, its derivative f′θ, or φi(u) for any 0 ≤ i < n and u ∈ ℝℓ. The results, shown in Figure 1, demonstrate that our proposed algorithms significantly outperform the baseline methods on all datasets. In addition, we observe that SVRPDA-II also converges at a linear rate, and that the performance loss caused by its approximation is small relative to SVRPDA-I.

⁷Additional experiments on the application to policy evaluation in MDPs can be found in Appendix D.
⁸The processed data in the form of a .mat file was obtained from https://github.com/tyDLin/SCVRG
⁹The choice of the hyper-parameters can be found in Appendix C.2, and the code will be released publicly.

[Figure 1 panels: OP datasets, ME datasets, INV datasets.]

6 Related Works

Composition optimization has attracted significant attention in the optimization literature. The stochastic version of problem (2), where the empirical averages are replaced by expectations, is studied in [18]. The authors propose a two-timescale stochastic approximation algorithm known as SCGD, and establish sublinear convergence rates. In [19], the authors propose the ASC-PG algorithm, which uses a proximal gradient method to deal with nonsmooth regularizations. The works most closely related to our setting are [9] and [10], which consider a finite-sum minimization problem (2) (a special case of our general formulation (1)). In [9], the authors propose the compositional-SVRG methods, which combine SCGD with the SVRG technique from [7] and obtain linear convergence rates. In [10], the authors propose the ASCVRG algorithm, which extends to convex but non-smooth objectives. Recently, the authors in [22] propose a C-SAGA algorithm to solve the special case of (2) with nX = 1, and extend it to general nX.
Different from these works, we take an efficient primal-dual approach that fully exploits the dual decomposition and the finite-sum structures.

On the other hand, problems similar to (1) (and its stochastic versions) have also been examined in specific machine learning settings. [16] considers the minimization of the mean squared projected Bellman error (MSPBE) for policy evaluation, which has an expectation inside a quadratic loss. The authors propose a two-timescale stochastic approximation algorithm, GTD2, and establish its asymptotic convergence. [11] and [13] independently showed that GTD2 is a stochastic gradient method for solving an equivalent saddle-point problem. In [2] and [3], the authors derive saddle-point formulations for two other variants of the cost (MSBE and MSCBE) in the policy evaluation and control settings, and develop the corresponding stochastic primal-dual algorithms. All these works consider the stochastic version of composition optimization, and the proposed algorithms have sublinear convergence rates. In [5], different variance reduction methods are developed to solve the finite-sum version of MSPBE, achieving a linear rate even without strongly convex regularization. The authors in [6] then extend these linear convergence results to general convex-concave problems with linear coupling and without strong convexity. Problems of the form (1) were also studied in the context of unsupervised learning [12, 21] in the stochastic setting (i.e., with expectations in (1)).

Finally, our work is inspired by the stochastic variance reduction techniques in optimization [8, 7, 4, 1, 23], which consider the minimization of a cost that is a finite sum of many component functions. Different versions of variance-reduced stochastic gradients are constructed in these works to achieve linear convergence rates.
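To make the variance reduction idea concrete, the following is a minimal SVRG-style estimator for a generic finite sum min_θ (1/n) Σi fi(θ), here with toy least-squares components. This sketches the control-variate technique of [7] only; it is not the SVRPDA-I/II update rule:

```python
import numpy as np

# Minimal SVRG sketch on min_theta (1/n) sum_i f_i(theta), with the toy
# components f_i(theta) = 0.5 * (a_i @ theta - b_i)^2. Illustrates the
# control-variate idea of SVRG [7]; NOT the SVRPDA-I/II updates.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad_i(theta, i):                 # gradient of one component f_i
    return (A[i] @ theta - b[i]) * A[i]

def full_grad(theta):                 # batch gradient (once per outer epoch)
    return A.T @ (A @ theta - b) / n

theta = np.zeros(d)
step = 0.01                           # hypothetical step size
for epoch in range(50):               # outer loop: refresh the snapshot
    snap = theta.copy()
    mu = full_grad(snap)              # batch gradient at the snapshot
    for _ in range(2 * n):            # inner loop: variance-reduced SGD steps
        i = rng.integers(n)
        # Control variate: v is unbiased for full_grad(theta), and its
        # variance vanishes as theta and snap approach the optimum.
        v = grad_i(theta, i) - grad_i(snap, i) + mu
        theta -= step * v
```

With constant step size, the variance-reduced steps drive the gradient norm to near zero, which is exactly the mechanism that yields the linear rates cited above.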
In particular, our variance-reduced stochastic estimators are constructed based on the idea of SVRG [7], with a novel design of the control variates. Our work is also related to the SPDC algorithm [23], which likewise integrates dual coordinate ascent with a variance-reduced primal gradient. However, our work differs from SPDC in the following aspects. First, we consider the more general composition optimization problem (1), while SPDC focuses on regularized empirical risk minimization with linear predictors, i.e., nYi ≡ 1 and fθ linear in θ. Second, because of the composition structure of the problem, our algorithms also need SVRG in the dual coordinate ascent update, while SPDC does not. Third, the primal update in SPDC is specifically designed for linear predictors; in contrast, our work is not restricted to linear predictors, thanks to a novel variance-reduced gradient.

7 Conclusions and Future Work

We developed a stochastic primal-dual algorithm, SVRPDA-I, to efficiently solve the empirical composition optimization problem. This is achieved by fully exploiting the rich structures inherent in the reformulated min-max problem, namely the dual decomposition and the finite-sum structures. The algorithm alternates between (i) a dual step of stochastic variance-reduced coordinate ascent and (ii) a primal step of stochastic variance-reduced gradient descent. In particular, we proposed a novel variance-reduced gradient for the primal update, which achieves better variance reduction with low complexity. We derived a non-asymptotic bound for the error sequence and showed that it converges at a linear rate when the problem is strongly convex. Moreover, we also developed an approximate version of the algorithm, named SVRPDA-II, which further reduces the storage complexity.
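The alternating primal-dual pattern described above can be illustrated on a toy strongly-convex-strongly-concave saddle problem; this is plain deterministic gradient descent-ascent for illustration only, whereas the actual SVRPDA-I steps use variance-reduced stochastic gradients over the finite sums:

```python
# Alternating dual-ascent / primal-descent on the toy saddle problem
# L(x, y) = 0.5*x^2 + x*y - 0.5*y^2, whose unique saddle point is (0, 0).
# The step size eta = 0.1 is hypothetical; this is NOT the SVRPDA-I update.
x, y, eta = 1.0, -1.0, 0.1
for _ in range(500):
    y += eta * (x - y)    # dual step: gradient ascent in y   (dL/dy = x - y)
    x -= eta * (x + y)    # primal step: gradient descent in x (dL/dx = x + y)
```

For this problem the alternation contracts linearly to the saddle point, mirroring (in a deterministic toy) the linear rate established for SVRPDA-I on strongly convex problems.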
Experimental results on several real-world benchmarks showed that both SVRPDA-I and SVRPDA-II significantly outperform existing techniques on all these tasks, and that the approximation in SVRPDA-II causes only a slight performance loss. Future extensions of our work include the theoretical analysis of SVRPDA-II, the generalization of our algorithms to Bregman divergences, and applications to large-scale machine learning problems with non-convex cost functions (e.g., unsupervised sequence classification).

References

[1] P. Balamurugan and F. Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pages 1416–1424, 2016.

[2] B. Dai, N. He, Y. Pan, B. Boots, and L. Song. Learning from conditional distributions via dual embeddings. In Artificial Intelligence and Statistics, pages 1458–1467, 2017.

[3] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In Proc. International Conference on Machine Learning, pages 1133–1142, 2018.

[4] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[5] S. S. Du, J. Chen, L. Li, L. Xiao, and D. Zhou. Stochastic variance reduction methods for policy evaluation. In Proc. International Conference on Machine Learning, pages 1049–1058, 2017.

[6] S. S. Du and W. Hu. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. In Proc. International Conference on Artificial Intelligence and Statistics, pages 196–205, 2019.

[7] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction.
In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[8] N. Le Roux, M. W. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2012.

[9] X. Lian, M. Wang, and J. Liu. Finite-sum composition optimization via variance reduced gradient descent. In Proc. International Conference on Artificial Intelligence and Statistics, pages 1159–1167, 2017.

[10] T. Lin, C. Fan, M. Wang, and M. I. Jordan. Improved oracle complexity for stochastic compositional variance reduced gradient. arXiv preprint arXiv:1806.00458, 2018.

[11] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-sample analysis of proximal gradient TD algorithms. In Proc. Conference on Uncertainty in Artificial Intelligence, pages 504–513, 2015.

[12] Y. Liu, J. Chen, and L. Deng. Unsupervised sequence classification using sequential output statistics. In Advances in Neural Information Processing Systems, pages 3550–3559, 2017.

[13] S. V. Macua, J. Chen, S. Zazo, and A. H. Sayed. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control, 60(5):1260–1274, 2015.

[14] R. T. Rockafellar. Convex Analysis. Princeton University Press, 2015.

[15] A. Ruszczyński and A. Shapiro. Optimization of risk measures. In Probabilistic and Randomized Methods for Design under Uncertainty, pages 119–157. Springer, 2006.

[16] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proc. International Conference on Machine Learning, pages 993–1000, 2009.

[17] V. Vapnik. Statistical Learning Theory, volume 3. Wiley, New York, 1998.

[18] M. Wang, E. X.
Fang, and H. Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017.

[19] M. Wang, J. Liu, and E. Fang. Accelerating stochastic composition optimization. In Advances in Neural Information Processing Systems, pages 1714–1722, 2016.

[20] T. Xie, B. Liu, Y. Xu, M. Ghavamzadeh, Y. Chow, D. Lyu, and D. Yoon. A block coordinate ascent algorithm for mean-variance optimization. In Advances in Neural Information Processing Systems, pages 1065–1075, 2018.

[21] C.-K. Yeh, J. Chen, C. Yu, and D. Yu. Unsupervised speech recognition via segmental empirical output distribution matching. In Proc. International Conference on Learning Representations, 2019.

[22] J. Zhang and L. Xiao. A composite randomized incremental gradient method. In Proc. International Conference on Machine Learning, pages 7454–7462, 2019.

[23] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. Journal of Machine Learning Research, 18(1):2939–2980, 2017.