{"title": "Stochastic Three-Composite Convex Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 4329, "page_last": 4337, "abstract": "We propose a stochastic optimization method for the minimization of the sum of three convex functions, one of which has a Lipschitz continuous gradient as well as restricted strong convexity. Our approach is most suitable in the setting where it is computationally advantageous to process the smooth term in the decomposition with its stochastic gradient estimate and the other two functions separately with their proximal operators, such as doubly regularized empirical risk minimization problems. We prove the convergence characterization of the proposed algorithm in expectation under the standard assumptions for the stochastic gradient estimate of the smooth term. Our method operates in the primal space and can be considered as a stochastic extension of the three-operator splitting method. Finally, numerical evidence supports the effectiveness of our method in real-world problems.", "full_text": "Stochastic Three-Composite Convex Minimization\n\nAlp Yurtsever, B\u1eb1ng C\u00f4ng V\u0169, and Volkan Cevher\n\nLaboratory for Information and Inference Systems (LIONS)\n\n\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne, Switzerland\n\nalp.yurtsever@epfl.ch, bang.vu@epfl.ch, volkan.cevher@epfl.ch\n\nAbstract\n\nWe propose a stochastic optimization method for the minimization of the sum of\nthree convex functions, one of which has a Lipschitz continuous gradient as well\nas restricted strong convexity. Our approach is most suitable in the setting where\nit is computationally advantageous to process the smooth term in the decomposition\nwith its stochastic gradient estimate and the other two functions separately with\ntheir proximal operators, such as doubly regularized empirical risk minimization\nproblems. 
We prove the convergence characterization of the proposed algorithm in\nexpectation under the standard assumptions for the stochastic gradient estimate of\nthe smooth term. Our method operates in the primal space and can be considered as\na stochastic extension of the three-operator splitting method. Numerical evidence\nsupports the effectiveness of our method in real-world problems.\n\n1\n\nIntroduction\n\nWe propose a stochastic optimization method for the three-composite minimization problem:\n\nminimize\n\nx\u2208Rd\n\nf (x) + g(x) + h(x),\n\n(1)\nwhere f : Rd \u2192 R and g : Rd \u2192 R are proper, lower semicontinuous convex functions that admit\ntractable proximal operators, and h : Rd \u2192 R is a smooth function with restricted strong convexity.\nWe assume that we have access to unbiased, stochastic estimates of the gradient of h in the sequel,\nwhich is key to scale up optimization and to address streaming settings where data arrive in time.\nTemplate (1) covers a large number of applications in machine learning, statistics, and signal process-\ning by appropriately choosing the individual terms. Operator splitting methods are powerful in this\nsetting, since they reduce the complex problem (1) into smaller subproblems. These algorithms are\neasy to implement, and they typically exhibit state-of-the-art performance.\nTo our knowledge, there is no operator splitting framework that can currently tackle template (1)\nusing stochastic gradient of h and the proximal operators of f and g separately, which is critical to\nthe scalability of the methods. This paper speci\ufb01cally bridges this gap.\nOur basic framework is closely related to the deterministic three operator splitting method proposed\nin [11], but we avoid the computation of the gradient \u2207h and instead work with its unbiased estimates.\nWe provide rigorous convergence guarantees for our approach and provide guidance in selecting the\nlearning rate under different scenarios.\nRoad map. 
Section 2 introduces the basic optimization background. Section 3 then presents the\nmain algorithm and provides its convergence characterization. Section 4 places our contributions in\nlight of the existing work. Numerical evidence that illustrates our theory appears in Section 5. We\nrelegate the technical proofs to the supplementary material.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f2 Notation and background\nThis section recalls a few basic notions from convex analysis and probability theory, and\npresents the notation used in the rest of the paper. Throughout, \u03930(Rd) denotes the set of all proper,\nlower semicontinuous convex functions from Rd to ]\u2212\u221e, +\u221e], and \u27e8\u00b7 | \u00b7\u27e9 is the standard scalar\nproduct on Rd with its associated norm \u2016 \u00b7 \u2016.\nSubdifferential. The subdifferential of f \u2208 \u03930(Rd) at a point x \u2208 Rd is defined as\n\n\u2202f (x) = {u \u2208 Rd | f (y) \u2212 f (x) \u2265 \u27e8y \u2212 x | u\u27e9, \u2200y \u2208 Rd}.\n\nWe denote the domain of \u2202f as\n\ndom(\u2202f ) = {x \u2208 Rd | \u2202f (x) \u2260 \u2205}.\n\nIf \u2202f (x) is a singleton, then f is differentiable at x, and \u2202f (x) = {\u2207f (x)}.\nIndicator function. Given a nonempty subset C of Rd, the indicator function of C is given by\n\n\u03b9C(x) = 0 if x \u2208 C, and \u03b9C(x) = +\u221e if x \u2209 C.\n\n(2)\n\nProximal operator. The proximal operator of a function f \u2208 \u03930(Rd) is defined as\n\nproxf (x) = arg min_{z\u2208Rd} { f (z) + (1/2)\u2016z \u2212 x\u2016\u00b2 }.\n\n(3)\n\nRoughly speaking, the proximal operator is tractable when the computation of (3) is cheap. If f is\nthe indicator function of a nonempty, closed convex subset C, its proximal operator is the projection\noperator onto C.\nLipschitz continuous gradient. A function f \u2208 \u03930(Rd) has a Lipschitz continuous gradient with\nLipschitz constant L > 0 (or simply an L-Lipschitz gradient), if\n\n\u2016\u2207f (x) \u2212 \u2207f (y)\u2016 \u2264 L\u2016x \u2212 y\u2016, \u2200x, y \u2208 Rd.\n\nStrong convexity. A function f \u2208 \u03930(Rd) is called strongly convex with parameter \u00b5 > 0 (or\nsimply \u00b5-strongly convex), if\n\n\u27e8p \u2212 q | x \u2212 y\u27e9 \u2265 \u00b5\u2016x \u2212 y\u2016\u00b2, \u2200x, y \u2208 dom(\u2202f ), \u2200p \u2208 \u2202f (x), \u2200q \u2208 \u2202f (y).\n\nSolution set. We denote an optimal point of (1) by x\u22c6, and the solution set by X\u22c6:\n\nx\u22c6 \u2208 X\u22c6 = {x \u2208 Rd | 0 \u2208 \u2207h(x) + \u2202g(x) + \u2202f (x)}.\n\nThroughout this paper, we assume that X\u22c6 is not empty.\nRestricted strong convexity. A function f \u2208 \u03930(Rd) has restricted strong convexity with respect to\na point x\u22c6 in a set M \u2282 dom(\u2202f ), with parameter \u00b5 > 0, if\n\n\u27e8p \u2212 q | x \u2212 x\u22c6\u27e9 \u2265 \u00b5\u2016x \u2212 x\u22c6\u2016\u00b2, \u2200x \u2208 M, \u2200p \u2208 \u2202f (x), \u2200q \u2208 \u2202f (x\u22c6).\n\nLet (\u2126, F, P) be a probability space. An Rd-valued random variable is a measurable function\nx : \u2126 \u2192 Rd, where Rd is endowed with the Borel \u03c3-algebra. We denote by \u03c3(x) the \u03c3-field\ngenerated by x. The expectation of a random variable x is denoted by E[x]. The conditional\nexpectation of x given a \u03c3-field A \u2282 F is denoted by E[x|A]. Given a random variable y : \u2126 \u2192 Rd,\nthe conditional expectation of x given y is denoted by E[x|y]. See [17] for more details on probability\ntheory. 
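To make the notion of a tractable proximal operator concrete, here is a minimal Python sketch (illustrative only, not from the paper) of two closed-form cases that reappear in the experiments: the projection onto a box [0, C]^d, which is the prox of its indicator, and the projection onto the hyperplane {x | b^T x = 0}. Both closed forms are standard; NumPy is assumed.

```python
import numpy as np

def prox_box(x, C):
    """Prox of the indicator of [0, C]^d: componentwise clipping."""
    return np.clip(x, 0.0, C)

def prox_hyperplane(x, b):
    """Prox of the indicator of {x : b^T x = 0}: orthogonal projection."""
    return x - (b @ x) / (b @ b) * b

x = np.array([1.5, -0.3, 0.4])
p = prox_box(x, 1.0)                              # -> [1.0, 0.0, 0.4]
q = prox_hyperplane(x, np.array([1.0, 1.0, 1.0]))  # q sums to zero by construction
```

Both operators cost O(d), which is what makes splitting schemes attractive: each constraint is handled by a cheap projection instead of a joint one.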
An Rd-valued random process is a sequence (xn)n\u2208N of Rd-valued random variables.\n\n3 Stochastic three-composite minimization algorithm and its analysis\n\nWe present the stochastic three-composite minimization method (S3CM) in Algorithm 1, for solving the\nthree-composite template (1). Our approach combines the stochastic gradient of h, denoted as r, and\nthe proximal operators of f and g in essentially the same structure as the three-operator splitting\nmethod [11, Algorithm 2]. Our technique is a nontrivial combination of the algorithmic framework\nof [11] with stochastic analysis.\n\n2\n\n\fAlgorithm 1 Stochastic three-composite minimization algorithm (S3CM)\n\nInput: An initial point xf,0, a sequence of learning rates (\u03b3n)n\u2208N, and a sequence of square-integrable Rd-valued stochastic gradient estimates (rn)n\u2208N.\nInitialization:\nxg,0 = prox\u03b30g(xf,0)\nug,0 = \u03b30^{\u22121}(xf,0 \u2212 xg,0)\nMain loop:\nfor n = 0, 1, 2, . . . do\nxg,n+1 = prox\u03b3ng(xf,n + \u03b3nug,n)\nug,n+1 = \u03b3n^{\u22121}(xf,n \u2212 xg,n+1) + ug,n\nxf,n+1 = prox\u03b3n+1f (xg,n+1 \u2212 \u03b3n+1ug,n+1 \u2212 \u03b3n+1rn+1)\nend for\nOutput: xg,n as an approximation of an optimal solution x\u22c6.\n\nTheorem 1 Assume that h is \u00b5h-strongly convex and has an L-Lipschitz continuous gradient. Further\nassume that g is \u00b5g-strongly convex, where we allow \u00b5g = 0. Consider the following update rule for\nthe learning rate:\n\n\u03b3n+1 = (\u2212\u03b3n\u00b2\u00b5h\u03b7 + \u221a((\u03b3n\u00b2\u00b5h\u03b7)\u00b2 + (1 + 2\u03b3n\u00b5g)\u03b3n\u00b2)) / (1 + 2\u03b3n\u00b5g),\n\nfor some \u03b30 > 0 and \u03b7 \u2208 ]0, 1[.\nDefine Fn = \u03c3((xf,k)0\u2264k\u2264n), and suppose that the following conditions hold for every n \u2208 N:\n\n1. E[rn+1|Fn] = \u2207h(xg,n+1) almost surely,\n\n2. There exist c \u2208 [0, +\u221e[ and t \u2208 R that satisfy \u2211_{k=0}^n E[\u2016rk \u2212 \u2207h(xg,k)\u2016\u00b2] \u2264 c n^t.\n\nThen, the iterates of S3CM satisfy\n\nE[\u2016xg,n \u2212 x\u22c6\u2016\u00b2] = O(1/n\u00b2) + O(1/n^{2\u2212t}).\n\n(4)\n\nRemark 1 The variance condition of the stochastic gradient estimates in the theorems above is\nsatisfied when E[\u2016rn \u2212 \u2207h(xg,n)\u2016\u00b2] \u2264 c for all n \u2208 N and for some constant c \u2208 [0, +\u221e[. See\n[15, 22, 26] for details.\n\nRemark 2 When rn = \u2207h(xn), S3CM reduces to the deterministic three-operator splitting scheme\n[11, Algorithm 2] and we recover the convergence rate O(1/n\u00b2) as in [11]. When g is zero, S3CM\nreduces to the standard stochastic proximal point algorithm [2, 13, 26].\n\nRemark 3 The learning rate sequence (\u03b3n)n\u2208N in Theorem 1 depends on the strong convexity parameter\n\u00b5h, which may not be available a priori. Our next result avoids the explicit reliance on the strong\nconvexity parameter, while providing essentially the same convergence rate.\n\nTheorem 2 Assume that h is \u00b5h-strongly convex and has an L-Lipschitz continuous gradient. Consider a positive decreasing learning rate sequence \u03b3n = \u0398(1/n^\u03b1) for some \u03b1 \u2208 ]0, 1], and denote\n\u03b2 = lim_{n\u2192\u221e} 2\u00b5h n^\u03b1 \u03b3n.\nDefine Fn = \u03c3((xf,k)0\u2264k\u2264n), and suppose that the following conditions hold for every n \u2208 N:\n\n1. E[rn+1|Fn] = \u2207h(xg,n+1) almost surely,\n2. E[\u2016rn \u2212 \u2207h(xg,n)\u2016\u00b2] is uniformly bounded by some positive constant.\n3. 
E[\u2016ug,n \u2212 x\u22c6\u2016\u00b2] is uniformly bounded by some positive constant.\n\nThen, the iterates of S3CM satisfy\n\nE[\u2016xg,n \u2212 x\u22c6\u2016\u00b2] = O(1/n^\u03b1) if 0 < \u03b1 < 1; O(1/n^\u03b2) if \u03b1 = 1 and \u03b2 < 1; O((log n)/n) if \u03b1 = 1 and \u03b2 = 1; O(1/n) if \u03b1 = 1 and \u03b2 > 1.\n\n3\n\n\fProof outline. We consider the proof of the three-operator splitting method as a baseline, and we use\nstochastic fixed point theory to derive the convergence of the iterates via a stochastic Fej\u00e9r\nmonotone sequence. See the supplement for the complete proof.\n\nRemark 4 Note that ug,n \u2208 \u2202g(xg,n). Hence, we can replace condition 3 in Theorem 2 with the\nbounded subgradient assumption: \u2016p\u2016 \u2264 c, \u2200p \u2208 \u2202g(xg,n), for some positive constant c.\n\nRemark 5 (Restricted strong convexity) Let M be a subset of Rd that contains (xg,n)n\u2208N and x\u22c6.\nSuppose that h has restricted strong convexity on M with parameter \u00b5h. Then, Theorems 1 and 2\nstill hold. An example role of the restricted strong convexity assumption in algorithmic convergence\ncan be found in [1, 21].\n\nRemark 6 (Extension to an arbitrary number of non-smooth terms) Using the product space technique [5, Section 6.1], S3CM can be applied to composite problems with an arbitrary number of\nnon-smooth terms:\n\nminimize_{x\u2208Rd} \u2211_{i=1}^m fi(x) + h(x),\n\nwhere fi : Rd \u2192 R are proper, lower semicontinuous convex functions, and h : Rd \u2192 R is a smooth\nfunction with restricted strong convexity. We present this variant in Algorithm 2. Theorems 1 and 2\nhold for this variant, replacing xg,n by xn, and ug,n by ui,n for i = 1, 2, . . . , m.\n\nAlgorithm 2 Stochastic m(ulti)-composite minimization algorithm (SmCM)\n\nInput: Initial points {xf1,0, xf2,0, . . . 
, xfm,0}, a sequence of learning rates (\u03b3n)n\u2208N, and a sequence of square-integrable Rd-valued stochastic gradient estimates (rn)n\u2208N.\nInitialization:\nx0 = m^{\u22121} \u2211_{i=1}^m xfi,0\nfor i = 1, 2, . . . , m do\nui,0 = \u03b30^{\u22121}(xfi,0 \u2212 x0)\nend for\nMain loop:\nfor n = 0, 1, 2, . . . do\nxn+1 = m^{\u22121} \u2211_{i=1}^m (xfi,n + \u03b3nui,n)\nfor i = 1, 2, . . . , m do\nui,n+1 = \u03b3n^{\u22121}(xfi,n \u2212 xn+1) + ui,n\nxfi,n+1 = prox\u03b3n+1mfi(xn+1 \u2212 \u03b3n+1ui,n+1 \u2212 \u03b3n+1rn+1)\nend for\nend for\nOutput: xn as an approximation of an optimal solution x\u22c6.\n\nRemark 7 With a proper learning rate, S3CM still converges under mild assumptions even if h is not (restricted) strongly\nconvex. Suppose that h has an L-Lipschitz continuous gradient. Set the learning\nrate such that \u03b5 \u2264 \u03b3n \u2261 \u03b3 \u2264 \u03b1(2L^{\u22121} \u2212 \u03b5), for some \u03b1 and \u03b5 in ]0, 1[. Define Fn = \u03c3((xf,k)0\u2264k\u2264n),\nand suppose that the following conditions hold for every n \u2208 N:\n\n1. E[rn+1|Fn] = \u2207h(xg,n+1) almost surely.\n\n2. \u2211_{n\u2208N} E[\u2016rn+1 \u2212 \u2207h(xg,n+1)\u2016\u00b2|Fn] < +\u221e almost surely.\n\nThen, (xg,n)n\u2208N converges to an X\u22c6-valued random vector almost surely. See [7] for details.\n\nRemark 8 All the results above hold in any separable Hilbert space, except that the strong convergence in Remark 7 is replaced by weak convergence. Note, however, that extending Remark 7 to\nthe variable metric setting as in [10, 27] is an open problem.\n\n4\n\n\f4 Contributions in the light of prior work\n\nRecent algorithms in operator splitting, such as generalized forward-backward splitting [24],\nforward-Douglas-Rachford splitting [5], and the three-operator splitting [11], apply to our problem\ntemplate (1). 
These key results, however, are in the deterministic setting.\nOur basic framework can be viewed as a combination of the three-operator splitting method in [11]\nwith the stochastic analysis.\nThe idea of using unbiased estimates of the gradient dates back to [25]. Recent developments\nof this idea can be viewed as proximal based methods for solving the generic composite convex\nminimization template with a single non-smooth term [2, 9, 12, 13, 15, 16, 19, 26, 23]. This generic\nform arises naturally in regularized or constrained composite problems [3, 13, 20], where the smooth\nterm typically encodes the data \ufb01delity. These methods require the evaluation of the joint prox of f\nand g when applied to the three-composite template (1).\nUnfortunately, evaluation of the joint prox is arguably more expensive compared to the individual\nprox operators. To make comparison stark, consider the simple example where f and g are indicator\nfunctions for two convex sets. Even if the projection onto the individual sets are easy to compute,\nprojection onto the intersection of these sets can be challenging.\nRelated literature also contains algorithms that solve some speci\ufb01c instances of template (1). To point\nout a few, random averaging projection method [28] handles multiple constraints simultaneously\nbut cannot deal with regularizers. On the other hand, accelerated stochastic gradient descent with\nproximal average [29] can handle multiple regularizers simultaneously, but the algorithm imposes a\nLipschitz condition on regularizers, and hence, it cannot deal with constraints.\nTo our knowledge, our method is the \ufb01rst operator splitting framework that can tackle optimization\ntemplate (1) using the stochastic gradient estimate of h and the proximal operators of f and g\nseparately, without any restriction on the non-smooth parts except that their subdifferentials are\nmaximally monotone. 
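As an illustration of the splitting just described, the following self-contained Python sketch runs the S3CM iteration (Algorithm 1) on a toy instance of our own construction: h(x) = (1/2)‖x − a‖² with exact gradients (the deterministic reduction of Remark 2), g the indicator of the hyperplane {x | 1ᵀx = 1}, and f the indicator of the box [0, 1]^d, with a fixed learning rate inside ]0, 2/L[ as in Remark 7. The toy data and function names are ours, not the paper's.

```python
import numpy as np

def prox_box(x):                      # prox of the indicator of [0, 1]^d
    return np.clip(x, 0.0, 1.0)

def prox_sum_one(x):                  # prox of the indicator of {x : sum(x) = 1}
    return x + (1.0 - x.sum()) / x.size

def s3cm(grad_h, prox_f, prox_g, x0, gamma=1.0, iters=2000):
    """S3CM (Algorithm 1) with a fixed learning rate and exact gradients."""
    xf = x0.copy()
    xg = prox_g(xf)                   # x_{g,0}
    u = (xf - xg) / gamma             # u_{g,0}
    for _ in range(iters):
        xg = prox_g(xf + gamma * u)                       # x_{g,n+1}
        u = (xf - xg) / gamma + u                         # u_{g,n+1}
        xf = prox_f(xg - gamma * u - gamma * grad_h(xg))  # x_{f,n+1}
    return xg

a = np.array([0.9, 0.2, -0.5])
x_star = s3cm(lambda x: x - a, prox_box, prox_sum_one, np.zeros(3))
# x_star approaches the projection of a onto the capped simplex, [0.85, 0.15, 0.0]
```

In a stochastic run, `grad_h(xg)` would be replaced by an unbiased estimate r and `gamma` by a decreasing sequence as in Theorem 2; the exact-gradient version above is used so the result can be checked against the known minimizer.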
When h is strongly convex, under mild assumptions, and with a proper learning\nrate, our algorithm converges with O(1/n) rate, which is optimal for the stochastic methods under\nstrong convexity assumption for this problem class.\n\n5 Numerical experiments\n\nWe present numerical evidence to assess the theoretical convergence guarantees of the proposed\nalgorithm. We provide two numerical examples from Markowitz portfolio optimization and support\nvector machines.\nAs a baseline, we use the deterministic three-operator splitting method [11]. Even though the random\naveraging projection method proposed in [28] does not apply to our template (1) with its all generality,\nit does for the speci\ufb01c applications that we present below. In our numerical tests, however, we\nobserved that this method exhibits essentially the same convergence behavior as ours when used\nwith the same learning rate sequence. For the clarity of the presentation, we omit this method in our\nresults.\n\n5.1 Portfolio optimization\n\nTraditional Markowitz portfolio optimization aims to reduce risk by minimizing the variance for a\ngiven expected return. Mathematically, we can formulate this as a convex optimization problem [6]:\n\nE(cid:2)|aT\n\ni x \u2212 b|2(cid:3)\n\nminimize\n\nx\u2208Rd\n\nsubject to x \u2208 \u2206, aT\n\nav x \u2265 b,\n\nwhere \u2206 is the standard simplex for portfolios with no-short positions or a simple sum constraint,\naav = E [ai] is the average returns for each asset that is assumed to be known (or estimated), and b\nencodes a minimum desired return.\nThis problem has a streaming nature where new data points arrive in time. Hence, we typically do not\nhave access to the whole dataset, and the stochastic setting is more favorable. 
For implementation,\n\n5\n\n\fwe replace the expectation with the empirical sample average:\n\nminimize\n\nx\u2208Rd\n\n1\np\n\ni x \u2212 b)2\n\n(aT\n\nsubject to x \u2208 \u2206, aT\n\nav x \u2265 b.\n\n(5)\n\np(cid:88)\n\ni=1\n\np(cid:88)\n\ni=1\n\nThis problem \ufb01ts into our optimization template (1) by setting\n\nh(x) =\n\n1\np\n\ni x \u2212 b)2,\n\n(aT\n\ng(x) = \u03b9\u2206(x),\n\nand\n\nf (x) = \u03b9{x | aT\n\navx\u2265b}(x).\n\nWe compute the unbiased estimates of the gradient by rn = 2(aT\nin\nchosen uniformly random.\nWe use 5 different real portfolio datasets: Dow Jones industrial average (DJIA, with 30 stocks for\n507 days), New York stock exchange (NYSE, with 36 stocks for 5651 days), Standard & Poor\u2019s 500\n(SP500, with 25 stocks for 1276 days), Toronto stock exchange (TSE, with 88 stocks for 1258 days)\nthat are also considered in [4]; and one dataset by Fama and French (FF100, 100 portfolios formed\non size and book-to-market, 23,647 days) that is commonly used in \ufb01nancial literature, e.g., [6, 14].\nWe impute the missing data in FF100 using nearest-neighbor method with Euclidean distance.\n\nx \u2212 b)ain, where index in is\n\nFigure 1: Comparison of the deterministic three-operators splitting method [11, Algorithm 2] and\nour stochastic three-composite minimization method (S3CM) for Markowitz portfolio optimization\n(5). Results are averaged over 100 Monte-Carlo simulations, and the boundaries of the shaded area\nare the best and worst instances.\n\nFor the deterministic algorithm, we set \u03b7 = 0.1. We evaluate the Lipschitz constant L and the strong\nconvexity parameter \u00b5h to determine the step-size. For the stochastic algorithm, we do not have\naccess to the whole data, so we cannot compute these parameter. Hence, we adopt the learning\nrate sequence de\ufb01ned in Theorem 2. 
We simply use \u03b3n = \u03b30/(n + 1) with \u03b30 = 1 for FF100, and\n\u03b30 = 103 for others.1 We start both algorithms from the zero vector.\n\n1Note that a \ufb01ne-tuned learning rate with a more complex de\ufb01nition can improve the empirical performance,\n\ne.g., \u03b3n = \u03b30/(n + \u03b6) for some positive constants \u03b30 and \u03b6.\n\n6\n\n\fWe split all the datasets into test (10%) and train (90%) partitions randomly. We set the desired\nreturn as the average return over all assets in the training set, b = mean(aav). Other b values exhibit\nqualitatively similar behavior.\nThe results of this experiment are compiled in Figure 1. We compute the objective function over\nthe datapoints in the test partition, htest. We compare our algorithm against the deterministic three-\noperator splitting method [11, Algorithm 2]. Since we seek statistical solutions, we compare the\nalgorithms to achieve low to medium accuracy. [11] provides other variants of the deterministic algo-\nrithm, including two ergodic averaging schemes that feature improved theoretical rate of convergence.\nHowever, these variants performed worse in practice than the original method, and are omitted.\nSolid lines in Figure 1 present the average results over 100 Monte-Carlo simulations, and the\nboundaries of the shaded area are the best and worst instances. We also assess empirical evidence of\nthe O(1/n) convergence rate guaranteed in Theorem 2, by presenting squared relative distance to the\noptimum solution for FF100 dataset. Here, we approximate the ground truth by solving the problem\nto high accuracy with the deterministic algorithm for 105 iterations.\n\n5.2 Nonlinear support vector machines classi\ufb01cation\n\nThis section demonstrates S3CM on a support vector machines (SVM) for binary classi\ufb01cation\nproblem. We are given a training set A = {a1, a2, . . . , ad} and the corresponding class labels\n{b1, b2, . . . , bd}, where ai \u2208 Rp and bi \u2208 {\u22121, 1}. 
The goal is to build a model that correctly assigns new\nexamples to one class or the other.\nAs is common in practice, we solve the dual soft-margin SVM formulation:\n\nminimize_{x\u2208Rd} (1/2) \u2211_{i=1}^d \u2211_{j=1}^d K(ai, aj)bibjxixj \u2212 \u2211_{i=1}^d xi subject to x \u2208 [0, C]d, bT x = 0,\n\nwhere C \u2208 [0, +\u221e[ is the penalty parameter and K : Rp \u00d7 Rp \u2192 R is a kernel function. In our\nexample we use the Gaussian kernel given by K\u03c3(ai, aj) = exp(\u2212\u03c3\u2016ai \u2212 aj\u2016\u00b2) for some \u03c3 > 0.\nDefine the symmetric positive semidefinite matrix M \u2208 Rd\u00d7d with entries Mij = K\u03c3(ai, aj)bibj.\nThen the problem takes the form\n\nminimize_{x\u2208Rd} (1/2) xT M x \u2212 \u2211_{i=1}^d xi subject to x \u2208 [0, C]d, bT x = 0.\n\n(6)\n\nThis problem fits into the three-composite optimization template (1) with\n\nh(x) = (1/2) xT M x \u2212 \u2211_{i=1}^d xi, g(x) = \u03b9[0,C]d(x), and f (x) = \u03b9{x | bT x=0}(x).\n\nOne can solve this problem using the three-operator splitting method [11, Algorithm 1]. Note that proxf\nand proxg, which are projections onto the corresponding constraint sets, incur O(d) computational\ncost, whereas the cost of computing the gradient is O(d\u00b2).\nTo compute an unbiased gradient estimate, we choose an index in uniformly at random, and we form\nrn = dM_in x_in \u2212 1. Here M_in denotes the in-th column of the matrix M, and 1 represents the vector of ones.\nWe can compute rn in O(d) operations; hence each iteration of S3CM is an order of magnitude cheaper than an iteration of the deterministic algorithm.\nWe use the UCI machine learning dataset \u201ca1a\u201d, with d = 1605 datapoints and p = 123 features [8, 18].\nNote that our goal here is to demonstrate the optimization performance of our algorithm on a real-world problem, rather than to compete with the prediction quality of the best engineered solvers. 
Hence,\nto keep experiments simple, we \ufb01x problem parameters C = 1 and \u03c3 = 2\u22122, and we focus on the\neffects of algorithmic parameters on the convergence behavior.\nSince p < d, M is rank de\ufb01cient and h is not strongly convex. Nevertheless we use S3CM with the\nlearning rate \u03b3n = \u03b30/(n + 1) for various values of \u03b30. We observe O(1/n) empirical convergence\nrate on the squared relative error for large enough \u03b30, which is guaranteed under restricted strong\nconvexity assumption. See Figure 2 for the results.\n\n7\n\n\f[Left] Convergence of S3CM in the squared relative error with learning rate\nFigure 2:\n\u03b3n = \u03b30/(n + 1). [Right] Comparison of the deterministic three-operators splitting method [11,\nAlgorithm 1] and S3CM with \u03b30 = 1 for SVM classi\ufb01cation problem. Results are averaged over 100\nMonte-Carlo simulations. Boundaries of the shaded area are the best and worst instances.\n\nAcknowledgments\n\nThis work was supported in part by ERC Future Proof, SNF 200021-146750, SNF CRSII2-147633,\nand NCCR-Marvel.\n\nReferences\n[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence of gradient methods\n\nfor high-dimensional statistical recovery. Ann. Stat., 40(5):2452\u20132482, 2012.\n\n[2] Y. F. Atchad\u00e9, G. Fort, and E. Moulines. On stochastic proximal gradient algorithms.\n\narXiv:1402.2365v2, 2014.\n\n[3] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert\n\nspaces. Springer-Verlag, 2011.\n\n[4] A. Borodin, R. El-Yaniv, and V. Gogan. Can we learn to beat the best stock. In Advances in\n\nNeural Information Processing Systems 16, pages 345\u2013352. 2004.\n\n[5] L. M. Brice\u00f1o-Arias. Forward-Douglas\u2013Rachford splitting and forward-partial inverse method\n\nfor solving monotone inclusions. Optimization, 64(5):1239\u20131261, 2015.\n\n[6] J. Brodie, I. Daubechies, C. de Mol, D. Giannone, and I. Loris. 
Sparse and stable Markowitz\n\nportfolios. Proc. Natl. Acad. Sci., 106:12267\u201312272, 2009.\n\n[7] V. Cevher, B. C. V\u02dcu, and A. Yurtsever. Stochastic forward\u2013Douglas\u2013Rachford splitting for\n\nmonotone inclusions. EPFL-Report-215759, 2016.\n\n[8] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell.\n\nSyst. Technol., 2(3):27:1\u201327:27, 2011.\n\n[9] P. L. Combettes and J.-C. Pesquet. Stochastic approximations and perturbations in forward-\n\nbackward splitting for monotone operators. arXiv:1507.07095v1, 2015.\n\n[10] P. L. Combettes and B. C. V\u02dcu. Variable metric forward\u2013backward splitting with applications to\n\nmonotone inclusions in duality. Optimization, 63(9):1289\u20131318, 2014.\n\n[11] D. Davis and W. Yin. A three-operator splitting scheme and its optimization applications.\n\narXiv:1504.01032v1, 2015.\n\n[12] O. Devolder. Stochastic \ufb01rst order methods in smooth convex optimization. Technical report,\n\nCenter for Operations Research and Econometrics, 2011.\n\n8\n\n\f[13] J. Duchi and Y. Singer. Ef\ufb01cient online and batch learning using forward backward splitting. J.\n\nMach. Learn. Res., 10:2899\u20132934, 2009.\n\n[14] E. F. Fama and K. R. French. Multifactor explanations of asset pricing anomalies. Journal of\n\nFinance,, 51:55\u201384, 1996.\n\n[15] C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and\nonline learning. In Advances in Neural Information Processing Systems 22, pages 781\u2013789.\n2009.\n\n[16] G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133(1):365\u2013\n\n397, 2012.\n\n[17] M. Ledoux and M. Talagrand. Probability in Banach spaces: Isoperimetry and processes.\n\nSpringer-Verlag, 1991.\n\n[18] M. Lichman. UCI machine learning repository. University of California, Irvine, School of\n\nInformation and Computer Sciences, 2013.\n\n[19] Q. Lin, X. Chen, and J. Pe\u00f1a. 
A smoothing stochastic gradient method for composite optimiza-\n\ntion. Optimization Methods and Software, 29(6):1281\u20131301, 2014.\n\n[20] S. Mosci, L. Rosasco, M. Santoro, A. Verri, and S. Villa. Solving structured sparsity regulariza-\ntion with proximal methods. In European Conf. Machine Learning and Principles and Practice\nof Knowledge Discovery, pages 418\u2013433, 2010.\n\n[21] S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar. A uni\ufb01ed framework for high-\ndimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural\nInformation Processing Systems 22, pages 1348\u20131356, 2009.\n\n[22] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with\nLipschitz continuous monotone operators and smooth convex-concave saddle point problems.\nSIAM J. on Optimization, 15(1):229\u2013251, 2005.\n\n[23] A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in\n\nNeural Information Processing Systems 27, pages 1574\u20131582. 2014.\n\n[24] H. Raguet, J. Fadili, and G. Peyr\u00e9. A generalized forward-backward splitting. SIAM Journal on\n\nImaging Sciences, 6(3):1199\u20131226, 2013.\n\n[25] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400\u2013\n\n407, 1951.\n\n[26] L. Rosasco, S. Villa, and B. C. V\u02dcu. Convergence of stochastic proximal gradient algorithm.\n\narXiv:1403.5074v3, 2014.\n\n[27] B. C. V\u02dcu. Almost sure convergence of the forward\u2013backward\u2013forward splitting algorithm.\n\nOptimization Letters, 10(4):781\u2013803, 2016.\n\n[28] M. Wang, Y. Chen, J. Liu, and Y. Gu. Random multi\u2013constraint projection: Stochastic gradient\n\nmethods for convex optimization with many constraints. arXiv:1511.03760v1, 2015.\n\n[29] W. Zhong and J. Kwok. Accelerated stochastic gradient method for composite regularization. J.\n\nMach. Learn. 
Res., 33:1086\u20131094, 2014.\n\n9\n\n\f", "award": [], "sourceid": 2141, "authors": [{"given_name": "Alp", "family_name": "Yurtsever", "institution": "EPFL"}, {"given_name": "Bang Cong", "family_name": "Vu", "institution": "LIONS, EPFL"}, {"given_name": "Volkan", "family_name": "Cevher", "institution": "EPFL"}]}