{"title": "Stochastic Variance Reduction Methods for Saddle-Point Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1416, "page_last": 1424, "abstract": "We consider convex-concave saddle-point problems where the objective functions may be split in many components, and extend recent stochastic variance reduction methods (such as SVRG or SAGA) to provide the first large-scale linearly convergent algorithms for this class of problems which are common in machine learning. While the algorithmic extension is straightforward, it comes with challenges and opportunities: (a) the convex minimization analysis does not apply and we use the notion of monotone operators to prove convergence, showing in particular that the same algorithm applies to a larger class of problems, such as variational inequalities, (b) there are two notions of splits, in terms of functions, or in terms of partial derivatives, (c) the split does need to be done with convex-concave terms, (d) non-uniform sampling is key to an efficient algorithm, both in theory and practice, and (e) these incremental algorithms can be easily accelerated using a simple extension of the \"catalyst\" framework, leading to an algorithm which is always superior to accelerated batch algorithms.", "full_text": "Stochastic Variance Reduction Methods\n\nfor Saddle-Point Problems\n\nP. 
Balamurugan\n\nINRIA - Ecole Normale Supérieure, Paris\nbalamurugan.palaniappan@inria.fr\n\nFrancis Bach\n\nINRIA - Ecole Normale Supérieure, Paris\nfrancis.bach@ens.fr\n\nAbstract\n\nWe consider convex-concave saddle-point problems where the objective functions may be split in many components, and extend recent stochastic variance reduction methods (such as SVRG or SAGA) to provide the first large-scale linearly convergent algorithms for this class of problems which are common in machine learning. While the algorithmic extension is straightforward, it comes with challenges and opportunities: (a) the convex minimization analysis does not apply and we use the notion of monotone operators to prove convergence, showing in particular that the same algorithm applies to a larger class of problems, such as variational inequalities, (b) there are two notions of splits, in terms of functions, or in terms of partial derivatives, (c) the split does need to be done with convex-concave terms, (d) non-uniform sampling is key to an efficient algorithm, both in theory and practice, and (e) these incremental algorithms can be easily accelerated using a simple extension of the “catalyst” framework, leading to an algorithm which is always superior to accelerated batch algorithms.\n\n1 Introduction\n\nWhen using optimization in machine learning, leveraging the natural separability of the objective functions has led to many algorithmic advances; the most common example is the separability as a sum of individual loss terms corresponding to individual observations, which leads to stochastic gradient descent techniques. Several lines of work have shown that the plain Robbins-Monro algorithm could be accelerated for strongly-convex finite sums, e.g., SAG [1], SVRG [2], SAGA [3]. 
However, these only apply to separable objective functions.\n\nIn order to tackle non-separable losses or regularizers, we consider the saddle-point problem:\n\nmin_{x∈R^d} max_{y∈R^n} K(x, y) + M(x, y),   (1)\n\nwhere the functions K and M are “convex-concave”, that is, convex with respect to the first variable, and concave with respect to the second variable, with M potentially non-smooth but “simple” (e.g., for which the proximal operator is easy to compute), and K smooth. These problems occur naturally within convex optimization through Lagrange or Fenchel duality [4]; for example the bilinear saddle-point problem min_{x∈R^d} max_{y∈R^n} f(x) + y^⊤Kx − g(y) corresponds to a supervised learning problem with design matrix K, a loss function g^* (the Fenchel conjugate of g) and a regularizer f.\n\nWe assume that the function K may be split into a potentially large number of components. Many problems in machine learning exhibit that structure in the saddle-point formulation, but not in the associated convex minimization and concave maximization problems (see examples in Section 2.1).\n\nLike for convex minimization, gradient-based techniques that are blind to this separable structure need to access all the components at every iteration. We show that algorithms such as SVRG [2] and SAGA [3] may be readily extended to the saddle-point problem. While the algorithmic extension is straightforward, it comes with challenges and opportunities. We make the following contributions:\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n– We provide the first convergence analysis for these algorithms for saddle-point problems, which differs significantly from the associated convex minimization set-up. 
In particular, we use in Section 6 the interpretation of saddle-point problems as finding the zeros of a monotone operator, and only use the monotonicity properties to show linear convergence of our algorithms, thus showing that they extend beyond saddle-point problems, e.g., to variational inequalities [5, 6].\n\n– We show that the saddle-point formulation (a) allows two different notions of splits, in terms of functions, or in terms of partial derivatives, (b) does need splits into convex-concave terms (as opposed to convex minimization), and (c) that non-uniform sampling is key to an efficient algorithm, both in theory and practice (see experiments in Section 7).\n\n– We show in Section 5 that these incremental algorithms can be easily accelerated using a simple extension of the “catalyst” framework of [7], thus leading to an algorithm which is always superior to accelerated batch algorithms.\n\n2 Composite Decomposable Saddle-Point Problems\n\nWe now present our new algorithms on saddle-point problems and show a natural extension to monotone operators later in Section 6. We thus consider the saddle-point problem defined in Eq. (1), with the following assumptions:\n\n(A) M is strongly (λ, γ)-convex-concave, that is, the function (x, y) ↦ M(x, y) − (λ/2)‖x‖^2 + (γ/2)‖y‖^2 is convex-concave. Moreover, we assume that we may compute the proximal operator of M, i.e., for any (x′, y′) ∈ R^{n+d} (σ is the step-length parameter associated with the prox operator):\n\nprox^σ_M(x′, y′) = arg min_{x∈R^d} max_{y∈R^n} σM(x, y) + (λ/2)‖x − x′‖^2 − (γ/2)‖y − y′‖^2.   (2)\n\nThe values of λ and γ lead to the definition of a weighted Euclidean norm on R^{n+d} defined as Ω(x, y)^2 = λ‖x‖^2 + γ‖y‖^2, with dual norm defined through Ω^*(x, y)^2 = λ^{−1}‖x‖^2 + γ^{−1}‖y‖^2. Dealing with the two different scaling factors λ and γ is crucial for good performance, as these may be very different, depending on the many arbitrary ways to set-up a saddle-point problem.\n\n(B) K is convex-concave and has Lipschitz-continuous gradients; it is natural to consider the gradient operator B : R^{n+d} → R^{n+d} defined as B(x, y) = (∂_x K(x, y), −∂_y K(x, y)) ∈ R^{n+d} and to consider L = sup_{Ω(x−x′, y−y′)=1} Ω^*(B(x, y) − B(x′, y′)). The quantity L represents the condition number of the problem.\n\n(C) The vector-valued function B(x, y) = (∂_x K(x, y), −∂_y K(x, y)) ∈ R^{n+d} may be split into a family of vector-valued functions as B = Σ_{i∈I} B_i, where the only constraint is that each B_i is Lipschitz-continuous (with constant L_i). There is no need to assume the existence of a function K_i : R^{n+d} → R such that B_i = (∂_x K_i, −∂_y K_i). We will also consider splits which are adapted to the saddle-point nature of the problem, that is, of the form B(x, y) = (Σ_{k∈K} B^x_k(x, y), Σ_{j∈J} B^y_j(x, y)), which is a subcase of the above with I = J × K, B_jk(x, y) = (p_j B^x_k(x, y), q_k B^y_j(x, y)), for p and q sequences that sum to one. This substructure, which we refer to as “factored”, will only make a difference when storing the values of these operators in Section 4 for our SAGA algorithm.\n\nGiven assumptions (A)-(B), the saddle-point problem in Eq. (1) has a unique solution (x^*, y^*) such that K(x^*, y) + M(x^*, y) ≤ K(x^*, y^*) + M(x^*, y^*) ≤ K(x, y^*) + M(x, y^*), for all (x, y); moreover min_{x∈R^d} max_{y∈R^n} K(x, y) + M(x, y) = max_{y∈R^n} min_{x∈R^d} K(x, y) + M(x, y) (see, e.g., [8, 4]).\n\nThe main generic examples for the functions K(x, y) and M(x, y) are:\n\n– Bilinear saddle-point problems: K(x, y) = y^⊤Kx for a matrix K ∈ R^{n×d} (we identify here a matrix with the associated bilinear function), for which the vector-valued function B(x, y) is linear, i.e., B(x, y) = (K^⊤y, −Kx). Then L = ‖K‖_op/√(γλ), where ‖K‖_op is the largest singular value of K. There are two natural potential splits with I = {1, …, n} × {1, …, d}, with B = Σ_{j=1}^n Σ_{k=1}^d B_jk: (a) the split into individual elements B_jk(x, y) = K_jk(y_j, −x_k), where every element is the gradient operator of a bi-linear function, and (b) the “factored” split into rows/columns B_jk(x, y) = (q_k y_j K_{j·}^⊤, −p_j x_k K_{·k}), where K_{j·} and K_{·k} are the j-th row and k-th column of K, p and q are any set of vectors summing to one, and every element is not the gradient operator of any function. These splits correspond to several “sketches” of the matrix K [9], adapted to subsampling of K, but other sketches could be considered.\n\n– Separable functions: M(x, y) = f(x) − g(y) where f is any λ-strongly-convex function and g is γ-strongly-convex, for which the proximal operators prox^σ_f(x′) = arg min_{x∈R^d} σf(x) + (λ/2)‖x − x′‖^2 and prox^σ_g(y′) = arg max_{y∈R^n} −σg(y) − (γ/2)‖y − y′‖^2 are easy to compute. In this situation, prox^σ_M(x′, y′) = (prox^σ_f(x′), prox^σ_g(y′)). Following the usual set-up of composite optimization [10], no smoothness assumption is made on M and hence on f or g.\n\n2.1 Examples in machine learning\n\nMany learning problems are formulated as convex optimization problems, and hence by duality as saddle-point problems. We now give examples where our new algorithms are particularly adapted.\n\nSupervised learning with non-separable losses or regularizers. For regularized linear supervised learning, with n d-dimensional observations put in a design matrix K ∈ R^{n×d}, the predictions are parameterized by a vector x ∈ R^d and lead to a vector of predictions Kx ∈ R^n. Given a loss function defined through its Fenchel conjugate g^* from R^n to R, and a regularizer f(x), we obtain exactly a bi-linear saddle-point problem. When the loss g^* or the regularizer f is separable, i.e., a sum of functions of individual variables, we may apply existing fast gradient-techniques [1, 2, 3] to the primal problem min_{x∈R^d} g^*(Kx) + f(x) or the dual problem max_{y∈R^n} −g(y) − f^*(K^⊤y), as well as methods dedicated to separable saddle-point problems [11, 12]. 
When the loss g^* and the regularizer f are not separable (but have a simple proximal operator), our new fast algorithms are the only ones that can be applied from the class of large-scale linearly convergent algorithms.\n\nNon-separable losses may occur when (a) predicting by affine functions of the inputs and not penalizing the constant terms (in this case defining the loss functions as the minimum over the constant term, which becomes non-separable) or (b) using structured output prediction methods that lead to convex surrogates to the area under the ROC curve (AUC) or other precision/recall quantities [13, 14]. These often come with efficient proximal operators (see Section 7 for an example).\n\nNon-separable regularizers with available efficient proximal operators are numerous, such as grouped-norms with potentially overlapping groups, norms based on submodular functions, or total variation (see [15] and references therein, and an example in Section 7).\n\nRobust optimization. The framework of robust optimization [16] aims at optimizing an objective function with uncertain data. Given that the aim is then to minimize the maximal value of the objective function given the uncertainty, this leads naturally to saddle-point problems.\n\nConvex relaxation of unsupervised learning. Unsupervised learning leads to convex relaxations which often exhibit structures naturally amenable to saddle-point problems, e.g., for discriminative clustering [17] or matrix factorization [18].\n\n2.2 Existing batch algorithms\n\nIn this section, we review existing algorithms aimed at solving the composite saddle-point problem in Eq. (1), without using the sum-structure. Note that it is often possible to apply batch algorithms for the associated primal or dual problems (which are not separable in general).\n\nForward-backward (FB) algorithm. 
The main iteration is\n\n(x_t, y_t) = prox^σ_M[(x_{t−1}, y_{t−1}) − σ diag(1/λ, 1/γ) B(x_{t−1}, y_{t−1})]\n= prox^σ_M(x_{t−1} − σλ^{−1}∂_x K(x_{t−1}, y_{t−1}), y_{t−1} + σγ^{−1}∂_y K(x_{t−1}, y_{t−1})),\n\nwhere diag(1/λ, 1/γ) scales the x-part by 1/λ and the y-part by 1/γ. The algorithm aims at simultaneously minimizing with respect to x while maximizing with respect to y (when M(x, y) is the sum of isotropic quadratic terms and indicator functions, we get simultaneous projected gradient descents). This algorithm is known not to converge in general [8], but is linearly convergent for strongly-convex-concave problems, when σ = 1/L^2, with the rate (1 − 1/(1 + L^2))^t [19] (see simple proof in Appendix B.1). This is the one we are going to adapt to stochastic variance reduction.\n\nWhen M(x, y) = f(x) − g(y), we obtain the two parallel updates x_t = prox^σ_f(x_{t−1} − λ^{−1}σ∂_x K(x_{t−1}, y_{t−1})) and y_t = prox^σ_g(y_{t−1} + γ^{−1}σ∂_y K(x_{t−1}, y_{t−1})), which can be done serially by replacing the second one by y_t = prox^σ_g(y_{t−1} + γ^{−1}σ∂_y K(x_t, y_{t−1})). This is often referred to as the Arrow-Hurwicz method (see [20] and references therein).\n\nAccelerated forward-backward algorithm. The forward-backward algorithm may be accelerated by a simple extrapolation step, similar to Nesterov’s acceleration for convex minimization [21]. The algorithm from [20], which only applies to bilinear functions K, and which we extend from separable M to our more general set-up in Appendix B.2, has the following iteration:\n\n(x_t, y_t) = prox^σ_M[(x_{t−1}, y_{t−1}) − σ diag(1/λ, 1/γ) B(x_{t−1} + θ(x_{t−1} − x_{t−2}), y_{t−1} + θ(y_{t−1} − y_{t−2}))].\n\nWith σ = 1/(2L) and θ = L/(L + 1), we get an improved convergence rate, where (1 − 1/(1 + L^2))^t is replaced by (1 − 1/(1 + 2L))^t. This is always a strong improvement when L is large (ill-conditioned problems), as illustrated in Section 7. Note that our acceleration technique in Section 5 may be extended to get a similar rate for the batch set-up for non-linear K.\n\n2.3 Existing stochastic algorithms\n\nForward-backward algorithms have been studied with added noise [22], leading to a convergence rate in O(1/t) after t iterations for strongly-convex-concave problems. In our setting, we replace B(x, y) in our algorithm with (1/π_i) B_i(x, y), where i ∈ I is sampled from the probability vector (π_i)_i (good probability vectors will depend on the application, see below for bilinear problems). We have E[(1/π_i) B_i(x, y)] = B(x, y); the main iteration is then\n\n(x_t, y_t) = prox^{σ_t}_M[(x_{t−1}, y_{t−1}) − σ_t diag(1/λ, 1/γ) (1/π_{i_t}) B_{i_t}(x_{t−1}, y_{t−1})],\n\nwith i_t selected independently at random in I with probability vector π. In Appendix C, we show that using σ_t = 2/(t + 1 + 8L̄(π)^2) leads to a convergence rate in O(1/t), where L̄(π) is a smoothness constant made explicit below. For saddle-point problems, it leads to the complexities shown in Table 1. Like for convex minimization, it is fast early on but the performance levels off. 
Such schemes are typically used in sublinear algorithms [23].\n\n2.4 Sampling probabilities, convergence rates and running-time complexities\n\nIn order to characterize running-times, we denote by T(A) the complexity of computing A(x, y) for any operator A and (x, y) ∈ R^{n+d}, while we denote by T_prox(M) the complexity of computing prox^σ_M(x, y). In our motivating example of bilinear functions K(x, y), we assume that T_prox(M) takes time proportional to n + d and getting a single element of K is O(1).\n\nIn order to characterize the convergence rate, we need the Lipschitz-constant L (which happens to be the condition number with our normalization) defined earlier as well as a smoothness constant adapted to our sampling schemes:\n\nL̄(π)^2 = sup_{(x, y, x′, y′)} Σ_{i∈I} (1/π_i) Ω^*(B_i(x, y) − B_i(x′, y′))^2 such that Ω(x − x′, y − y′)^2 ≤ 1.\n\nWe always have the bounds L^2 ≤ L̄(π)^2 ≤ max_{i∈I} L_i^2 × Σ_{i∈I} (1/π_i). However, in structured situations (like in bilinear saddle-point problems), we get much improved bounds, as described below.\n\nBi-linear saddle-point. The constant L is equal to ‖K‖_op/√(λγ), and we will consider as well the Frobenius norm ‖K‖_F defined through ‖K‖_F^2 = Σ_{j,k} K_jk^2, and the norm ‖K‖_max defined as ‖K‖_max = max{sup_j (KK^⊤)_jj^{1/2}, sup_k (K^⊤K)_kk^{1/2}}. Among the norms above, we always have:\n\n‖K‖_max ≤ ‖K‖_op ≤ ‖K‖_F ≤ √(max{n, d}) ‖K‖_max ≤ √(max{n, d}) ‖K‖_op,   (3)\n\nwhich allows us to show below that some algorithms have better bounds than others.\n\nThere are several schemes to choose the probabilities π_jk (individual splits) and π_jk = p_j q_k (factored splits). For the factored formulation where we select random rows and columns, we consider the non-uniform schemes p_j = (KK^⊤)_jj/‖K‖_F^2 and q_k = (K^⊤K)_kk/‖K‖_F^2, leading to L̄(π) ≤ ‖K‖_F/√(λγ), or uniform, leading to L̄(π) ≤ √(max{n, d}) ‖K‖_max/√(λγ). For the individual formulation where we select random elements, we consider π_jk = K_jk^2/‖K‖_F^2, leading to L̄(π) ≤ √(max{n, d}) ‖K‖_F/√(λγ), or uniform, leading to L̄(π) ≤ √(nd) ‖K‖_max/√(λγ) (in these situations, it is important to select several elements simultaneously, which our analysis supports).\n\nWe characterize convergence with the quantity ε = Ω(x − x^*, y − y^*)^2/Ω(x_0 − x^*, y_0 − y^*)^2, where (x_0, y_0) is the initialization of our algorithms (typically (0, 0) for bilinear saddle-points). In Table 1 we give a summary of the complexity of all algorithms discussed in this paper: we recover the same type of speed-ups as for convex minimization. A few points are worth mentioning:\n\nAlgorithms | Complexity\nBatch FB | log(1/ε) × (nd + nd ‖K‖_op^2/(λγ))\nBatch FB-accelerated | log(1/ε) × (nd + nd ‖K‖_op/√(λγ))\nStochastic FB-non-uniform | (1/ε) × (max{n, d} ‖K‖_F^2/(λγ))\nStochastic FB-uniform | (1/ε) × (nd ‖K‖_max^2/(λγ))\nSAGA/SVRG-uniform | log(1/ε) × (nd + nd ‖K‖_max^2/(λγ))\nSAGA/SVRG-non-uniform | log(1/ε) × (nd + max{n, d} ‖K‖_F^2/(λγ))\nSVRG-non-uniform-accelerated | log(1/ε) × (nd + √(nd max{n, d}) ‖K‖_F/√(λγ))\n\nTable 1: Summary of convergence results for the strongly (λ, γ)-convex-concave bilinear saddle-point problem with matrix K and individual splits (and n + d updates per iteration). For factored splits (little difference), see Appendix D.4. For accelerated SVRG, we omitted the logarithmic term (see Section 5).\n\n– Given the bounds between the various norms on K in Eq. (3), SAGA/SVRG with non-uniform sampling always has convergence bounds superior to SAGA/SVRG with uniform sampling, which is always superior to batch forward-backward. Note however, that in practice, SAGA/SVRG with uniform sampling may be inferior to the accelerated batch method (see Section 7).\n\n– Accelerated SVRG with non-uniform sampling is the most efficient method, which is confirmed in our experiments. 
Note that if n = d, our bound is better than or equal to accelerated forward-backward, in exactly the same way as for regular convex minimization. There is thus a formal advantage for variance reduction.\n\n3 SVRG: Stochastic Variance Reduction for Saddle Points\n\nFollowing [2, 24], we consider a stochastic-variance reduced estimation of the finite sum B(x, y) = Σ_{i∈I} B_i(x, y). This is achieved by assuming that we have an iterate (x̃, ỹ) with a known value of B(x̃, ỹ), and we consider the estimate of B(x, y):\n\nB(x̃, ỹ) + (1/π_i) B_i(x, y) − (1/π_i) B_i(x̃, ỹ),\n\nwhich has the correct expectation when i is sampled from I with probability π, but with a reduced variance. Since we need to refresh (x̃, ỹ) regularly, the algorithm works in epochs (we allow sampling m elements per update, i.e., a mini-batch of size m), with an algorithm that shares the same structure as SVRG for convex minimization; see Algorithm 1. Note that we provide an explicit number of iterations per epoch, proportional to (L^2 + 3L̄^2/m). We have the following theorem, shown in Appendix D.1 (see also a discussion of the proof in Section 6).\n\nTheorem 1 Assume (A)-(B)-(C). After v epochs of Algorithm 1, we have:\n\nE[Ω(x_v − x^*, y_v − y^*)^2] ≤ (3/4)^v Ω(x_0 − x^*, y_0 − y^*)^2.\n\nThe computational complexity to reach precision ε is proportional to [T(B) + (mL^2 + L̄^2) max_{i∈I} T(B_i) + (1 + L^2 + L̄^2/m) T_prox(M)] log(1/ε). Note that by taking the mini-batch size m large, we can alleviate the complexity of the proximal operator prox_M if too large. Moreover, if L^2 is too expensive to compute, we may replace it by L̄^2 but with a worse complexity bound.\n\nBilinear saddle-point problems. When using a mini-batch size m = 1 with the factored updates, or m = n + d for the individual updates, we get the same complexities proportional to [nd + max{n, d} ‖K‖_F^2/(λγ)] log(1/ε) for non-uniform sampling, which improve significantly over (non-accelerated) batch methods (see Table 1).\n\nAlgorithm 1 SVRG: Stochastic Variance Reduction for Saddle Points\nInput: Functions (K_i)_i, M, probabilities (π_i)_i, smoothness L̄(π) and L, iterate (x, y), number of epochs v, number of updates per iteration (mini-batch size) m\nSet σ = [L^2 + 3L̄^2/m]^{−1}\nfor u = 1 to v do\n  Initialize (x̃, ỹ) = (x, y) and compute B(x̃, ỹ)\n  for k = 1 to log 4 × (L^2 + 3L̄^2/m) do\n    Sample i_1, …, i_m ∈ I from the probability vector (π_i)_i with replacement\n    (x, y) ← prox^σ_M[(x, y) − σ diag(1/λ, 1/γ) (B(x̃, ỹ) + (1/m) Σ_{k=1}^m {(1/π_{i_k}) B_{i_k}(x, y) − (1/π_{i_k}) B_{i_k}(x̃, ỹ)})]\n  end for\nend for\nOutput: Approximate solution (x, y)\n\n4 SAGA: Online Stochastic Variance Reduction for Saddle Points\n\nFollowing [3], we consider a stochastic-variance reduced estimation of B(x, y) = Σ_{i∈I} B_i(x, y). This is achieved by assuming that we store values g_i = B_i(x_old(i), y_old(i)) for an old iterate (x_old(i), y_old(i)), and we consider the estimate of B(x, y):\n\nΣ_{j∈I} g_j + (1/π_i) B_i(x, y) − (1/π_i) g_i,\n\nwhich has the correct expectation when i is sampled from I with probability π. At every iteration, we also refresh the operator values g_i ∈ R^{n+d}, for the same index i or with a new index i sampled uniformly at random. This leads to Algorithm 2, and we have the following theorem showing linear convergence, proved in Appendix D.2. 
Note that for bi-linear saddle-points, the initialization at (0, 0) has zero cost (which is not possible for convex minimization).\n\nTheorem 2 Assume (A)-(B)-(C). After t iterations of Algorithm 2 (with the option of resampling when using non-uniform sampling), we have:\n\nE[Ω(x_t − x^*, y_t − y^*)^2] ≤ 2 (1 − (max{3|I|/(2m), 1 + L^2/μ^2 + 3L̄^2/(mμ^2)})^{−1})^t Ω(x_0 − x^*, y_0 − y^*)^2.\n\nResampling or re-using the same gradients. For the bound above to be valid for non-uniform sampling, like for convex minimization [25], we need to resample m operators after we make the iterate update. In our experiments, following [25], we considered a mixture of uniform and non-uniform sampling, without the resampling step.\n\nSAGA vs. SVRG. The difference between the two algorithms is the same as for convex minimization (see, e.g., [26] and references therein), that is, SVRG has no storage, but works in epochs and requires slightly more accesses to the oracles, while SAGA is a pure online method with fewer parameters but requires some storage (for bi-linear saddle-point problems, we only need to store O(n+d) elements for the factored splits, while we need O(dn) for the individual splits). Overall they have the same running-time complexity for individual splits; for factored splits, see Appendix D.4.\n\nFactored splits. 
When using factored splits, we need to store the two parts of the operator values separately and update them independently, leading in Theorem 2 to replacing |I| by max{|J|, |K|}.\n\nAlgorithm 2 SAGA: Online Stochastic Variance Reduction for Saddle Points\nInput: Functions (K_i)_i, M, probabilities (π_i)_i, smoothness L̄(π) and L, iterate (x, y), number of iterations t, number of updates per iteration (mini-batch size) m\nSet σ = [max{3|I|/(2m) − 1, L^2 + 3L̄^2/m}]^{−1}\nInitialize g_i = B_i(x, y) for all i ∈ I and G = Σ_{i∈I} g_i\nfor u = 1 to t do\n  Sample i_1, …, i_m ∈ I from the probability vector (π_i)_i with replacement\n  Compute h_k = B_{i_k}(x, y) for k ∈ {1, …, m}\n  (x, y) ← prox^σ_M[(x, y) − σ diag(1/λ, 1/γ) (G + (1/m) Σ_{k=1}^m {(1/π_{i_k}) h_k − (1/π_{i_k}) g_{i_k}})]\n  (optional) Sample i_1, …, i_m ∈ I uniformly with replacement\n  (optional) Compute h_k = B_{i_k}(x, y) for k ∈ {1, …, m}\n  Replace G ← G − Σ_{k=1}^m {g_{i_k} − h_k} and g_{i_k} ← h_k for k ∈ {1, …, m}\nend for\nOutput: Approximate solution (x, y)\n\n5 Acceleration\n\nFollowing the “catalyst” framework of [7], we consider a sequence of saddle-point problems with added regularization; namely, given (x̄, ȳ), we use SVRG to solve approximately\n\nmin_{x∈R^d} max_{y∈R^n} K(x, y) + M(x, y) + (λτ/2)‖x − x̄‖^2 − (γτ/2)‖y − ȳ‖^2,   (4)\n\nfor well-chosen τ and (x̄, ȳ). The main iteration of the algorithm differs from the original SVRG by the presence of the iterate (x̄, ȳ), which is updated regularly (after a precise number of epochs), and different step-sizes (see details in Appendix D.3). The complexity to get an approximate solution of Eq. (4) (forgetting the complexity of the proximal operator and for a single update), up to logarithmic terms, is proportional to T(B) + L̄^2(1 + τ)^{−2} max_{i∈I} T(B_i).\n\nThe key difference with the convex optimization set-up is that the analysis is simpler, without the need for Nesterov acceleration machinery [21] to define a good value of (x̄, ȳ); indeed, the solution of Eq. (4) is one iteration of the proximal-point algorithm, which is known to converge linearly [27] with rate (1 + τ^{−1})^{−1} = (1 − 1/(1 + τ)). Thus the overall complexity is up to logarithmic terms equal to T(B)(1 + τ) + L̄^2(1 + τ)^{−1} max_{i∈I} T(B_i). The trade-off in τ is optimal for 1 + τ = L̄ √(max_{i∈I} T(B_i)/T(B)), showing that there is a potential acceleration when L̄ √(max_{i∈I} T(B_i)/T(B)) ≥ 1, leading to a complexity L̄ √(T(B) max_{i∈I} T(B_i)).\n\nSince the SVRG algorithm already works in epochs, this leads to a simple modification where every log(1 + τ) epochs, we change the values of (x̄, ȳ). See Algorithm 3 in Appendix D.3. Moreover, we can adaptively update (x̄, ȳ) more aggressively to speed-up the algorithm.\n\nThe following theorem gives the convergence rate of the method (see proof in Appendix D.3). With the value of τ defined above (corresponding to τ = max{0, (‖K‖_F/√(λγ)) √(max{n^{−1}, d^{−1}}) − 1} for bilinear problems), we get the complexity L̄ √(T(B) max_{i∈I} T(B_i)), up to the logarithmic term log(1 + τ). For bilinear problems, this provides a significant acceleration, as shown in Table 1.\n\nTheorem 3 Assume (A)-(B)-(C). 
After v epochs of Algorithm 3, we have, for any positive v:\n\nE[Ω(x_v − x^*, y_v − y^*)^2] ≤ (1 − 1/(τ + 1))^v Ω(x_0 − x^*, y_0 − y^*)^2.\n\nWhile we provide a proof only for SVRG, the same scheme should work for SAGA. Moreover, the same idea also applies to the batch setting (by simply considering |I| = 1, i.e., a single function), leading to an acceleration, but now valid for all functions K (not only bilinear).\n\n6 Extension to Monotone Operators\n\nIn this paper, we have chosen to focus on saddle-point problems because of their ubiquity in machine learning. However, it turns out that our algorithm and, more importantly, our analysis extend to all set-valued monotone operators [8, 28]. We thus consider a maximal strongly-monotone operator A on a Euclidean space E, as well as a finite family of Lipschitz-continuous (not necessarily monotone) operators B_i, i ∈ I, with B = Σ_{i∈I} B_i monotone. Our algorithm then finds the zeros of A + Σ_{i∈I} B_i = A + B, from the knowledge of the resolvent (“backward”) operator (I + σA)^{−1} (for a well chosen σ > 0) and the forward operators B_i, i ∈ I. Note the difference with [29], which requires each B_i to be monotone with a known resolvent and A to be monotone and single-valued.\n\nThere are several interesting examples (to which our algorithms apply):\n\n– Saddle-point problems: We assume for simplicity that λ = γ = μ (this can be achieved by a simple change of variable). 
If we denote B(x, y) = (\u2202_x K(x, y), \u2212\u2202_y K(x, y)) and the multi-valued operator A(x, y) = (\u2202_x M(x, y), \u2212\u2202_y M(x, y)), then the proximal operator prox^\u03c3_M may be written as (\u00b5I + \u03c3A)^{\u22121}(\u00b5x, \u00b5y), and we recover exactly our framework from Section 2.\n\n\u2013 Convex minimization: A = \u2202g and B_i = \u2202f_i for a strongly-convex function g and smooth functions f_i: we recover proximal-SVRG [24] and SAGA [3], to solve min_{z\u2208E} g(z) + \u2211_{i\u2208I} f_i(z). However, this is a situation where the operators B_i have an extra property called co-coercivity [6], which we are not using because it is not satisfied for saddle-point problems. The extension of SAGA and SVRG to monotone operators was proposed earlier by [30], but only co-coercive operators are considered, and thus only convex minimization is considered (with important extensions beyond plain SAGA and SVRG), while our analysis covers a much broader set of problems. 
In particular, the step-sizes obtained with co-coercivity lead to divergence in the general setting.\nBecause we do not use co-coercivity, applying our results directly to convex minimization would give slower rates; however, as shown in Section 2.1, such problems can be easily cast as saddle-point problems if the proximal operators of the functions f_i are known, and we then get the same rates as existing fast techniques which are dedicated to this problem [1, 2, 3].\n\n\u2013 Variational inequality problems, which are notably common in game theory (see, e.g., [5]).\n\n7 Experiments\nWe consider binary classification problems with design matrix K and label vector in {\u22121, 1}^n, a non-separable strongly-convex regularizer with an efficient proximal operator (the sum of the squared norm \u03bb\u2016x\u2016^2/2 and the clustering-inducing term \u2211_{i\u2260j} |x_i \u2212 x_j|, for which the proximal operator may be computed in O(n log n) by isotonic regression [31]) and a non-separable smooth loss (a surrogate to the area under the ROC curve, defined as proportional to \u2211_{i+\u2208I+} \u2211_{i\u2212\u2208I\u2212} (1 \u2212 y_i + y_j)^2, where I+/I\u2212 are sets with positive/negative labels, for a vector of predictions y, for which an efficient proximal operator may be computed as well, see Appendix E).\nOur upper-bounds depend on the ratio \u2016K\u2016_F^2/(\u03bb\u03b3) where \u03bb is the regularization strength and \u03b3 \u2248 n in our setting where we minimize an average risk. Setting \u03bb = \u03bb0 = \u2016K\u2016_F^2/n^2 corresponds to a regularization proportional to the average squared radius of the data divided by n, which is standard in this setting [1]. We also experiment with smaller regularization (i.e., \u03bb/\u03bb0 = 10^{\u22121}), to make the problem more ill-conditioned (it turns out that the corresponding testing losses are sometimes slightly better). 
We consider two datasets, sido (n = 10142, d = 4932, non-separable losses and regularizers presented above) and rcv1 (n = 20242, d = 47236, separable losses and regularizer described in Appendix F, so that we can compare with SAGA run in the primal). We report below the squared distance to optimizers which appears in our bounds, as a function of the number of passes on the data (for more details and experiments with primal-dual gaps and testing losses, see Appendix F).\nUnless otherwise specified, we always use non-uniform sampling.\n\nWe see that uniform sampling for SAGA does not improve on batch methods, whereas SAGA and accelerated SVRG (with non-uniform sampling) improve significantly over the existing methods, with a stronger gain for the accelerated version for ill-conditioned problems (middle vs. left plot). On the right plot, we compare to primal methods on a separable loss, showing that primal methods (here \u201cfba-primal\u201d, which is Nesterov acceleration) that do not use separability (and can thus be applied in all cases) are inferior, while SAGA run on the primal remains faster (but cannot be applied for non-separable losses).\n\n8 Conclusion\nWe proposed the first linearly convergent incremental gradient algorithms for saddle-point problems, which improve both in theory and practice over existing batch or stochastic algorithms. 
While we currently need to know the strong convexity-concavity constants, we plan to explore in future work adaptivity to these constants, as already obtained for convex minimization [3], paving the way to an analysis without strong convexity-concavity.\n\n[Figure: distance to optimizers as a function of the number of passes on the data. Left: sido, \u03bb/\u03bb0 = 1.00. Middle: sido, \u03bb/\u03bb0 = 0.10. Right: rcv1, \u03bb/\u03bb0 = 1.00. Methods shown: fb-acc, fb-sto, saga, saga (unif), svrg, svrg-acc, fba-primal (and saga-primal on rcv1).]\n\nReferences\n[1] N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Adv. NIPS, 2012.\n[2] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Adv. NIPS, 2013.\n[3] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Adv. NIPS, 2014.\n[4] R. T. Rockafellar. Monotone operators associated with saddle-functions and minimax problems. Nonlinear Functional Analysis, 18(part 1):397\u2013407, 1970.\n[5] P. T. Harker and J.-S. Pang. Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications. Math. Prog., 48(1-3):161\u2013220, 1990.\n[6] D. L. Zhu and P. Marcotte. Co-coercivity and its role in the convergence of iterative schemes for solving variational inequalities. SIAM Journal on Optimization, 6(3):714\u2013726, 1996.\n[7] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Adv. NIPS, 2015.\n[8] H. H. Bauschke and P. L. Combettes. 
Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Science & Business Media, 2011.\n[9] D. Woodruff. Sketching as a tool for numerical linear algebra. Technical Report 1411.4357, arXiv, 2014.\n[10] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n[11] X. Zhu and A. J. Storkey. Adaptive stochastic primal-dual coordinate descent for separable saddle point problems. In Machine Learning and Knowledge Discovery in Databases, pages 645\u2013658. Springer, 2015.\n[12] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proc. ICML, 2015.\n[13] T. Joachims. A support vector method for multivariate performance measures. In Proc. ICML, 2005.\n[14] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Adv. NIPS, 1999.\n[15] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1\u2013106, 2012.\n[16] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.\n[17] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Adv. NIPS, 2004.\n[18] F. Bach, J. Mairal, and J. Ponce. Convex sparse matrix factorizations. Technical Report 0812.1869, arXiv, 2008.\n[19] G. H. G. Chen and R. T. Rockafellar. Convergence rates in forward-backward splitting. SIAM Journal on Optimization, 7(2):421\u2013444, 1997.\n[20] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120\u2013145, 2011.\n[21] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, 2004.\n[22] L. Rosasco, S. Villa, and B. C. V\u0169. 
A stochastic forward-backward splitting method for solving monotone inclusions in Hilbert spaces. Technical Report 1403.7999, arXiv, 2014.\n[23] K. L. Clarkson, E. Hazan, and D. P. Woodruff. Sublinear optimization for machine learning. Journal of the ACM (JACM), 59(5):23, 2012.\n[24] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057\u20132075, 2014.\n[25] M. Schmidt, R. Babanezhad, M. O. Ahmed, A. Defazio, A. Clifton, and A. Sarkar. Non-uniform stochastic average gradient method for training conditional random fields. In Proc. AISTATS, 2015.\n[26] R. Harikandeh, M. O. Ahmed, A. Virani, M. Schmidt, J. Kone\u010dn\u00fd, and S. Sallinen. Stop wasting my gradients: Practical SVRG. In Adv. NIPS, 2015.\n[27] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877\u2013898, 1976.\n[28] E. Ryu and S. Boyd. A primer on monotone operator methods. Appl. Comput. Math., 15(1):3\u201343, 2016.\n[29] H. Raguet, J. Fadili, and G. Peyr\u00e9. A generalized forward-backward splitting. SIAM Journal on Imaging Sciences, 6(3):1199\u20131226, 2013.\n[30] D. Davis. SMART: The stochastic monotone aggregated root-finding algorithm. Technical Report 1601.00698, arXiv, 2016.\n[31] X. Zeng and M. Figueiredo. Solving OSCAR regularization problems by fast approximate proximal splitting algorithms. Digital Signal Processing, 31:124\u2013135, 2014.", "award": [], "sourceid": 803, "authors": [{"given_name": "Balamurugan", "family_name": "Palaniappan", "institution": "INRIA"}, {"given_name": "Francis", "family_name": "Bach", "institution": "INRIA - Ecole Normale Superieure"}]}