{"title": "SEGA: Variance Reduction via Gradient Sketching", "book": "Advances in Neural Information Processing Systems", "page_first": 2082, "page_last": 2093, "abstract": "We propose a novel randomized first order optimization method---SEGA (SkEtched GrAdient method)---which progressively throughout its iterations builds a variance-reduced estimate of the gradient from random linear measurements (sketches) of the gradient provided at each iteration by an oracle. In each iteration, SEGA updates the current estimate of the gradient through a sketch-and-project operation using the information provided by the latest sketch, and this is subsequently used to compute an unbiased estimate of the true gradient through a random relaxation procedure. This unbiased estimate is then used to perform a gradient step. Unlike standard subspace descent methods, such as coordinate descent, SEGA can be used for optimization problems with a non-separable proximal term. We provide a general convergence analysis and prove linear convergence for strongly convex objectives. In the special case of coordinate sketches, SEGA can be enhanced with various techniques such as importance sampling, minibatching and acceleration, and its rate is up to a small constant factor identical to the best-known rate of coordinate descent.", "full_text": "SEGA: Variance Reduction via Gradient Sketching\n\nFilip Hanzely1\n\nKonstantin Mishchenko1\n\nPeter Richt\u00b4arik1,2,3\n\n1 King Abdullah University of Science and Technology, 2University of Edinburgh,\n\n3Moscow Institute of Physics and Technology\n\nAbstract\n\nWe propose a randomized \ufb01rst order optimization method\u2014SEGA (SkEtched\nGrAdient)\u2014which progressively throughout\nits iterations builds a variance-\nreduced estimate of the gradient from random linear measurements (sketches) of\nthe gradient. 
In each iteration, SEGA updates the current estimate of the gradi-\nent through a sketch-and-project operation using the information provided by the\nlatest sketch, and this is subsequently used to compute an unbiased estimate of\nthe true gradient through a random relaxation procedure. This unbiased estimate\nis then used to perform a gradient step. Unlike standard subspace descent meth-\nods, such as coordinate descent, SEGA can be used for optimization problems with\na non-separable proximal term. We provide a general convergence analysis and\nprove linear convergence for strongly convex objectives. In the special case of\ncoordinate sketches, SEGA can be enhanced with various techniques such as im-\nportance sampling, minibatching and acceleration, and its rate is up to a small\nconstant factor identical to the best-known rate of coordinate descent.\n\n1\n\nIntroduction\n\nConsider the optimization problem\n\nF (x)\n\ndef\n= f (x) + R(x),\n\n(1)\n\nmin\nx2Rn\n\nwhere f : Rn ! R is smooth and \u00b5\u2013strongly convex, and R : Rn ! R [{ +1} is a closed convex\nregularizer. In some applications, R is either the indicator function of a convex set or a sparsity\ninducing non-smooth penalty such as `1-norm. We assume that the proximal operator of R, de\ufb01ned\nB , is easily computable (e.g., in closed form).\nby prox\u21b5R(x)\n2\u21b5ky xk2\n= hx, xi1/2\nAbove we use the weighted Euclidean norm kxkB\n= hBx, yi is a\nweighted inner product associated with a positive de\ufb01nite weight matrix B 0. Strong convexity\nof f is de\ufb01ned with respect to the same product and norm1.\n\n= argminy2RnR(y) + 1\n\nB , where hx, yiB\n\ndef\n\ndef\n\ndef\n\n1.1 Gradient sketching\n\nIn this paper we design proximal gradient-type methods for solving (1) without assuming that the\ntrue gradient of f is available. 
Instead, we assume that an oracle provides a random linear trans-\nformation (i.e., a sketch) of the gradient, which is the information available to drive the iterative\n\n1f is \u00b5\u2013strongly convex if f (x) f (y) + hrf (y), x yiB + \u00b5\n\n2kx yk2\n\nB for all x, y 2 Rn.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00b4eal, Canada.\n\n\fprocess. In particular, given a \ufb01xed distribution D over matrices S 2 Rn\u21e5b (b 1 can but does not\nneed to be \ufb01xed), and a query point x 2 Rn, our oracle provides us the random linear transformation\nof the gradient given by\n\n\u21e3(S, x)\n\ndef\n\n= S>rf (x) 2 Rb,\n\nS \u21e0D .\n\n(2)\n\nInformation of this type is available/used in a variety of scenarios. For instance, randomized coordi-\nnate descent (CD) methods use oracle (2) with D corresponding to a distribution over standard basis\nvectors. Minibatch/parallel variants of CD methods utilize oracle (2) with D corresponding to a dis-\ntribution over random column submatrices of the identity matrix. If one is prepared to use difference\nof function values to approximate directional derivatives, one can apply our oracle model to zeroth-\norder optimization [8]. Indeed, the directional derivative of f in a random direction S = s 2 Rn\u21e51\ncan be approximated by \u21e3(s, x) \u21e1 1\nWe now illustrate this concept using two examples.\nExample 1.1 (Sketches). (i) Coordinate sketch. Let D be the uniform distribution over standard\nunit basis vectors e1, e2, . . . , en of Rn. Then \u21e3(ei, x) = e>i rf (x), i.e., the ith partial derivative of f\nat x. (ii) Gaussian sketch. Let D be the standard Gaussian distribution in Rn. 
Then for s \u21e0D we\nhave \u21e3(s, x) = s>rf (x), i.e., the directional derivative of f at x in direction s.\n1.2 Related work\n\n\u270f (f (x + \u270fs) f (x)), where \u270f> 0 is suf\ufb01ciently small.\n\nIn the last decade, stochastic gradient-type methods for solving problem (1) have received unprecedented attention by theoreticians and practitioners alike. Speci\ufb01c examples of such methods are stochastic gradient descent (SGD) [43], variance-reduced variants of SGD such as SAG [44], SAGA [10], SVRG [22], and their accelerated counterparts [26, 1]. While these methods are speci\ufb01cally designed for objectives formulated as an expectation or a \ufb01nite sum, we do not assume such a structure. Moreover, these methods utilize fundamentally different stochastic gradient information: they have access to an unbiased gradient estimator. In contrast, we do not assume that (2) is an unbiased estimator of rf (x). In fact, \u21e3(S, x) 2 Rb and rf (x) 2 Rn do not even necessarily belong to the same space. Therefore, our algorithms and results are complementary to the above line of research. While the gradient sketch \u21e3(S, x) does not immediately lead to an unbiased estimator of the gradient, SEGA uses the information provided in the sketch to construct an unbiased estimator of the gradient via a sketch-and-project process. Sketch-and-project iterations were introduced in [15] in the context of linear feasibility problems. A dual view uncovering a direct relationship with stochastic subspace ascent methods was developed in [16]. The latest and most in-depth treatment of sketch-and-project for linear feasibility is based on the idea of stochastic reformulations [42]. 
Sketch-and-project can\nbe combined with Polyak [29, 28] and Nesterov momentum [14, 47], extended to convex feasibility\nproblems [30], matrix inversion [18, 17, 14], and empirical risk minimization [13, 19].\nThe line of work most closely related to our setup is that on randomized coordinate/subspace de-\nscent methods [34, 16]. Indeed, the information available to these methods is compatible with our\noracle for speci\ufb01c distributions D. However, the main disadvantage of these methods is that they\ncan not handle non-separable regularizers R. In contrast, the algorithm we propose\u2014SEGA\u2014works\nfor any regularizer R. In particular, SEGA can handle non-separable constraints even with coordinate\nsketches, which is out of range of current CD methods. Hence, our work could be understood as\nextending the reach of coordinate and subspace descent methods from separable to arbitrary regular-\nizers, which allows for a plethora of new applications. Our method is able to work with an arbitrary\nregularizer due to its ability to build an unbiased variance-reduced estimate of the gradient of f\nthroughout the iterative process from the random sketches provided by the oracle. Moreover, and\nunlike coordinate descent, SEGA allows for general sketches from essentially any distribution D.\n\n2\n\n\fAnother stream of work on designing gradient-type methods without assuming perfect access to the\ngradient is represented by the inexact gradient descent methods [9, 11, 45]. However, these methods\ndeal with deterministic estimates of the gradient and are not based on linear transformations of the\ngradient. Therefore, this second line of research is also signi\ufb01cantly different from what we do here.\n\n1.3 Outline\n\nWe describe SEGA in Section 2. Convergence results for general sketches are described in Sec-\ntion 3. Re\ufb01ned results for coordinate sketches are presented in Section 4, where we also describe\nand analyze an accelerated variant of SEGA. 
Experimental results can be found in Section 5. Con-\nclusions are drawn and potential extensions outlined in Appendix A. Proofs of the main results can\nbe found in Appendices B and C. An aggressive subspace variant of SEGA is described and analyzed\nin Appendix D. A simpli\ufb01ed analysis of SEGA in the case of coordinate sketches and for R \u2318 0 is\ndeveloped in Appendix E (under standard assumptions as in the main paper) and F (under alternative\nassumptions). Extra experiments for additional insights are included in Appendix G.\n\nNotation. We introduce notation where needed. We also provide a notation table in Appendix H.\n\n2 The SEGA Algorithm\n\nIn this section we introduce a learning process for estimating the gradient from the sketched infor-\nmation provided by (2); this will be used as a subroutine of SEGA.\nLet xk be the current iterate, and let hk be the current estimate of the gradient of f. The oracle\nqueried, and we receive new information in the form of the sketched gradient (2). Then, we would\nlike to update hk based on the new information. We do this using a sketch-and-project process [15,\n16, 42]: we set hk+1 to be the closest vector to hk (in a certain Euclidean norm) satisfying (2):\n\nhk+1 = arg min\n\nh2Rn kh hkk2\n\nB\n\nsubject to S>k h = S>k rf (xk).\n\n(3)\n\nThe closed-form solution of (3) is\n\nhk+1 = hk B1Zk(hk rf (xk)) = (I B1Zk)hk + B1Zkrf (xk),\n= SkS>k B1Sk\u2020 S>k . Notice that hk+1 is a biased estimator of rf (xk). 
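As a quick numerical sanity check of the sketch-and-project step: the closed-form solution (4) satisfies the constraint in (3) exactly, and applying the update twice with the same sketch changes nothing. The snippet below is an illustrative sketch, not code from the paper; the dimensions, the weight matrix B, and the vector `g` standing in for the gradient at the current point are all made up.

```python
import numpy as np

# Sketch-and-project update (3)-(4): with Z = S (S^T B^{-1} S)^+ S^T,
# the closest point to h (in the B-norm) satisfying S^T h' = S^T g is
#   h_new = h - B^{-1} Z (h - g).
rng = np.random.default_rng(1)
n, b = 5, 2
B = np.diag(rng.uniform(1.0, 2.0, n))   # positive definite weight matrix B
S = rng.standard_normal((n, b))         # sketch matrix S_k
h = rng.standard_normal(n)              # current gradient estimate h^k
g = rng.standard_normal(n)              # stands in for the gradient at x^k

B_inv = np.linalg.inv(B)
Z = S @ np.linalg.pinv(S.T @ B_inv @ S) @ S.T
h_new = h - B_inv @ Z @ (h - g)         # closed form (4)
```

Two properties follow directly: `S.T @ h_new` equals `S.T @ g` (the constraint in (3) holds), and repeating the update with the same sketch leaves `h_new` unchanged, which is the projection interpretation at work.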
In order to\n\nwhere Zk\nobtain an unbiased gradient estimator, we introduce a random variable2 \u2713k = \u2713(Sk) for which\n\n(4)\n\ndef\n\nIf \u2713k satis\ufb01es (5), it is straightforward to see that the random vector\n\nED [\u2713kZk] = B.\n\ngk def\n\n= (1 \u2713k)hk + \u2713khk+1 (4)\n\n= hk + \u2713kB1Zk(rf (xk) hk)\n\nis an unbiased estimator of the gradient:\n\n(5)\n\n(6)\n\n(7)\n\nED\u21e5gk\u21e4\n\n(5)+(6)\n\n=\n\nrf (xk).\n\nFinally, we use gk instead of the true gradient, and perform a proximal step with respect to R. This\nleads to a new optimization method, which we call SkEtched GrAdient Method (SEGA) and describe\nin Algorithm 1. We stress again that the method does not need the access to the full gradient.\n\n2Such a random variable may not exist. Some suf\ufb01cient conditions are provided later.\n\n3\n\n\fstepsize \u21b5> 0\n\nAlgorithm 1: SEGA: SkEtched GrAdient Method\n1 Initialize: x0, h0 2 Rn; B 0; distribution D;\n2 for k = 1, 2, . . . do\nSample Sk \u21e0D\n3\ngk = hk + \u2713kB1Zk(rf (xk) hk)\n4\nxk+1 = prox\u21b5R(xk \u21b5gk)\n5\nhk+1 = hk + B1Zk(rf (xk) hk)\n6\n\nFigure 1: Iterates of SEGA and CD\n\n2.1 SEGA as a variance-reduced method\nAs we shall show, both hk and gk become better at approximating rf (xk) as the iterates xk\napproach the optimum. Hence, the variance of gk as an estimator of the gradient tends to zero,\nwhich means that SEGA is a variance-reduced algorithm. The structure of SEGA is inspired by the\nJackSketch algorithm introduced in [19]. However, as JackSketch is aimed at solving a \ufb01nite-\nsum optimization problem with many components, it does not make much sense to apply it to (1).\nIndeed, when applied to (1) (with R = 0, since JackSketch was analyzed for smooth optimization\nonly), JackSketch reduces to gradient descent. 
While JackSketch performs Jacobian sketching\n(i.e., multiplying the Jacobian by a random matrix from the right, effectively sampling a subset of the\ngradients forming the \ufb01nite sum), SEGA multiplies the Jacobian by a random matrix from the left.\nIn doing so, SEGA becomes oblivious to the \ufb01nite-sum structure and transforms into the gradient\nsketching mechanism described in (2).\n\n2.2 SEGA versus coordinate descent\nWe now illustrate the above general setup on the simple example when D corresponds to a distribu-\ntion over standard unit basis vectors in Rn.\nExample 2.1. Let B = Diag(b1, . . . , bn) 0 and let D be de\ufb01ned as follows. We choose Sk = ei\nwith probability pi > 0, where e1, e2, . . . , en are the unit basis vectors in Rn. Then\n\nwhich can equivalently be written as hk+1\ndoes not depend on B. If we choose \u2713k = \u2713(Sk) = 1/pi, then\n\nj = hk\n\nED [\u2713kZk] =\n\neie>i\n1/bi\n\n= B\n\nwhich means that \u2713k is a bias-correcting random variable. We then get\n\nhk+1 (4)\n\n= hk + e>i (rf (xk) hk)ei,\ni = e>i rf (xk) and hk+1\nnXi=1\ne>i (rf (xk) hk)ei.\n\nei(e>i B1ei)1e>i =\n\n1\npi\n\npi\n\nnXi=1\n\ngk (6)\n\n= hk + 1\npi\n\n(8)\nj for j 6= i. Note that hk+1\n\n(9)\n\nIn the setup of Example 2.1, both SEGA and CD obtain new gradient information in the form of a\nrandom partial derivative of f. However, the two methods perform a different update: (i) SEGA\nallows for arbitrary proximal term, CD allows for separable one only [46, 27, 12]; (ii) While SEGA\nupdates all coordinates in every iteration, CD updates a single coordinate only; (iii) If we force\nhk = 0 in SEGA and use coordinate sketches, the method transforms into CD.\nBased on the above observations, we conclude that SEGA can be applied in more general settings for\nthe price of potentially more expensive iterations3. 
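To make the comparison concrete, the coordinate-sketch updates (8) and (9) of Example 2.1, combined with the proximal step of Algorithm 1, fit in a few lines. The following is a minimal sketch for B = I; the quadratic objective, the ball constraint, the stepsize, and the iteration budget are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def sega_coordinate(grad_f, prox, x0, p, alpha, n_iters, seed=0):
    """SEGA (Algorithm 1) with coordinate sketches S_k = e_i and B = I.
    Coordinate i is sampled with probability p[i]; the bias-correcting
    variable is theta_k = 1/p[i], giving the unbiased estimator (9)."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    h = np.zeros_like(x)                  # running gradient estimate h^k
    for _ in range(n_iters):
        i = rng.choice(len(x), p=p)       # sample S_k = e_i
        diff = grad_f(x)[i] - h[i]        # oracle only needs the i-th partial derivative
        g = h.copy()
        g[i] += diff / p[i]               # unbiased gradient estimator g^k, eq. (9)
        x = prox(x - alpha * g)           # proximal step of Algorithm 1
        h[i] += diff                      # sketch-and-project update, eq. (8)
    return x

# Illustration on a problem plain CD cannot handle: a non-separable
# regularizer R = indicator of the l2 unit ball, whose prox is the
# Euclidean projection.  Here f(x) = 0.5 ||x - b||^2, so grad f(x) = x - b,
# and the constrained minimizer is the projection of b onto the ball.
b = np.array([2.0, 0.0, 0.0])
grad_f = lambda x: x - b
proj_ball = lambda y: y / max(1.0, np.linalg.norm(y))
p = np.full(3, 1.0 / 3.0)                 # uniform coordinate sampling
x = sega_coordinate(grad_f, proj_ball, np.zeros(3), p, alpha=0.05, n_iters=3000)
```

Note that it is the persistent estimate `h` that makes the estimator variance-reduced; forcing `h = 0` with coordinate sketches recovers CD, as remarked above, which does not converge here because R is non-separable.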
For intuition-building illustration of how SEGA\n\n3Forming vector g and computing the prox.\n\n4\n\n\fworks, Figure 1 shows the evolution of iterates of both SEGA and CD applied to minimizing a simple\nquadratic function in 2 dimensions. For more \ufb01gures of this type, including the composite case\nwhere CD does not work, see Appendix G.1.\nIn Section 4 we show that SEGA enjoys, up to a small constant factor, the same theoretical iteration\ncomplexity as CD. This remains true when comparing state-of-the-art variants of CD with importance\nsampling, parallelism/mini-batching and acceleration with the corresponding variants of SEGA.\nRemark 2.2. Nontrivial sketches S and metric B might, in some applications, bring a substantial\nspeedup against the baseline choices mentioned in Example 2.1. Appendix D provides one example:\nthere are problems where the gradient of f always lies in a particular d-dimensional subspace of\n\nRn. In such a case, suitable choice of S and B leads to O n\n\nto the setup of Example 2.1. In Section 5.3 we numerically demonstrate this claim.\n\nd\u2013times faster convergence compared\n\n3 Convergence of SEGA for General Sketches\n\nIn this section we state a linear convergence result for SEGA (Algorithm 1) for general sketch distri-\nbutions D under smoothness and strong convexity assumptions.\n3.1 Smoothness assumptions\n\nWe will use the following general version of smoothness.\nAssumption 3.1 (Q-smoothness). Function f is Q-smooth with respect to B, where Q 0 and\nB 0. That is, for all x, y, the following inequality is satis\ufb01ed:\n\nf (x) f (y) hrf (y), x yiB 1\n\n2krf (x) rf (y)k2\nQ,\n\n(10)\n\nAssumption 3.1 is not standard in the literature. However, as Lemma B.1 states, in the special case\nof B = I and Q = M1, it reduces to M-smoothness (see Assumption 3.2), which is a common\nassumption in modern analysis of CD methods.\nAssumption 3.2 (M-smoothness). Function f is M-smooth for some matrix M 0. 
That is, for\nall x, y, the following inequality is satis\ufb01ed:\n\nf (x) \uf8ff f (y) + hrf (y), x yi + 1\n\n2kx yk2\nM.\n\n(11)\n\nAssumption 3.2 is fairly standard in the CD literature. It appears naturally in various application\nsuch as empirical risk minimization with linear predictors and is a baseline in the development of\nminibatch CD methods [41, 38, 36, 39]. We will adopt this notion in Section 4, when comparing\nSEGA to coordinate descent. Until then, let us consider the more general Assumption 3.1.\n\n3.2 Main result\n\nNow we present one of the key theorems of the paper, stating a linear convergence of SEGA.\nTheorem 3.3. Assume that f is Q\u2013smooth with respect to B, and \u00b5\u2013strongly convex. Fix x0, h0 2\ndom(F ) and let xk, hk be the random iterates produced by SEGA. Choose stepsize \u21b5> 0 and\nLyapunov parameter > 0 so that\n\ndef\n\n= ED\u21e5\u27132\nwhere C\n\u21b5khk rf (x\u21e4)k2\n\n\u21b5 (2(C B) + \u00b5B) \uf8ff ED [Z] ,\u21b5\n2 (Q ED [Z]) ,\nkZk\u21e4. 
Then E\u21e5k\u21e4 \uf8ff (1 \u21b5\u00b5)k0 for Lyapunov function k def\n\nB, where x\u21e4 is a solution of (1).\n\nC \uf8ff 1\n\n(12)\n\n= kxk x\u21e4k2\n\nB +\n\n5\n\n\fNonaccelerated method\n\nimportance sampling, b = 1\n\nNonaccelerated method\n\narbitrary sampling\nAccelerated method\n\nimportance sampling, b = 1\n\nAccelerated method\narbitrary sampling\n\nCD\n\nSEGA\n\nTrace(M)\n\n\u00b5\n\nlog 1\n\n\u270f [34]\n\n\u270f\n\nvi\n\n\u21e3maxi\npi\u00b5\u2318 log 1\n1.62 \u00b7 Pi pMii\n1.62 \u00b7qmaxi\n\nvi\np2\n\np\u00b5\n\n[41]\n\nlog 1\n\n\u270f [3]\n\ni \u00b5 log 1\n\n\u270f [20]\n\nlog 1\n\u270f\n\npi\u00b5\u2318 log 1\n\n\u270f\n\nlog 1\n\u270f\n\n\u00b5\n\nvi\n\n8.55 \u00b7 Trace(M)\n8.55 \u00b7\u21e3maxi\n9.8 \u00b7 Pi pMii\n9.8 \u00b7qmaxi\n\np\u00b5\n\nvi\np2\n\ni \u00b5 log 1\n\n\u270f\n\nTable 1: Complexity results for coordinate descent (CD) and our sketched gradient method (SEGA), specialized\nto coordinate sketching, for M\u2013smooth and \u00b5\u2013strongly convex functions.\n\nNote that k ! 0 implies hk ! rf (x\u21e4). Therefore SEGA is variance reduced, in contrast to CD in\nthe non-separable proximal setup, which does not converge to the solution. If is small enough so\nthat Q ED [Z] 0, one can always choose stepsize \u21b5 satisfying\n\nand inequalities (12) will hold. Therefore, we get the next corollary.\nCorollary 3.4. If < min(Q)\n\n\u21b5 \uf8ff minn\nmax(ED[Z]), \u21b5 satis\ufb01es (13) and k 1\n\nmin(ED[Z])\n\nmax(21(CB)+\u00b5B) , min(QED[Z])\n\n\u21b5\u00b5 log 0\n\n2max(C)\n\no\n\n(13)\n\n\u270f , then E\u21e5kxk x\u21e4k2\n\nB\u21e4 \uf8ff \u270f.\n\nAs Theorem 3.3 is rather general, we also provide a simpli\ufb01ed version thereof, complete with a\nsimpli\ufb01ed analysis (Theorem E.1 in Appendix E). In the simpli\ufb01ed version we remove the proximal\nsetting (i.e., we set R = 0), assume L\u2013smoothness4, and only consider coordinate sketches with\nuniform probabilities. 
The result is provided as Corollary 3.5.\nCorollary 3.5. Let B = I and choose D to be the uniform distribution over unit basis vectors in Rn. If the stepsize satis\ufb01es 0 <\u21b5 \uf8ff min{(1 L/n)/(2Ln), n1 (\u00b5 + 2(n 1)/)1}, then ED\u21e5k+1\u21e4 \uf8ff (1 \u21b5\u00b5)k, and therefore the iteration complexity is \u02dcO(nL/\u00b5).\n\nRemark 3.6. In the fully general case, one might choose \u21b5 to be bigger than bound (13), which depends on eigenvalue properties of ED [Z] , C, Q, B, leading to a better overall complexity. However, in the simple case with B = I, Q = I and Sk = eik with uniform probabilities, bound (13) is tight.\n\n4 Convergence of SEGA for Coordinate Sketches\n\nIn this section we compare SEGA with coordinate descent. We demonstrate that, specialized to a particular choice of the distribution D (where S is a random column submatrix of the identity matrix), which makes SEGA use the same random gradient information as that used in modern randomized CD methods, SEGA attains, up to a small constant factor, the same convergence rate as CD methods. Firstly, in Section 4.2 we develop SEGA in a general setup known as arbitrary sampling [41, 40, 37, 38, 6] (Theorem 4.2). Then, in Section 4.3 we develop an accelerated variant of SEGA (see Theorem C.5) for arbitrary sampling as well. Lastly, Corollary 4.3 and Corollary 4.4 provide importance sampling for both the nonaccelerated and accelerated methods, matching cutting-edge CD rates [41, 3] up to a constant factor under the same oracle and assumptions5. Table 1 summarizes the results of this section. We provide all proofs for this section in Appendix C.\n\n4The standard L\u2013smoothness assumption is a special case of M\u2013smoothness for M = LI, and hence is less general than both M\u2013smoothness and Q\u2013smoothness with respect to B.\n\n5A notion of importance minibatch sampling for coordinate descent was recently introduced in [20]. 
We state, without proof, that SEGA allows for the same importance sampling as developed in the mentioned paper.\n\n\fWe now describe the setup and technical assumptions for this section. In order to facilitate a direct comparison with CD (which does not work with a non-separable regularizer R), for simplicity we consider problem (1) in the simpli\ufb01ed setting with R \u2318 0. Further, function f is assumed to be M\u2013smooth (Assumption 3.2) and \u00b5\u2013strongly convex.\n\n4.1 De\ufb01ning D: samplings\nIn order to draw a direct comparison with general variants of CD methods (i.e., with those analyzed in the arbitrary sampling paradigm), we consider sketches in (3) that are column submatrices of the identity matrix: S = IS, where S is a random subset (aka sampling) of [n] def= {1, 2, . . . , n}. Note that the columns of IS are the standard basis vectors ei for i 2 S and hence Range (S) = Range (ei : i 2 S). So, the distribution D from which we draw matrices is uniquely determined by the distribution of the sampling S. Given a sampling S, de\ufb01ne p = (p1, . . . , pn) 2 Rn to be the vector satisfying pi = P (ei 2 Range (S)) = P (i 2 S), and P to be the matrix for which Pij = P ({i, j}\u2713 S). Note that p and P are the probability vector and probability matrix of the sampling S, respectively [38]. We assume throughout the paper that S is proper, i.e., we assume that pi > 0 for all i. State-of-the-art minibatch CD methods (including the ones we compare against [41, 20]) utilize large stepsizes related to the so-called Expected Separable Overapproximation (ESO) [38] parameters v = (v1, . . . , vn). ESO parameters play a key role in SEGA as well, and are de\ufb01ned next.\nAssumption 4.1 (ESO). 
There exists a vector v satisfying the following inequality\n\ndef\n\nP M Diag(p)Diag(v),\n\n(14)\n\nwhere denotes the Hadamard (i.e., element-wise) product of matrices.\nIn case of single coordinate sketches, parameters v are equal to coordinate-wise smoothness con-\nstants of f. An extensive study on how to choose them in general was performed in [38]. For\nnotational brevity, let us set \u02c6P\n\ndef\n= Diag(v) throughout this section.\n\ndef\n= Diag(p) and \u02c6V\n\n4.2 Non-accelerated method\n\nWe now state the convergence rate of (non-accelerated) SEGA for coordinate sketches with arbitrary\nsampling of subsets of coordinates. The corresponding CD method was developed in [41].\nTheorem 4.2. Assume that f is M\u2013smooth and \u00b5\u2013strongly convex. Denote k def\nkhkk2\n\n\u02c6P1. Choose \u21b5, > 0 such that\n\n= f (xk) f (x\u21e4) +\n\nwhere \n\ndef\n\n= \u21b5 \u21b52 maxi{ vi\n\nI \u21b52( \u02c6V \u02c6P1 M) \u232b \u00b5 \u02c6P1,\n\npi} . Then the iterates of SEGA satisfy E\u21e5 k\u21e4 \uf8ff (1 \u00b5)k 0.\n\n(15)\n\nWe now give an importance sampling result for a coordinate version of SEGA. We recover, up to\na constant factor, the same convergence rate as standard CD [34]. The probabilities we chose are\noptimal in our analysis and are proportional to the diagonal elements of matrix M.\nCorollary 4.3. Assume that f is M\u2013smooth and \u00b5\u2013strongly convex. Suppose that D is such that\nat each iteration standard unit basis vector ei is sampled with probability pi / Mii. If we choose\n\u21b5 = 0.232\n\nTrace(M) , = 0.061\n\nTrace(M), then E\u21e5 k\u21e4 \uf8ff\u21e31 0.117\u00b5\nTrace(M)\u2318k\n\nThe iteration complexities from Theorem 4.2 and Corollary 4.3 are summarized in Table 1. We also\nstate that , \u21b5 can be chosen so that (15) holds, and the rate from Theorem 4.2 coincides with the\nrate from Table 1. Theorem 4.2 and Corollary 4.3 hold even under a non-convex relaxation of strong\n2. 
Thus, SEGA works for\nconvexity \u2013 Polyak-\u0141ojasiewicz inequality: \u00b5(f (x) f (x\u21e4)) \uf8ff 1\na certain class of non-convex problems. For an overview on relaxations of strong convexity, see [23].\n\n2krf (x)k2\n\n 0.\n\n7\n\n\f4.3 Accelerated method\n\nIn this section, we propose an accelerated (in the sense of Nesterov\u2019s method [31, 32]) version of\nSEGA, which we call ASEGA. The analogous accelerated CD method, in which a single coordinate is\nsampled in every iteration, was developed and analyzed in [3]. The general variant utilizing arbitrary\nsampling was developed and analyzed in [20].\n\nAlgorithm 2: ASEGA: Accelerated SEGA\n1 Initialize: x0 = y0 = z0 2 Rn; h0 2 Rn; S; parameters \u21b5, , \u2327, \u00b5 > 0\n2 for k = 1, 2, . . . do\n3\n4\n\nxk = (1 \u2327 )yk1 + \u2327z k1\nSample Sk = ISk, where Sk \u21e0 S, and compute gk, hk+1 according to (4), (6)\nyk = xk \u21b5 \u02c6P1gk\nzk = 1\n\n5\n6\n\n1+\u00b5 (zk + \u00b5xk g k)\n\nThe method and analysis is inspired by [2]. Due to space limitations and technicality of the content,\nwe state the main theorem of this section in Appendix C.4. Here, we provide Corollary 4.4, which\nshows that Algorithm 2 with single coordinate sampling enjoys, up to a constant factor, the same\nconvergence rate as state-of-the-art accelerated coordinate descent method NUACDM [3].\n\nCorollary 4.4. Let the sampling be de\ufb01ned as follows: S = {i} w. p. pi / pMii, for i 2 [n]. Then\nthere exist acceleration parameters and a Lyapunov function \u2325k such that f (yk) f (x\u21e4) \uf8ff \u2325k and\nE\u21e5\u2325k\u21e4 \uf8ff (1 \u2327 )k\u23250 =1 O p\u00b5/Pi pMiik \u23250.\n\nThe iteration complexity provided by Theorem C.5 and Corollary 4.4 are summarized in Table 1.\n\n5 Experiments\n\nIn this section we perform numerical experiments to illustrate the potential of SEGA. Firstly, in Sec-\ntion 5.1, we compare it to projected gradient descent (PGD) algorithm. 
Then in Section 5.2, we study the performance of zeroth-order SEGA (when sketched gradients are estimated through function value evaluations) and compare it to an analogous zeroth-order method. Lastly, in Section 5.3 we verify the claim from Remark 3.6 that in some applications, a particular choice of sketches and metric might lead to signi\ufb01cantly faster convergence. In the experiments where theory-supported stepsizes were used, we obtained them by precomputing strong convexity and smoothness measures.\n\n5.1 Comparison to projected gradient\n\nIn this experiment, we show the potential superiority of our method to PGD. We consider the `2 ball constrained problem (R is the indicator function of the unit ball) with the oracle providing the sketched gradient in a random Gaussian direction. As we mentioned, a method moving in the gradient direction (the analogue of CD) will not converge, as R is not separable. Therefore, we can only compare against the projected gradient. In order to obtain the full gradient for PGD, one needs to gather n sketched gradients and solve a corresponding linear system. As for f, we choose 4 different quadratics; see Table 2 (appendix). We stress that these are synthetic problems generated for the purpose of illustrating the potential of our method against a natural baseline. Figure 2 compares SEGA and PGD under various relative cost scenarios of solving the linear system compared to the cost of the oracle calls. The results show that SEGA signi\ufb01cantly outperforms PGD as soon as solving the linear system is expensive, and is as fast as PGD even if solving the linear system comes for free.\n\n\fFigure 2: Convergence of SEGA and PGD on synthetic problems with n = 500. The indicator \u201cXn\u201d in the label indicates the setting where the cost of solving the linear system is Xn times higher than the cost of an oracle call. Recall that a linear system is solved after every n oracle calls. 
Stepsizes 1/max(M) and 1/(nmax(M)) were used for PGD and SEGA, respectively.\n\nFigure 3: Comparison of SEGA and randomized direct search for various problems. Theory-supported stepsizes were chosen for both methods. 500-dimensional problem.\n\n5.2 Comparison to zeroth-order optimization methods\n\nIn this section, we compare SEGA to the random direct search (RDS) method [5] under a zeroth-order oracle and R = 0. For SEGA, we estimate the sketched gradient using \ufb01nite differences. Note that RDS is a randomized version of the classical direct search method [21, 24, 25]. At iteration k, RDS moves to argmin{f (xk + \u21b5ksk), f (xk \u21b5ksk), f (xk)} for a random direction sk \u21e0D and a suitable stepsize \u21b5k. For illustration, we choose f to be a quadratic problem based on Table 2 and compare both Gaussian and coordinate sketches. Figure 3 shows that SEGA outperforms RDS.\n\n5.3 Subspace SEGA: a more aggressive approach\n\nAs mentioned in Remark 3.6, well-designed sketches are capable of exploiting the structure of f and lead to a better rate. We address this in detail in Appendix D, where we develop and analyze a subspace variant of SEGA. To illustrate this phenomenon in a simple setting, we perform experiments for problem (1) with f (x) = kAx bk2, where b 2 Rd and A 2 Rd\u21e5n has orthogonal rows, and with R being the indicator function of the unit ball in Rn. We assume that n d. We compare two methods: naiveSEGA, which uses coordinate sketches, and subspaceSEGA, where sketches are chosen as rows of A. Figure 4 indicates that subspaceSEGA outperforms naiveSEGA roughly by the factor n/d, as claimed in Appendix D.\n\nFigure 4: Comparison of SEGA with sketches from a correct subspace versus coordinate sketches naiveSEGA. Stepsize chosen according to theory. 1000-dimensional problem.\n\n\fReferences\n[1] Zeyuan Allen-Zhu. Katyusha: The \ufb01rst direct acceleration of stochastic gradient methods. 
In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200\u20131205. ACM, 2017.\n\n[2] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate uni\ufb01cation of gradient and mirror descent. In Innovations in Theoretical Computer Science, 2017.\n\n[3] Zeyuan Allen-Zhu, Zheng Qu, Peter Richt\u00b4arik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1110\u20131119, 2016.\n\n[4] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n\n[5] El Houcine Bergou, Peter Richt\u00b4arik, and Eduard Gorbunov. Random direct search method for minimizing nonconvex, convex and strongly convex functions. Manuscript, 2018.\n\n[6] Antonin Chambolle, Matthias J Ehrhardt, Peter Richt\u00b4arik, and Carola-Bibiane Sch\u00a8oenlieb. Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM Journal on Optimization, 28(4):2783\u20132808, 2018.\n\n[7] Chih-Chung Chang and Chih-Jen Lin. LibSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.\n\n[8] Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free optimization, volume 8. SIAM, 2009.\n\n[9] Alexandre d\u2019Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171\u20131183, 2008.\n\n[10] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646\u20131654, 2014.\n\n[11] Olivier Devolder, Franc\u00b8ois Glineur, and Yurii Nesterov. 
First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.

[12] Olivier Fercoq and Peter Richtárik. Accelerated, parallel and proximal coordinate descent. SIAM Journal on Optimization, 25:1997–2023, 2015.

[13] Robert M Gower, Donald Goldfarb, and Peter Richtárik. Stochastic block BFGS: squeezing more curvature out of data. In 33rd International Conference on Machine Learning, pages 1869–1878, 2016.

[14] Robert M Gower, Filip Hanzely, Peter Richtárik, and Sebastian Stich. Accelerated stochastic matrix inversion: general theory and speeding up BFGS rules for faster second-order optimization. arXiv:1802.04079, 2018.

[15] Robert M Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015.

[16] Robert M Gower and Peter Richtárik. Stochastic dual ascent for solving linear systems. arXiv preprint arXiv:1512.06890, 2015.

[17] Robert M Gower and Peter Richtárik. Linearly convergent randomized iterative methods for computing the pseudoinverse. arXiv:1612.06255, 2016.

[18] Robert M Gower and Peter Richtárik. Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM Journal on Matrix Analysis and Applications, 38(4):1380–1409, 2017.

[19] Robert M Gower, Peter Richtárik, and Francis Bach. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. arXiv preprint arXiv:1805.02632, 2018.

[20] Filip Hanzely and Peter Richtárik. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. arXiv preprint arXiv:1809.09354, 2018.

[21] Robert Hooke and Terry A Jeeves. "Direct search" solution of numerical and statistical problems.
Journal of the ACM (JACM), 8(2):212–229, 1961.

[22] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[23] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.

[24] Tamara G Kolda, Robert M Lewis, and Virginia Torczon. Optimization by direct search: New perspectives on some classical and modern methods. SIAM Review, 45(3):385–482, 2003.

[25] Jakub Konečný and Peter Richtárik. Simple complexity analysis of simplified direct search. arXiv preprint arXiv:1410.0390, 2014.

[26] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.

[27] Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pages 3059–3067, 2014.

[28] Nicolas Loizou and Peter Richtárik. Linearly convergent stochastic heavy ball method for minimizing generalization error. In NIPS Workshop on Optimization for Machine Learning, 2017.

[29] Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv:1712.09677, 2017.

[30] Ion Necoara, Peter Richtárik, and Andrei Patrascu. Randomized projection methods for convex feasibility problems: conditioning and convergence rates. arXiv:1801.04873, 2018.

[31] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2).
Soviet Mathematics Doklady, 27(2):372–376, 1983.

[32] Yurii Nesterov. Introductory lectures on convex optimization: A basic course. Kluwer Academic Publishers, 2004.

[33] Yurii Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103:127–152, 2005.

[34] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[35] Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2613–2621. PMLR, 2017.

[36] Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling I: Algorithms and complexity. Optimization Methods and Software, 31(5):829–857, 2016.

[37] Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling I: Algorithms and complexity. Optimization Methods and Software, 31(5):829–857, 2016.

[38] Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling II: Expected separable overapproximation. Optimization Methods and Software, 31(5):858–884, 2016.

[39] Zheng Qu, Peter Richtárik, Martin Takáč, and Olivier Fercoq. SDNA: Stochastic dual Newton ascent for empirical risk minimization. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1823–1832. PMLR, 2016.

[40] Zheng Qu, Peter Richtárik, and Tong Zhang. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In Advances in Neural Information Processing Systems, pages 865–873, 2015.

[41] Peter Richtárik and Martin Takáč.
On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233–1243, 2016.

[42] Peter Richtárik and Martin Takáč. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv:1706.01108, 2017.

[43] Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

[44] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.

[45] Mark Schmidt, Nicolas Le Roux, and Francis R Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, pages 1458–1466, 2011.

[46] Shai Shalev-Shwartz and Tong Zhang. Proximal stochastic dual coordinate ascent. arXiv preprint arXiv:1211.2717, 2012.

[47] Stephen Tu, Shivaram Venkataraman, Ashia C. Wilson, Alex Gittens, Michael I. Jordan, and Benjamin Recht. Breaking locality accelerates block Gauss-Seidel. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3482–3491. PMLR, 2017.
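The RDS update compared against SEGA in Section 5.2 can be sketched as follows. This is a minimal illustrative sketch, not the implementation used in the experiments: the quadratic objective, the unit-norm Gaussian directions, and the fixed stepsize are assumptions made for the demo.

```python
import numpy as np

def rds_step(f, x, s, alpha):
    # One RDS iteration: query the zeroth-order oracle at x, x + alpha*s
    # and x - alpha*s, and move to the best of the three points, so the
    # objective value never increases.
    candidates = [x, x + alpha * s, x - alpha * s]
    return min(candidates, key=f)

# Demo on a simple quadratic f(x) = ||x||^2 with random directions s^k ~ D.
rng = np.random.default_rng(0)
f = lambda x: float(x @ x)
x = np.ones(5)
for _ in range(200):
    s = rng.standard_normal(5)
    s /= np.linalg.norm(s)            # unit-norm random (Gaussian) direction
    x = rds_step(f, x, s, alpha=0.1)  # fixed stepsize, for illustration only
```

By construction the objective is non-increasing along the iterates; with random directions it keeps decreasing until the fixed stepsize becomes too large relative to the distance to the minimizer, which is why suitable stepsizes α^k matter in practice.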