{"title": "Stochastic Spectral and Conjugate Descent Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 3358, "page_last": 3367, "abstract": "The state-of-the-art methods for solving optimization problems in big dimensions are variants of randomized coordinate descent (RCD). In this paper we introduce a fundamentally new type of acceleration strategy for RCD based on the augmentation of the set of coordinate directions by a few spectral or conjugate directions. As we increase the number of extra directions to be sampled from, the rate of the method improves, and interpolates between the linear rate of RCD and a linear rate independent of the condition number. We develop and analyze also inexact variants of these methods where the spectral and conjugate directions are allowed to be approximate only. We motivate the above development by proving several negative results which highlight the limitations of RCD with importance sampling.", "full_text": "Stochastic Spectral and Conjugate Descent Methods\n\nDmitry Kovalev1,2 Eduard Gorbunov1 Elnur Gasanov1,2 Peter Richt\u00e1rik2,3,1\n\n1Moscow Institute of Physics and Technology, Dolgoprudny, Russia\n\n2King Abdullah University of Science and Technology, Thuwal, Saudi Arabia\n\n3University of Edinburgh, Edinburgh, United Kingdom\n\nAbstract\n\nThe state-of-the-art methods for solving optimization problems in big dimensions\nare variants of randomized coordinate descent (RCD). In this paper we introduce a\nfundamentally new type of acceleration strategy for RCD based on the augmenta-\ntion of the set of coordinate directions by a few spectral or conjugate directions.\nAs we increase the number of extra directions to be sampled from, the rate of the\nmethod improves, and interpolates between the linear rate of RCD and a linear rate\nindependent of the condition number. 
We also develop and analyze inexact variants of these methods, in which the spectral and conjugate directions are allowed to be only approximate. We motivate the above development by proving several negative results which highlight the limitations of RCD with importance sampling.\n\n1 Introduction\n\nAn increasing array of learning and training tasks reduce to optimization problems in very large dimensions. The state-of-the-art algorithms in this regime are based on randomized coordinate descent (RCD). Various acceleration strategies were proposed for RCD in the literature in recent years, based on techniques such as Nesterov's momentum [12, 9, 5, 1, 14], heavy ball momentum [16, 11], importance sampling [13, 19], adaptive sampling [4], random permutations [8], greedy rules [15], mini-batching [20], and locality breaking [21]. These techniques enable faster rates in theory and practice.\n\nIn this paper we introduce a fundamentally new type of acceleration strategy for RCD which relies on the idea of enriching the set of (unit) coordinate directions {e1, e2, . . . , en} in R^n, which are used in RCD as directions of descent, via the addition of a few spectral or conjugate directions. The algorithms we develop and analyze in this paper randomize over this enriched larger set of directions.\n\nFor expositional simplicity1, we focus on quadratic minimization\n\nmin_{x ∈ R^n} f(x) := (1/2) x^⊤ A x − b^⊤ x,   (1)\n\nwhere A is an n × n symmetric and positive definite matrix. The optimal solution is unique, and equal to x* = A^{−1} b.\n\n1.1 Randomized coordinate descent\n\nApplied to (1), RCD performs the iteration\n\nx_{t+1} = x_t − ((A_{:i}^⊤ x_t − b_i)/A_{ii}) e_i,   (2)\n\n1Many of our results can be extended to convex functions of the form f(x) = φ(Ax) − b^⊤x, where φ is a smooth and strongly convex function. 
However, due to space limitations, and the fact that we already have a lot to say in the special case φ(y) = (1/2)‖y‖², we leave these more general developments to a follow-up paper.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nMethod Name | Algorithm | Rate | Reference\nstochastic descent (SD) | (4), Alg 1 | (5), Lem 1 | Gower and Richtárik [6]\nstochastic spectral descent (SSD) | Alg 2 | (6), Thm 2 | NEW\nstochastic conjugate descent (SconD) | Sec 2.2 | Thm 2 | NEW\nrandomized coordinate descent (RCD) | (2), Alg 3 | (3), (11) | Gower and Richtárik [6]; Richtárik and Takáč [18]\nstochastic spectral coord. descent (SSCD) | Alg 4 | (7), Thm 8 | NEW\nmini-batch SD (mSD) | Alg 5 | Lem 9 | NEW\nmini-batch SSCD (mSSCD) | Alg 6 | Thm 10 | NEW\ninexact SconD (iSconD) | Alg 7 | Thm 15 | NEW\ninexact SSD (iSSD) | Alg 8 | Sec F.2 | NEW\nTable 1: Algorithms described in this paper.\n\nwhere at each iteration, i is chosen with probability pi > 0. It was shown by Leventhal and Lewis [10] that if the probabilities are proportional to the diagonal elements of A (i.e., pi ∝ Aii), then the random iterates of RCD satisfy E[‖x_t − x*‖²_A] ≤ (1 − ρ)^t ‖x_0 − x*‖²_A, where ρ = λmin(A)/Tr(A) and λmin(A) is the minimal eigenvalue of A. That is, as long as the number of iterations t is at least\n\nO((Tr(A)/λmin(A)) log(1/ε)),   (3)\n\nwe have E[‖x_t − x*‖²_A] ≤ ε. Note that Tr(A)/λmin(A) ≥ n, and that this can be arbitrarily larger than n.\n\n1.2 Stochastic descent\n\nRecently, Gower and Richtárik [6] developed an iterative “sketch and project” framework for solving linear systems and quadratic optimization; see [7] for extensions. 
In the context of problem (1), their method reads as\n\nx_{t+1} = x_t − (s_t^⊤(A x_t − b)/(s_t^⊤ A s_t)) s_t,   (4)\n\nwhere s_t ∈ R^n is a random vector sampled from some fixed distribution D. In this paper we will refer to this method by the name stochastic descent (SD).\n\nNote that x_{t+1} is obtained from x_t by minimizing f(x_t + h s_t) for h ∈ R and setting x_{t+1} = x_t + h s_t. Further, note that RCD arises as a special case with D being a discrete probability distribution over the set {e1, . . . , en}. However, SD converges for virtually any distribution D, including discrete and continuous distributions. In particular, Gower and Richtárik [6] show that as long as E_{s∼D}[H] is invertible, where H := ss^⊤/(s^⊤ A s), then SD converges as\n\nO((1/λmin(W)) log(1/ε)),   (5)\n\nwhere W := E_{s∼D}[A^{1/2} H A^{1/2}] (see Lemma 1 for a more refined result due to Richtárik and Takáč [18]). The rate of RCD in (3) can be obtained as a special case of (5).\n\n1.3 Stochastic spectral and conjugate descent\n\nThe starting point of this paper is the new observation that stochastic descent obtains the rate\n\nO(n log(1/ε))   (6)\n\nin the special case when D is chosen to be the uniform distribution over the eigenvectors of A (see Theorem 2). For obvious reasons, we refer to this new method as stochastic spectral descent (SSD). To the best of our knowledge, SSD was not explicitly considered in the literature before. 
We should note that SSD is fundamentally different from spectral gradient descent [3, 2], which refers to a family of gradient methods with a special choice of stepsize depending on the spectrum of ∇²f.\n\nThe rate (6) does not merely provide an improvement on the rate of RCD given in (3); what is remarkable is that this rate is completely independent of the properties (such as conditioning) of A.\n\nResult | Thm\nUniform probabilities are optimal for n = 2 | 3\nUniform probabilities are optimal for any n ≥ 2 as long as A is diagonal | 4\n“Importance sampling” pi ∝ Aii can lead to an arbitrarily worse rate than pi = 1/n | 5\n“Importance sampling” pi ∝ ‖A_{i:}‖² can lead to an arbitrarily worse rate than pi = 1/n | 5\nFor every n ≥ 2 and T > 0, ∃ A : rate of RCD with opt. probabilities is O(T log(1/ε)) | 6\nFor every n ≥ 2 and T > 0, ∃ A : rate of RCD with opt. probabilities is Ω(T log(1/ε)) | 7\nTable 2: Summary of results on importance and optimal sampling in RCD.\n\nMoreover, we show that this method is optimal among the class of stochastic descent methods (4) parameterized by the choice of the distribution D (see Theorem 8). Despite the attractiveness of its rate, SSD is not a practical method. This is because once we have the eigenvectors of A available, the optimal solution x* can be assembled directly without the need for an iterative method.\n\nWe extend all results discussed above for SSD, including the rate (6), to the more general class of methods we call stochastic conjugate descent (SconD), for which D is the uniform distribution over vectors v1, . . . , vn which are mutually A-conjugate: v_i^⊤ A v_j = 0 for i ≠ j and v_i^⊤ A v_i = 1.\n\n1.4 Optimizing probabilities in RCD\n\nThe idea of speeding up RCD via the use of non-uniform probabilities was pioneered by Nesterov [13] in the context of smooth convex minimization, and later built on by many authors [19, 17, 1]. In the case of non-accelerated RCD, and in the context of smooth convex optimization, the most popular choice of probabilities is to set pi ∝ Li, where Li is the Lipschitz constant of the gradient of the objective corresponding to coordinate i [13, 19]. For problem (1), we have Li = Aii. Gower and Richtárik [6] showed that the optimal probabilities for (1) can in principle be computed through semidefinite programming (SDP); however, no theoretical properties of the optimal solution of the SDP were given.\n\nAs a warm-up, we first ask: how important is importance sampling? More precisely, we investigate RCD with probabilities pi ∝ Aii, and RCD with probabilities pi ∝ ‖A_{i:}‖², considered as RCD with “importance sampling”, and compare these with the baseline RCD with uniform probabilities. Our result (see Theorem 5) contradicts conventional “wisdom”. In particular, we show that for every n there is a matrix A for which uniform probabilities lead to the best rate. Moreover, the rate of RCD with “importance sampling” can be arbitrarily worse than the rate of RCD with uniform probabilities. The same result applies to probabilities proportional to the square of the norm of the ith row of A.\n\nWe then switch gears, and motivated by the nature of SSD, we ask the following question: in order to obtain a condition-number-independent rate such as (6), do we have to consider new (and hard to compute) descent directions, such as eigenvectors of A, or can a similar effect be obtained using RCD with a better selection of probabilities? 
We give two negative answers to this question (see Theorems 6 and 7). First, we show that for any n ≥ 2 and any T > 0, there is a matrix A such that the rate of RCD with any probabilities (including the optimal probabilities) is O(T log(1/ε)). Second, we give a similar but much stronger statement where we reach the same conclusion, but for the lower bound as opposed to the upper bound. That is, O is replaced by Ω.\n\nAs a by-product of our investigations into importance sampling, we establish that for n = 2, uniform probabilities are optimal for all matrices A (see Thm 3). For a summary of these results, see Table 2.\n\n1.5 Interpolating between RCD and SSD\n\nRCD and SSD lie on opposite ends of a continuum of stochastic descent methods for solving (1). RCD “minimizes” the work per iteration without any regard for the number of iterations, while SSD minimizes the number of iterations without any regard for the cost per iteration (or pre-processing cost). Indeed, one step of RCD costs O(‖A_{i:}‖_0) (the number of nonzero entries in the ith row of A), and hence RCD can be implemented very efficiently for sparse A. If uniform probabilities are used, no pre-processing (for computing probabilities) is needed. These advantages are paid for by the rate (3), which can be arbitrarily high. On the other hand, the rate of SSD does not depend on A. This advantage is paid for by a high pre-processing cost: the computation of the eigenvectors. 
This pre-processing cost makes the method utterly impractical.\n\n  | general spectrum | n − k largest eigenvalues are γ-clustered: c ≤ λi ≤ γc for k + 1 ≤ i ≤ n | α-exponentially decaying eigenvalues\nRCD (pi ∝ Aii) | (Σ_i λi)/λ1 | γnc/λ1 | 1/α^{n−1}\nSSCD | ((k+1)λ_{k+1} + Σ_{i=k+2}^n λi)/λ_{k+1} | γn | 1/α^{n−k−1}\nSSD | n | n | n\nTable 3: Comparison of complexities of RCD, SSCD (with parameter 0 ≤ k ≤ n − 1) and SSD under various regimes on the spectrum of A. In all terms we suppress a factor of log(1/ε).\n\nOne of the main contributions of this paper is the development of a new parametric family of algorithms that in some sense interpolate between RCD and SSD. In particular, we consider the stochastic descent algorithm (4) with D being a discrete distribution over the search directions {e1, . . . , en} ∪ {u1, . . . , uk}, where ui is the eigenvector of A corresponding to the ith smallest eigenvalue of A. We call this new method stochastic spectral coordinate descent (SSCD).\n\nWe compute the optimal probabilities of this distribution, which turn out to be unique, and show that for k ≥ 1 they depend on the k + 1 smallest eigenvalues of A: 0 < λ1 ≤ λ2 ≤ ··· ≤ λ_{k+1}. In particular, we prove (see Theorem 8) that the rate of SSCD with optimal probabilities is\n\nO((((k + 1)λ_{k+1} + Σ_{i=k+2}^n λi)/λ_{k+1}) log(1/ε)).   (7)\n\nFor k = 0, SSCD reduces to RCD with pi ∝ Aii, and the rate (7) reduces to (3). For k = n − 1, SSCD does not reduce to SSD. However, the rates match. Indeed, in this case the rate (7) reduces to (6). 
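In code, the enriched distribution is only a few lines on top of the SD update. The sketch below is our own (not the authors' implementation); it uses the optimal parameters α = 1 and βi = λ_{k+1} − λi reported by Theorem 8, and reproduces the toy example with one tiny eigenvalue λ1 = δ.

```python
import numpy as np

def sscd(A, b, x0, k, iters, rng):
    """SSCD sketch: SD over {e_1,...,e_n} U {u_1,...,u_k} with the optimal
    weights alpha = 1 and beta_i = lambda_{k+1} - lambda_i (Theorem 8)."""
    n = A.shape[0]
    lam, U = np.linalg.eigh(A)                  # eigenvalues in ascending order
    dirs = np.hstack([np.eye(n), U[:, :k]])     # n coordinate + k spectral directions
    w = np.concatenate([np.diag(A), lam[k] - lam[:k]])  # alpha*A_ii, then beta_i
    p = w / w.sum()                             # w.sum() is the normalizer C_k
    x = x0.copy()
    for _ in range(iters):
        s = dirs[:, rng.choice(n + k, p=p)]
        x = x - (s @ (A @ x - b)) / (s @ A @ s) * s
    return x

# Toy spectrum from the text: lambda_1 = delta, lambda_2 = ... = lambda_n = 1.
rng = np.random.default_rng(1)
n, delta = 30, 1e-3
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.r_[delta, np.ones(n - 1)]) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

err = {}
for k in (0, 1):    # k = 0 is RCD with p_i ~ A_ii; k = 1 adds a single eigenvector
    x = sscd(A, b, np.zeros(n), k, iters=3000, rng=rng)
    d = x - x_star
    err[k] = np.sqrt(d @ A @ d)   # error in the A-norm
print(err)
```

With k = 0 the expected contraction per step is only 1 − δ/Tr(A), so 3000 iterations barely move the error, while k = 1 contracts at 1 − 1/n per step and converges to numerical precision, illustrating the dramatic acceleration described above.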
Moreover, the rate improves monotonically as k increases, from O((Tr(A)/λmin(A)) log(1/ε)) (for k = 0) to O(n log(1/ε)) (for k = n − 1).\n\n• SSCD removes the effect of the k smallest eigenvalues. Note that the rate (7) does not depend on the k smallest eigenvalues of A. That is, by adding the eigenvectors u1, . . . , uk corresponding to the k smallest eigenvalues to the set of descent directions, we have removed the effect of these eigenvalues.\n\n• Clustered eigenvalues. Assume that the n − k largest eigenvalues are clustered: c ≤ λi ≤ γc for some c > 0 and γ > 1, for all k + 1 ≤ i ≤ n. In this case, the rate (7) can be estimated as a function of the clustering “tightness” parameter γ: O(γn log(1/ε)). See Table 3. This can be arbitrarily better than the rate of RCD, even for k = 1. In other words, there are situations where by enriching the set of directions used by RCD by a single eigenvector only, the resulting method accelerates dramatically. To give a concrete and simplified example to illustrate this, assume that λ1 = δ > 0, while λ2 = ··· = λn = 1. In this case, RCD has the rate O((1 + (n − 1)/δ) log(1/ε)), while SSCD with k = 1 has the rate O(n log(1/ε)). So, SSCD is 1/δ times better than RCD, and the difference grows to infinity as δ approaches zero even for fixed dimension n.\n\n• Exponentially decaying eigenvalues. If the eigenvalues of A follow an exponential decay with factor 0 < α < 1, then the rate of RCD is O((1/α^{n−1}) log(1/ε)), while the rate of SSCD is O((1/α^{n−k−1}) log(1/ε)). This is an improvement by the factor 1/α^k, which can be very large even for small k if α is small. See Table 3. 
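As a quick sanity check of the exponential-decay column of Table 3 (our own arithmetic, not an experiment from the paper), one can evaluate the two complexity expressions directly:

```python
import numpy as np

n, alpha, k = 20, 0.5, 3
lam = np.sort(alpha ** np.arange(n))   # spectrum alpha^{n-1} <= ... <= alpha <= 1

rcd = lam.sum() / lam[0]                                 # Tr(A)/lambda_min, rate (3)
sscd = ((k + 1) * lam[k] + lam[k + 1:].sum()) / lam[k]   # rate (7)
print(rcd, sscd, rcd / sscd)
```

Here rcd evaluates to 2^20 − 1 ≈ 10^6 while sscd is about 1.3 × 10^5, a speed-up of essentially 1/α^k = 8 from adding just k = 3 eigenvectors.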
For an experimental confirmation of this prediction, see Figure 5.\n\n• Adding a few “largest” eigenvectors does not help. We show that in contrast with the situation above, adding a few of the “largest” eigenvectors to the coordinate directions of RCD does not help. This is captured formally in the supplementary material as Theorem 12.\n\n• Mini-batching. We extend SSCD to a mini-batch setting; we call the new method mSSCD. We show that the rate of mSSCD interpolates between the rate of mini-batch RCD and the rate of SSD. Moreover, we show that mSSCD is optimal among a certain parametric family of methods, and that its rate improves as k increases. See Theorem 10.\n\n1.6 Inexact Directions\n\nFinally, we relax the need to compute exact eigenvectors or A-conjugate vectors, and analyze the behavior of our methods for inexact directions. Moreover, we propose and analyze an inexact variant of SSD which does not arise as a special case of SD. See Sections E and F.\n\n2 Stochastic Descent\n\nThe stochastic descent method was described in (4). We now formalize it as Algorithm 1, and equip it with a stepsize, which will be useful in Section A.1, where we study a mini-batch version of SD.\n\nAlgorithm 1 Stochastic Descent (SD)\nParameters: Distribution D; Stepsize parameter ω > 0\nInitialize: Choose x0 ∈ R^n\nfor t = 0, 1, 2, . . . do\n  Sample search direction s_t ∼ D and set x_{t+1} = x_t − ω (s_t^⊤(A x_t − b)/(s_t^⊤ A s_t)) s_t\nend for\n\nIn order to guarantee convergence of SD, we restrict our attention to the class of proper distributions.\nAssumption 1. Distribution D is proper with respect to A. That is, E_{s∼D}[H] is invertible, where\n\nH := ss^⊤/(s^⊤ A s).   (8)\n\nNext we present the main convergence result for SD.\nLemma 1 (Convergence of stochastic descent [6, 18]). 
Let D be proper with respect to A, and let 0 < ω < 2. Stochastic descent (Algorithm 1) converges linearly in expectation,\n\nE[‖x_t − x*‖²_A] ≤ (1 − ω(2 − ω)λmin(W))^t ‖x_0 − x*‖²_A,\n\nand we also have the lower bound (1 − ω(2 − ω)λmax(W))^t ‖x_0 − x*‖²_A ≤ E[‖x_t − x*‖²_A], where\n\nW := E_{s∼D}[A^{1/2} H A^{1/2}].   (9)\n\nFinally, the statement remains true if we replace ‖x_t − x*‖²_A by f(x_t) − f(x*) for all t.\n\nIt is easy to observe that the stepsize choice ω = 1 is optimal. This is why we have decided to present the SD method (4) with this choice of stepsize. Moreover, notice that due to linearity of expectation,\n\nTr(W) =(9) E[Tr(A^{1/2} H A^{1/2})] =(8) E[Tr(zz^⊤/(z^⊤z))] = E[z^⊤z/(z^⊤z)] = 1,\n\nwhere z = A^{1/2}s. Therefore, 0 < λmin(W) ≤ 1/n ≤ λmax(W) ≤ 1.\n\n2.1 Stochastic Spectral Descent\n\nLet A = Σ_{i=1}^n λi ui ui^⊤ be the eigenvalue decomposition of A. That is, 0 < λ1 ≤ λ2 ≤ . . . ≤ λn are the eigenvalues of A and u1, . . . , un are the corresponding orthonormal eigenvectors. Consider now the SD method with D being the uniform distribution over the set {u1, . . . , un}, and ω = 1. This gives rise to a new variant of SD which we call stochastic spectral descent (SSD).\n\nAlgorithm 2 Stochastic Spectral Descent (SSD)\nInitialize: x0 ∈ R^n; (u1, λ1), . . . , (un, λn): eigenvectors and eigenvalues of A\nfor t = 0, 1, 2, . . . 
do\n  Choose i ∈ [n] uniformly at random and set x_{t+1} = x_t − (u_i^⊤ x_t − (u_i^⊤ b)/λi) u_i\nend for\n\nFor SSD we can establish an unusually strong convergence result, both in terms of speed and tightness.\nTheorem 2 (Convergence of stochastic spectral descent). Let {x_t} be the sequence of random iterates produced by stochastic spectral descent (Algorithm 2). Then\n\nE[‖x_t − x*‖²_A] = (1 − 1/n)^t ‖x_0 − x*‖²_A.   (10)\n\nThis theorem implies the rate (6) mentioned in the introduction. Up to a log factor, SSD only needs n iterations to converge. Notice that (10) is an identity, and hence the rate is not improvable.\n\n2.2 Stochastic Conjugate Descent\n\nThe same rate as in Theorem 2 holds for the stochastic conjugate descent (SconD) method, which arises as a special case of stochastic descent for ω = 1 and D being a uniform distribution over a set of A-orthogonal (i.e., conjugate) vectors. The proof follows by combining Lemmas 1 and 13.\n\n2.3 Randomized Coordinate Descent\n\nRCD (Algorithm 3) arises as a special case of SD with unit stepsize (ω = 1) and distribution D given by s_t = e_i with probability p_i > 0.\n\nAlgorithm 3 Randomized Coordinate Descent (RCD)\nParameters: probabilities p1, . . . , pn > 0\nInitialize: x0 ∈ R^n\nfor t = 0, 1, 2, . . . do\n  Choose i ∈ [n] with probability p_i > 0 and set x_{t+1} = x_t − ((A_{i:} x_t − b_i)/A_{ii}) e_i\nend for\n\nThe rate of RCD (Algorithm 3) can therefore be deduced from Lemma 1. Notice that in view of (8), we have E[H] = Σ_i p_i e_i e_i^⊤/A_{ii} = Diag(p1/A11, . . . , pn/Ann). So, as long as all probabilities are positive, Assumption 1 is satisfied. 
Therefore, Lemma 1 applies and RCD enjoys the rate\n\nO([λmin(A Diag(p1/A11, . . . , pn/Ann))]^{−1} log(1/ε)).   (11)\n\n2.3.1 Uniform probabilities can be optimal\n\nWe first prove that uniform probabilities are optimal in 2D.\nTheorem 3. Let n = 2 and consider RCD (Algorithm 3) with probabilities p1 > 0 and p2 > 0, p1 + p2 = 1. Then the choice p1 = p2 = 1/2 optimizes the rate of RCD in (11).\nNext we claim that uniform probabilities are optimal in any dimension n as long as A is diagonal.\nTheorem 4. Let n ≥ 2 and let A be diagonal. Then uniform probabilities (pi = 1/n for all i) optimize the rate of RCD in (11).\n\n2.3.2 “Importance” sampling can be unimportant\n\nIn our next result we contradict conventional wisdom about typical choices of “importance sampling” probabilities. We claim that diagonal and row-squared-norm probabilities can lead to an arbitrarily worse performance than uniform probabilities.\nTheorem 5. For every n ≥ 2 and T > 0, there exists A such that: (i) The rate of RCD with pi ∝ Aii is T times worse than the rate of RCD with uniform probabilities. (ii) The rate of RCD with pi ∝ ‖A_{i:}‖² is T times worse than the rate of RCD with uniform probabilities.\n\n2.3.3 Optimal probabilities can be bad\n\nFinally, we show that there is no hope for adjustment of probabilities in RCD to lead to a rate independent of the data A, as is the case for SSD. Our first result states that such a result cannot be obtained from the generic rate (11).\nTheorem 6. For every n ≥ 2 and T > 0, there exists A such that the number of iterations (as expressed by formula (11)) of RCD with any choice of probabilities p1, . . . , pn > 0 is O(T log(1/ε)).\nHowever, that does not mean, by itself, that such a result cannot possibly be obtained via a different analysis. 
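As an aside, the generic formula (11) makes statements like Theorem 5 easy to probe numerically. The toy matrix below is our own construction (not the one from the paper's proof); on it, the rate factor λmin(A · Diag(pi/Aii)) is roughly 50 times smaller under pi ∝ Aii than under uniform probabilities, so the diagonal "importance" rule needs roughly 50 times more iterations.

```python
import numpy as np

def rcd_rate(A, p):
    """Linear rate factor from (11): lambda_min(A @ Diag(p_i / A_ii)).
    The product is similar to D^{1/2} A D^{1/2}, so its eigenvalues are real."""
    return np.linalg.eigvals(A @ np.diag(p / np.diag(A))).real.min()

A = np.diag([1.0, 100.0])                              # diagonal test matrix
uniform = rcd_rate(A, np.array([0.5, 0.5]))            # -> 0.5
importance = rcd_rate(A, np.diag(A) / np.trace(A))     # p_i ~ A_ii -> 1/101
print(uniform, importance, uniform / importance)
```

This is consistent with Theorem 4 (A is diagonal, so uniform is optimal), and mirrors part (i) of Theorem 5: growing the ratio of the diagonal entries makes the gap arbitrarily large.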
Our next result shatters these hopes as we establish a lower bound which can be arbitrarily larger than the dimension n.\nTheorem 7. For every n ≥ 2 and T > 0, there exists an n × n positive definite matrix A and a starting point x0, such that the number of iterations of RCD with any choice of probabilities p1, . . . , pn > 0 is Ω(T log(1/ε)).\n\n3 Interpolating Between RCD and SSD\n\nAssume now that we have some partial spectral information available. In particular, fix k ∈ {0, 1, . . . , n − 1} and assume we know the eigenvectors ui and eigenvalues λi for i = 1, . . . , k. We now define a parametric distribution D(α, β1, . . . , βk) with parameters α > 0 and β1, . . . , βk ≥ 0 as follows. A sample s ∼ D(α, β1, . . . , βk) arises through the process\n\ns = e_i with probability p_i = αA_{ii}/C_k for i ∈ [n], and s = u_i with probability p_{n+i} = β_i/C_k for i ∈ [k],\n\nwhere C_k := αTr(A) + Σ_{i=1}^k β_i is a normalizing factor ensuring that the probabilities sum up to 1.\n\n3.1 The method and its convergence rate\n\nApplying the SD method with the distribution D = D(α, β1, . . . , βk) gives rise to a new specific method which we call stochastic spectral coordinate descent (SSCD).\n\nAlgorithm 4 Stochastic Spectral Coordinate Descent (SSCD)\nParameters: Distribution D(α, β1, . . . , βk)\nInitialize: x0 ∈ R^n\nfor t = 0, 1, 2, . . . do\n  Sample s_t ∼ D(α, β1, . . . , βk) and set x_{t+1} = x_t − (s_t^⊤(A x_t − b)/(s_t^⊤ A s_t)) s_t\nend for\n\nTheorem 8. Consider Stochastic Spectral Coordinate Descent (Algorithm 4) for fixed k ∈ {0, 1, . . . , n − 1}. 
The method converges linearly for all α > 0 and all nonnegative βi. The best rate is obtained for the parameters α = 1 and βi = λ_{k+1} − λi; and this is the unique choice of parameters leading to the best rate. In this case,\n\nE[‖x_t − x*‖²_A] ≤ (1 − λ_{k+1}/C_k)^t ‖x_0 − x*‖²_A,\n\nwhere C_k := (k + 1)λ_{k+1} + Σ_{i=k+2}^n λi. Moreover, the rate improves as k grows, and we have\n\nλ1/Tr(A) = λ1/C_0 ≤ ··· ≤ λ_{k+1}/C_k ≤ ··· ≤ λn/C_{n−1} = 1/n.\n\nIf k = 0, SSCD reduces to RCD (with diagonal probabilities). Since λ1/C_0 = λ1/Tr(A), we recover the rate of RCD of Leventhal and Lewis [10]. With the choice k = n − 1 our method does not reduce to SSD. However, the rates match. Indeed, λn/C_{n−1} = λn/(nλn) = 1/n (compare with Theorem 2).\n\n3.2 “Largest” eigenvectors do not help\n\nIt is natural to ask whether there is any benefit in considering a few “largest” eigenvectors instead. Unfortunately, for the same parametric family as in Theorem 8, the answer is negative. The optimal parameters suggest that RCD has a better rate without these directions. See Thm 12 in the supplementary material.\n\n4 Experiments\n\n4.1 Stochastic spectral coordinate descent (SSCD)\n\nIn our first experiment we study how the practical behavior of SSCD (Algorithm 4) depends on the choice of k. What we study here does not depend on the dimensionality of the problem (n), and hence it suffices to perform the experiments on small dimensional problems (n = 30). In this experiment we consider the regime of clustered eigenvalues described in the introduction and summarized in Table 3. 
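Test matrices with a prescribed clustered spectrum are easy to synthesize by conjugating a diagonal matrix with a random orthogonal basis; the sketch below (our own helper, with hypothetical parameter names) generates a matrix of the kind used in this section.

```python
import numpy as np

def clustered_spd(n, low, high, width, rng):
    """SPD matrix with n/2 eigenvalues in (low, low + width)
    and n/2 eigenvalues in (high, high + width)."""
    lam = np.concatenate([low + width * rng.random(n // 2),
                          high + width * rng.random(n // 2)])
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal basis
    return Q @ np.diag(lam) @ Q.T

rng = np.random.default_rng(0)
A = clustered_spd(30, low=5.0, high=50.0, width=1.0, rng=rng)
spec = np.linalg.eigvalsh(A)
print(spec.min(), spec.max())   # clusters near 5 and 50
```

Conjugation by an orthogonal Q leaves the spectrum untouched while producing a dense matrix, which is exactly what is needed to study how SSCD's rate depends on the eigenvalue clusters rather than on any special sparsity structure.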
In particular, we construct a synthetic matrix A ∈ R^{30×30} with the smallest 15 eigenvalues clustered in the interval (5, 5 + ∆) and the largest 15 eigenvalues clustered in the interval (θ, θ + ∆).\n\nFigure 1: Expected precision E[‖x_t − x*‖²_A / ‖x_0 − x*‖²_A] versus # iterations of SSCD for symmetric positive definite matrices A of size 30 × 30 with different structures of spectra. The spectrum of A consists of 2 equally sized clusters of eigenvalues; one in the interval (5, 5 + ∆), and the other in the interval (θ, θ + ∆).\n\nFigure 2: Expected precision versus # iterations of mini-batch SSCD for A ∈ R^{30×30} and several choices of mini-batch size τ. The spectrum of A was chosen as a uniform discretization of the interval [1, 60].\n\nWe vary the tightness parameter ∆ and the separation parameter θ, and study the performance of SSCD for various choices of k. See Figure 1.\n\nOur first finding is a confirmation of the phase transition phenomenon predicted by our theory. Recall that the rate of SSCD (see Theorem 8) is Õ(((k + 1)λ_{k+1} + Σ_{i=k+2}^n λi)/λ_{k+1}). If k < 15, we know λi ∈ (5, 5 + ∆) for i = 1, 2, . . . , k + 1, and λi ∈ (θ, θ + ∆) for i = k + 2, . . . , n. Therefore, the rate can be estimated as r_small := Õ(k + 1 + (n − k − 1)(θ + ∆)/5). On the other hand, if k ≥ 15, we know that λi ∈ (θ, θ + ∆) for i = k + 1, . . . , n, and hence the rate can be estimated as r_large := Õ(k + 1 + (n − k − 1)(θ + ∆)/θ). Note that if the separation θ between the two clusters is large, the rate r_large is much better than the rate r_small. 
Indeed, in this regime, the rate r_large becomes Õ(n), while r_small can be arbitrarily large.\n\nGoing back to Figure 1, notice that this can be observed in the experiments. There is a clear phase transition at k = 15, as predicted by the above analysis. Methods using k ∈ {0, 6, 12} are relatively slow (although still enjoying a linear rate), and tend to have similar behaviour, especially when ∆ is small. On the other hand, methods using k ∈ {18, 24, 29} are much faster, with a behaviour nearly independent of θ and ∆. Moreover, as θ increases, the difference in the rates between the slow methods using k ∈ {0, 6, 12} and the fast methods using k ∈ {18, 24, 29} grows. We have performed more experiments with three clusters; see Fig 4 in the supplementary material.\n\n4.2 Mini-batch SSCD\n\nIn Figure 2 we report on the behavior of mSSCD, the mini-batch version of SSCD, for four choices of the mini-batch parameter τ, and several choices of k. A mini-batch of size τ is processed in parallel on τ processors, and the cost of a single iteration of mSSCD is (roughly) the same for all τ. For τ = 1, the method reduces to SSCD, considered in the previous experiment (but on a different dataset). Since the number of iterations is small, there are no noticeable differences across different values of k. As τ grows, however, all methods become faster. Mini-batching seems to be more useful as k is larger. Moreover, we can observe that acceleration through mini-batching starts more aggressively for small values of k, and its added benefit for increasing values of k is getting smaller and smaller. 
Even for relatively small values of k, therefore, mini-batching can be expected to lead to substantial
speed-ups.

4.3 Matrix with 10 billion entries

In Figure 3 we report on an experiment using a synthetic problem with data matrix A of dimension
n = 10^5 (i.e., potentially with 10^10 entries). As all experiments were done on a laptop, we worked
with sparse matrices with only 10^6 nonzeros. In the first row of Figure 3 we consider a matrix A with
all eigenvalues distributed uniformly on the interval [1, 100]. We observe that SSCD with k = 10^4
(just 10% of n) requires about an order of magnitude fewer iterations than SSCD with k = 0 (= RCD).
In the second row we consider a scenario where l eigenvalues are small, contained in [1, 2], with
the rest of the eigenvalues contained in [100, 200]. We consider l = 10 and l = 1000 and study the
behaviour of SSCD with k = l. We see that for l = 10, SSCD performs dramatically better than
RCD: it is able to achieve machine precision, while RCD struggles to reduce the initial error by a
factor larger than 10^6. For l = 1000, SSCD achieves error 10^{-9}, while RCD struggles to push the
error below 10^{-4}. These tests show that, in terms of # iterations, SSCD has the capacity to improve
on RCD by many orders of magnitude.

Figure 3: Expected precision E[‖x^t − x^*‖_A^2 / ‖x^0 − x^*‖_A^2] versus # iterations of SSCD for a matrix A ∈ R^{10^5 × 10^5}. Top row: spectrum of A is uniformly distributed on [1, 100]. Bottom row: spectrum contained in two clusters: [1, 2] and [100, 200].

5 Extensions

Our algorithms and convergence results can be extended to eigenvectors and conjugate directions
which are only computed approximately. Some of this development can be found in the supplementary
material (see Section E).
Finally, as mentioned in the introduction, our results can be extended to the
more general problem of minimizing f(x) = φ(Ax) − b^T x, where φ is smooth and strongly convex.

References

[1] Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In ICML, pages 1110–1119, 2016.

[2] Jonathan Barzilai and Jonathan M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8:141–148, 1988.

[3] Ernesto G. Birgin, José Mario Martínez, and Marcos Raydan. Spectral projected gradient methods: Review and perspectives. Journal of Statistical Software, 60(3):1–21, 2014.

[4] Dominik Csiba, Zheng Qu, and Peter Richtárik. Stochastic dual coordinate ascent with adaptive probabilities. In ICML, pages 674–683, 2015.

[5] Olivier Fercoq and Peter Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.

[6] Robert M. Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660–1690, 2015.

[7] Robert M. Gower and Peter Richtárik. Stochastic dual ascent for solving linear systems. arXiv preprint arXiv:1512.06890, 2015.

[8] Ching-Pei Lee and Stephen J. Wright. Random permutations fix a worst case for cyclic coordinate descent. arXiv preprint arXiv:1607.08320, 2016.

[9] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In FOCS, 2013.

[10] Dennis Leventhal and Adrian Lewis.
Randomized methods for linear constraints: convergence rates and conditioning. Mathematics of Operations Research, 35:641–654, 2010.

[11] Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.

[12] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.

[13] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012. doi: 10.1137/100802001. First appeared in 2010 as CORE discussion paper 2010/2.

[14] Yurii Nesterov and Sebastian Stich. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110–123, 2017.

[15] Julie Nutini, Mark Schmidt, Issam H. Laradji, Michael Friedlander, and Hoyt Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In ICML, 2015.

[16] Boris Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[17] Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling I: algorithms and complexity. Optimization Methods and Software, 31(5):829–857, 2016.

[18] Peter Richtárik and Martin Takáč. Stochastic reformulations of linear systems: Algorithms and convergence theory. arXiv preprint arXiv:1706.01108, 2017.

[19] Peter Richtárik and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233–1243, 2016.

[20] Peter Richtárik and Martin Takáč.
Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1):433–484, 2016.

[21] Stephen Tu, Shivaram Venkataraman, Ashia C. Wilson, Alex Gittens, Michael I. Jordan, and Benjamin Recht. Breaking locality accelerates block Gauss-Seidel. In ICML, 2017.