{"title": "Accelerated Stochastic Greedy Coordinate Descent by Soft Thresholding Projection onto Simplex", "book": "Advances in Neural Information Processing Systems", "page_first": 4838, "page_last": 4847, "abstract": "In this paper we study the well-known greedy coordinate descent (GCD) algorithm to solve $\\ell_1$-regularized problems and improve GCD by the two popular strategies: Nesterov's acceleration and stochastic optimization. Firstly, we propose a new rule for greedy selection based on an $\\ell_1$-norm square approximation which is nontrivial to solve but convex; then an efficient algorithm called ``SOft ThreshOlding PrOjection (SOTOPO)'' is proposed to exactly solve the $\\ell_1$-regularized $\\ell_1$-norm square approximation problem, which is induced by the new rule. Based on the new rule and the SOTOPO algorithm, the Nesterov's acceleration and stochastic optimization strategies are then successfully applied to the GCD algorithm. The resulted algorithm called accelerated stochastic greedy coordinate descent (ASGCD) has the optimal convergence rate $O(\\sqrt{1/\\epsilon})$; meanwhile, it reduces the iteration complexity of greedy selection up to a factor of sample size. Both theoretically and empirically, we show that ASGCD has better performance for high-dimensional and dense problems with sparse solution.", "full_text": "Accelerated Stochastic Greedy Coordinate Descent by\n\nSoft Thresholding Projection onto Simplex\n\nChaobing Song, Shaobo Cui, Yong Jiang, Shu-Tao Xia\n\nTsinghua University\n\n{songcb16,shaobocui16}@mails.tsinghua.edu.cn\n\n{jiangy, xiast}@sz.tsinghua.edu.cn \u21e4\n\nAbstract\n\nIn this paper we study the well-known greedy coordinate descent (GCD) algorithm\nto solve `1-regularized problems and improve GCD by the two popular strategies:\nNesterov\u2019s acceleration and stochastic optimization. 
Firstly, based on an `1-norm\nsquare approximation, we propose a new rule for greedy selection which is non-\ntrivial to solve but convex; then an ef\ufb01cient algorithm called \u201cSOft ThreshOlding\nPrOjection (SOTOPO)\u201d is proposed to exactly solve an `1-regularized `1-norm\nsquare approximation problem, which is induced by the new rule. Based on the\nnew rule and the SOTOPO algorithm, the Nesterov\u2019s acceleration and stochastic\noptimization strategies are then successfully applied to the GCD algorithm. The re-\nsulted algorithm called accelerated stochastic greedy coordinate descent (ASGCD)\n\nhas the optimal convergence rate O(p1/\u270f); meanwhile, it reduces the iteration\n\ncomplexity of greedy selection up to a factor of sample size. Both theoretically and\nempirically, we show that ASGCD has better performance for high-dimensional\nand dense problems with sparse solutions.\n\n1\n\nIntroduction\n\nIn large-scale convex optimization, \ufb01rst-order methods are widely used due to their cheap iteration\ncost. In order to improve the convergence rate and reduce the iteration cost further, two important\nstrategies are used in \ufb01rst-order methods: Nesterov\u2019s acceleration and stochastic optimization.\nNesterov\u2019s acceleration is referred to the technique that uses some algebra trick to accelerate \ufb01rst-\norder algorithms; while stochastic optimization is referred to the method that samples one training\nexample or one dual coordinate at random from the training data in each iteration. Assume the\nobjective function F (x) is convex and smooth. Let F \u21e4 = minx2Rd F (x) be the optimal value. In\norder to \ufb01nd an approximate solution x that satis\ufb01es F (x) F \u21e4 \uf8ff \u270f, the vanilla gradient descent\nmethod needs O(1/\u270f) iterations. 
While after applying the Nesterov\u2019s acceleration scheme [18],\nthe resulted accelerated full gradient method (AFG) [18] only needs O(p1/\u270f) iterations, which is\nby a factor of the sample size [17] and obtain the optimal convergence rate O(p1/\u270f) by Nesterov\u2019s\n\noptimal for \ufb01rst-order algorithms [18]. Meanwhile, assume F (x) is also a \ufb01nite sum of n sample\nconvex functions. By sampling one training example, the resulted stochastic gradient descent (SGD)\nand its variants [15, 25, 1] can reduce the iteration complexity by a factor of the sample size. As an\nalternative of SGD, randomized coordinate descent (RCD) can also reduce the iteration complexity\n\nacceleration [16, 14]. The development of gradient descent and RCD raises an interesting problem:\ncan the Nesterov\u2019s acceleration and stochastic optimization strategies be used to improve other\nexisting \ufb01rst-order algorithms?\n\n\u21e4This work is supported by the National Natural Science Foundation of China under grant Nos. 61771273,\n\n61371078.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fIn this paper, we answer this question partly by studying coordinate descent with Gauss-Southwell\nselection, i.e., greedy coordinate descent (GCD). GCD is widely used for solving sparse optimization\nproblems in machine learning [24, 11, 19]. If an optimization problem has a sparse solution, it is\nmore suitable than its counterpart RCD. However, the theoretical convergence rate is still O(1/\u270f).\nMeanwhile if the iteration complexity is comparable, GCD will be preferable than RCD [19]. However\nin the general case, in order to do exact Gauss-Southwell selection, computing the full gradient\nbeforehand is necessary, which causes GCD has much higher iteration complexity than RCD. 
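To make the cost of exact Gauss-Southwell selection concrete, here is a minimal sketch (synthetic least-squares data, chosen only for illustration) of one greedy step versus one randomized coordinate choice in the smooth case:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic smooth objective f(x) = (1/2n) * ||A x - b||_2^2 (illustrative only).
n, d = 100, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad(x):
    return A.T @ (A @ x - b) / n  # full gradient: O(n*d) work

x = np.zeros(d)
g = grad(x)

# Gauss-Southwell (GCD): the full gradient must be formed just to select
# the single coordinate with the largest-magnitude partial derivative.
i_gs = int(np.argmax(np.abs(g)))

# RCD, by contrast, samples a coordinate uniformly and then only needs
# the one partial derivative of f at x (O(n) work for this objective).
i_rcd = int(rng.integers(d))

# Exact coordinate minimization along i_gs (f is quadratic in each coordinate).
L_i = (A[:, i_gs] ** 2).mean()  # coordinate-wise curvature
x[i_gs] -= g[i_gs] / L_i
```

Per iteration, the greedy selection costs as much as a full gradient, which is exactly the bottleneck that the stochastic strategy in ASGCD is designed to remove.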
To be\nconcrete, in this paper we consider the well-known nonsmooth `1-regularized problem:\n\ndef\n\ndef\n=\n\ndef\n\n1\nn\n\n(1)\n\nmin\n\n2 (bj aT\n\n= f (x) + kxk1\n\nx2RdnF (x)\n\nfj(x) + kxk1o,\n\nnXj=1\nnPn\nj=1 fj(x) is a smooth convex function that is\nwhere 0 is a regularization parameter, f (x) = 1\na \ufb01nite average of n smooth convex function fj(x). Given samples {(a1, b1), (a2, b2), . . . , (an, bn)}\nwith aj 2 Rd (j 2 [n]\nj x, bj), then (1) is an `1-regularized\n= {1, 2, . . . , n}), if each fj(x) = fj(aT\nempirical risk minimization (`1-ERM) problem. For example, if bj 2 R and fj(x) = 1\nj x)2,\nj x)), `1-regularized logistic regression\n(1) is Lasso; if bj 2 {1, 1} and fj(x) = log(1 + exp(bjaT\nis obtained.\nIn the above nonsmooth case, the Gauss-Southwell rule has 3 different variants [19, 24]: GS-s, GS-r\nand GS-q. The GCD algorithm with all the 3 rules can be viewed as the following procedure: in\neach iteration based on a quadratic approximation of f (x) in (1), one minimizes a surrogate objective\nfunction under the constraint that the direction vector used for update has at most 1 nonzero entry.\nThe resulted problems under the 3 rules are easy to solve but are nonconvex due to the cardinality\nconstraint of direction vector. While when using Nesterov\u2019s acceleration scheme, convexity is needed\n\nfor the derivation of the optimal convergence rate O(p1/\u270f) [18]. Therefore, it is impossible to\n\naccelerate GCD by the Nesterov\u2019s acceleration scheme under the 3 existing rules.\nIn this paper, we propose a novel variant of Gauss-Southwell rule by using an `1-norm square\napproximation of f (x) rather than quadratic approximation. The new rule involves an `1-regularized\n`1-norm square approximation problem, which is nontrivial to solve but is convex. To exactly\nsolve the challenging problem, we propose an ef\ufb01cient SOft ThreshOlding PrOjection (SOTOPO)\nalgorithm. 
The SOTOPO algorithm has O(d + |Q| log |Q|) cost, where it is often the case |Q|\u2327 d.\nThe complexity result O(d + |Q| log |Q|) is better than O(d log d) of its counterpart SOPOPO [20],\nwhich is an Euclidean projection method.\nThen based on the new rule and SOTOPO, we accelerate GCD to attain the optimal convergence rate\nO(p1/\u270f) by combing a delicately selected mirror descent step. Meanwhile, we show that it is not\n\nnecessary to compute full gradient beforehand: sampling one training example and computing a noisy\ngradient rather than full gradient is enough to perform greedy selection. This stochastic optimization\ntechnique reduces the iteration complexity of greedy selection by a factor of the sample size. The\n\ufb01nal result is an accelerated stochastic greedy coordinate descent (ASGCD) algorithm.\nAssume x\u21e4 is an optimal solution of (1). Assume that each fj(x)(for all j 2 [n]) is Lp-smooth w.r.t.\nk\u00b7k p (p = 1, 2), i.e., for all x, y 2 Rd,\n\n2\n\nkrfj(x) rfj(y)kq \uf8ff Lpkx ykp,\n\n(2)\n\nwhere if p = 1, then q = 1; if p = 2, then q = 2.\nIn order to \ufb01nd an x that satis\ufb01es F (x) F (x\u21e4) \uf8ff \u270f, ASGCD needs O\u21e3pCL1kx\u21e4k1\n\n(16)), where C is a function of d that varies slowly over d and is upper bounded by log2(d). For\nhigh-dimensional and dense problems with sparse solutions, ASGCD has better performance than the\nstate of the art. Experiments demonstrate the theoretical result.\nNotations: Let [d] denote the set {1, 2, . . . , d}. Let R+ denote the set of nonnegative real number. For\nx 2 Rd, let kxkp = (Pd\np (1 \uf8ff p < 1) denote the `p-norm and kxk1 = maxi2[d] |xi|\ndenote the `1-norm of x. For a vector x, let dim(x) denote the dimension of x; let xi denote the i-th\nelement of x. For a gradient vector rf (x), let rif (x) denote the i-th element of rf (x). For a set\nS, let |S| denote the cardinality of S. 
Denote the simplex 4d = {\u2713 2 Rd\n\n\u2318 iterations (see\n\ni=1 \u2713i = 1}.\n\ni=1 |xi|p)\n\np\u270f\n\n1\n\n+ :Pd\n\n\f2 The SOTOPO algorithm\n\nThe proposed SOTOPO algorithm aims to solve the proposed new rule, i.e., minimize the following\n`1-regularized `1-norm square approximation problem,\n\n\u02dch\n\ndef\n= arg min\n\ng2Rd \u21e2hrf (x), gi +\n\n1\n2\u2318kgk2\n\n1 + kx + gk1 ,\n\n(3)\n\ndef\n= x + \u02dch,\n\n\u02dcx\n\n(4)\nwhere x denotes the current iteration, \u2318 a step size, g the variable to optimize, \u02dch the director vector for\nupdate and \u02dcx the next iteration. The number of nonzero entries of \u02dch denotes how many coordinates\nwill be updated in this iteration. Unlike the quadratic approximation used in GS-s, GS-r and GS-q\nrules, in the new rule the coordinate(s) to update is implicitly selected by the sparsity-inducing\nproperty of the `1-norm square kgk2\n1 rather than using the cardinality constraint kgk0 \uf8ff 1 (i.e., g\nhas at most 1 nonzero element) [19, 24]. By [8, \u00a79.4.2], when the nonsmooth term kx + gk1 in (1)\ndoes not exist, the minimizer of the `1-norm square approximation (i.e., `1-norm steepest descent)\nis equivalent to GCD. When kx + gk1 exists, generally, there may be one or more coordinates to\nupdate in this new rule. Because of the sparsity-inducing property of kgk2\n1 and kx + gk1, both the\ndirection vector \u02dch and the iterative solution \u02dcx are sparse. In addition, (3) is an unconstrained problem\nand thus is feasible.\n\n2.1 A variational reformulation and its properties\n(3) involves the nonseparable, nonsmooth term kgk2\n1 and the nonsmooth term kx + gk1. Because\nthere are two nonsmooth terms, it seems dif\ufb01cult to solve (3) directly. While by the variational\nin [5] 2, in Lemma 1, it is shown that we can transform the\nidentity kgk2\noriginal nonseparable and nonsmooth problem into a separable and smooth optimization problem on\na simplex.\nLemma 1. 
By de\ufb01ning\n\n1 = inf \u271324dPd\n\ng2\ni\n\u2713i\n\ni=1\n\nJ(g, \u2713)\n\ndef\n\n= hrf (x), gi +\n\n1\n2\u2318\n\n\u02dcg(\u2713)\n\u02dc\u2713\n\ndef\n= arg ming2Rd J(g, \u2713),\ndef\n= arg inf \u271324d J(\u2713),\n\ndXi=1\n\ng2\ni\n\u2713i\n\n+ kx + gk1,\n\nJ(\u2713)\n\ndef\n= J(\u02dcg(\u2713),\u2713 ),\n\n(5)\n\n(6)\n\n(7)\n\nwhere \u02dcg(\u2713) is a vector function. Then the minimization problem to \ufb01nd \u02dch in (3) is equivalent to the\nproblem (7) to \ufb01nd \u02dc\u2713 with the relation \u02dch = \u02dcg(\u02dc\u2713). Meanwhile, \u02dcg(\u2713) and J(\u2713) in (6) are both coordinate\nseparable with the expressions\n\ndef\n\n8i 2 [d], \u02dcgi(\u2713) = \u02dcgi(\u2713i)\nJ(\u2713) =\n\n= sign(xi \u2713i\u2318rif (x)) \u00b7 max{0,|xi \u2713i\u2318rif (x)| \u2713i\u2318} xi, (8)\n(9)\n\ndef\n\nJi(\u2713i), where Ji(\u2713i)\n\n= rif (x) \u00b7 \u02dcgi(\u2713i) +\n\n+ |xi + \u02dcgi(\u2713i)|.\n\n1\n2\u2318\n\ndXi=1\n\n\u02dcg2\ni (\u2713i)\n\u2713i\n\ndXi=1\n\nIn Lemma 1, (8) is obtained by the iterative soft thresholding operator [7]. By Lemma 1, we can\nreformulate (3) into the problem (5), which is about two parameters g and \u2713. Then by the joint\nconvexity, we swap the optimization order of g and \u2713. Fixing \u2713 and optimizing with respect to (w.r.t.)\ng, we can get a closed form of \u02dcg(\u2713), which is a vector function about \u2713. Substituting \u02dcg(\u2713) into J(g, \u2713),\nwe get the problem (7) about \u2713. Finally, the optimal solution \u02dch in (3) can be obtained by \u02dch = \u02dcg(\u02dc\u2713).\nThe explicit expression of each Ji(\u2713i) can be given by substituting (8) into (9). Because \u2713 2 4d, we\nhave for all i 2 [d], 0 \uf8ff \u2713i \uf8ff 1. 
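As a numerical sanity check of the closed form (8) (with arbitrary illustrative values for $x_i$, $\nabla_i f(x)$, $\theta_i$, $\lambda$, $\eta$; none of these come from the paper), one can compare it against brute-force minimization of the one-dimensional objective $J_i$:

```python
import numpy as np

def g_tilde(x, grad_f, theta, lam, eta):
    """Closed form (8): coordinate-wise minimizer of J(g, theta) over g,
    obtained by applying the soft-thresholding operator to x - theta*eta*grad."""
    u = x - theta * eta * grad_f
    return np.sign(u) * np.maximum(0.0, np.abs(u) - theta * eta * lam) - x

# Check one coordinate against a brute-force grid minimization of
# J_i(g) = grad_i*g + g^2/(2*eta*theta_i) + lam*|x_i + g|.
x_i, grad_i, theta_i, lam, eta = 0.7, -1.3, 0.4, 0.5, 0.2
J = lambda g: grad_i * g + g ** 2 / (2 * eta * theta_i) + lam * abs(x_i + g)

grid = np.linspace(-3, 3, 200001)
g_star = grid[np.argmin(J(grid))]          # brute-force minimizer
g_closed = g_tilde(np.array([x_i]), np.array([grad_i]),
                   np.array([theta_i]), lam, eta)[0]
```

The closed form and the grid search agree up to the grid resolution, confirming that (8) is the proximal (soft-thresholding) solution of the per-coordinate subproblem.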
In the following Lemma 2, it is observed that the derivate J0i(\u2713i) can\nbe a constant or have a piecewise structure, which is the key to deduce the SOTOPO algorithm.\n\n2The in\ufb01ma can be replaced by minimization if the convention \u201c0/0 = 0\u201d is used.\n\n3\n\n\fdef\n=\n\nand ri2\n\n, then J0i(\u2713i) belongs to one of the 4 cases,\n\nLemma 2. Assume that for all i 2 [d], J0i(0) and J0i(1) have been computed. Denote ri1\n|xi|p2\u2318J 0i(0)\n|xi|p2\u2318J 0i(1)\n(case a) : J0i(\u2713i) = 0,\n(case c) : J0i(\u2713i) =(J0i(0),\n x2\n\n(case b) : J0i(\u2713i) = J0i(0) < 0,\n\n0 \uf8ff \u2713i \uf8ff ri1\nri1 <\u2713 i \uf8ff 1\n\n0 \uf8ff \u2713i \uf8ff 1,\n\n0 \uf8ff \u2713i \uf8ff 1,\n0 \uf8ff \u2713i \uf8ff ri1\nri1 <\u2713 i < ri2\nri2 \uf8ff \u2713i \uf8ff 1\n\n(case d) : J0i(\u2713i) =8><>:\n\nAlthough the formulation of J0i(\u2713i) is complicated, by summarizing the property of the 4 cases in\nLemma 2, we have Corollary 1.\nCorollary 1. For all i 2 [d] and 0 \uf8ff \u2713i \uf8ff 1, if the derivate J0i(\u2713i) is not always 0, then J0i(\u2713i) is a\nnon-decreasing, continuous function with value always less than 0.\nCorollary 1 shows that except the trivial (case a), for all i 2 [d], whichever J0i(\u2713i) belong to (case b),\n(case c) or case (d), they all share the same group of properties, which makes a consistent iterative\nprocedure possible for all the cases. 
The different formulations in the four cases mainly have impact\nabout the stopping criteria of SOTOPO.\n\nJ0i(0),\n x2\n2\u2318\u27132\ni\nJ0i(1),\n\ni\n\ni\n\n2\u2318\u27132\ni\n\n,\n\n,\n\n,\n\ndef\n=\n\n.\n\n2.2 The property of the optimal solution\nThe Lagrangian of the problem (7) is\n\nL(\u2713, , \u21e3 )\n\ndef\n\n= J(\u2713) + \u21e3 dXi=1\n\n\u2713i 1\u2318 h\u21e3,\u2713i,\n\n(10)\n\nwhere 2 R is a Lagrange multiplier and \u21e3 2 Rd\nDue to the coordinate separable property of J(\u2713) in (9), it follows that @J (\u2713)\n@\u2713i\nKKT condition of (10) can be written as\n\n+ is a vector of non-negative Lagrange multipliers.\n= J0i(\u2713i). Then the\n\ndXi=1\n\n8i 2 [d],\n\nJ0i(\u2713i) + \u21e3i = 0,\u21e3\n\ni\u2713i = 0,\n\nand\n\n\u2713i = 1.\n\n(11)\n\ndef\n\nBy reformulating the KKT condition (11), we have Lemma 3.\nLemma 3. If (\u02dc, \u02dc\u2713, \u02dc\u21e3) is a stationary point of (10), then \u02dc\u2713 is an optimal solution of (7). Meanwhile,\ndenote S\n\n= {i : \u02dc\u2713i > 0} and T\n= {j : \u02dc\u2713j = 0}, then the KKT condition can be formulated as\nPi2S\n\u02dc\u2713i = 1;\nfor all j 2 T,\nfor all i 2 S,\n\n\u02dc\u2713j = 0;\n\u02dc = J0i(\u02dc\u2713i) maxj2T J0j(0).\n\n(12)\n\ndef\n\nBy Lemma 3, if the set S in Lemma 3 is known beforehand, then we can compute \u02dc\u2713 by simply\napplying the equations in (12). Therefore \ufb01nding the optimal solution \u02dc\u2713 is equivalent to \ufb01nding the\nset of the nonzero elements of \u02dc\u2713.\n\n8<:\n\n2.3 The soft thresholding projection algorithm\nIn Lemma 3, for each i 2 [d] with \u02dc\u2713i > 0, it is shown that the negative derivate J0i(\u02dc\u2713i) is equal to a\nsingle variable \u02dc. Therefore, a much simpler problem can be obtained if we know the coordinates of\nthese positive elements. 
At \ufb01rst glance, it seems dif\ufb01cult to identify these coordinates, because the\nnumber of potential subsets of coordinates is clearly exponential on the dimension d. However, the\nproperty clari\ufb01ed by Lemma 2 enables an ef\ufb01cient procedure for identifying the nonzero elements of\n\u02dc\u2713. Lemma 4 is a key tool in deriving the procedure for identifying the non-zero elements of \u02dc\u2713.\nLemma 4 (Nonzero element identi\ufb01cation). Let \u02dc\u2713 be an optimal solution of (7). Let s and t be two\ncoordinates such that J0s(0) < J0t(0). If \u02dc\u2713s = 0, then \u02dc\u2713t must be 0 as well; equivalently, if \u02dc\u2713t > 0,\nthen \u02dc\u2713s must be greater than 0 as well.\n\n4\n\n\fdef\n\n= rJ(0) such that ui1 ui2 \u00b7\u00b7\u00b7 uid, where {i1, i2, . . . , id}\nLemma 4 shows that if we sort u\nis a permutation of [d], then the set S in Lemma 3 is of the form {i1, i2, . . . , i%}, where 1 \uf8ff % \uf8ff d.\nIf % is obtained, then we can use the fact that for all j 2 [%],\n%Xj=1\n\nto compute \u02dc. Therefore, by Lemma 4, we can ef\ufb01ciently identify the nonzero elements of the optimal\nsolution \u02dc\u2713 after a sort operation, which costs O(d log d). However based on Lemmas 2 and 3, the sort\ncost O(d log d) can be further reduced by the following Lemma 5.\nLemma 5 (Ef\ufb01cient identi\ufb01cation). Assume \u02dc\u2713 and S are given in Lemma 3. Then for all i 2 S,\n\nJ0ij (\u02dc\u2713ij ) = \u02dc\n\n\u02dc\u2713ij = 1\n\n(13)\n\nand\n\n(14)\n\nJ0i(0) max\n\nj2[d]{J0j(1)}.\n\nBy Lemma 5, before ordering u, we can \ufb01lter out all the coordinates i\u2019s that satisfy J0i(0) <\nmaxj2[d] J0j(1). Based on Lemmas 4 and 5, we propose the SOft ThreshOlding PrOjection\n(SOTOPO) algorithm in Alg. 1 to ef\ufb01ciently obtain an optimal solution \u02dc\u2713. In the step 1, by Lemma 5,\nwe \ufb01nd the quantity vm, im and Q. 
In the step 2, by Lemma 4, we sort the elements {J0i(0)| i 2 Q}.\nIn the step 3, because S in Lemma 3 is of the form {i1, i2, . . . , i%}, we search the quantity \u21e2 from\n1 to |Q| + 1 until a stopping criteria is met. In Alg. 1, the number of nonzero elements of \u02dc\u2713 is \u21e2 or\n\u21e2 1. In the step 4, we compute the \u02dc in Lemma 3 according to the conditions. In the step 5, the\noptimal \u02dc\u2713 and the corresponding \u02dch, \u02dcx are given.\n\nAlgorithm 1 \u02dcx =SOTOPO(rf (x), x,,\u2318 )\n\n1. Find\n\n(vm, im)\n\ndef\n\n= (maxi2[d]{J0i(1)}, arg maxi2[d]{J0i(1)}), Q\n\ndef\n\n= {i 2 [d]| J0i(0) > vm}.\n(0), where\n\n2. Sort {J0i(0)| i 2 Q} such that J0i1(0) J0i2(0) \u00b7\u00b7\u00b7 J0i|Q|\n\n{i1, i2, . . . , i|Q|} is a permutation of the elements in Q. Denote\ni|Q|+1\nv\n\n= (J0i1(0),J0i2(0), . . . ,J0i|Q|\n\n(0), vm),\n\nand\n\ndef\n\ndef\n= im, v|Q|+1\n\ndef\n= vm.\n\n3. For j 2 [|Q| + 1], denote Rj = {ik|k 2 [j]}. Search from 1 to |Q| + 1 to \ufb01nd the quantity\n\n= minj 2 [|Q| + 1]| J0ij (0) = J0ij (1) or Xl2Rj |xl| p2\u2318vj or j = |Q| + 1 .\n\ndef\n\n\u21e2\n\n4. The \u02dc in Lemma 3 is given by\n\n/(2\u2318),\n\nif Pl2R\u21e21 |xl| p2\u2318v\u21e2;\n\notherwise.\n\nv\u21e2,\n\n\u02dc =(\u21e3Pl2R\u21e21 |xl|\u23182\n(\u02dc\u2713l, \u02dchl, \u02dcxl) =8><>:\n |xl|p2\u2318\u02dc ,xl, 0,\n1 Pk2 R\u21e2\\{i\u21e2}\n\n(0, 0, xl),\n\n5. Then the \u02dc\u2713 in Lemma 3 and its corresponding \u02dch, \u02dcx in (3) and (4) are obtained by\nif l 2 R\u21e2\\{i\u21e2};\nif l = i\u21e2;\nif l 2 [d]\\R\u21e2.\n\n\u02dc\u2713k, \u02dcgl(\u02dc\u2713l), xl + \u02dcgl(\u02dc\u2713l),\n\nIn Theorem 1, we give the main result about the SOTOPO algorithm.\nTheorem 1. The SOTOPO algorihtm in Alg. 1 can get the exact minimizer \u02dch, \u02dcx of the `1-regularized\n`1-norm square approximation problem in (3) and (4).\nThe SOTOPO algorithm seems complicated but is indeed ef\ufb01cient. 
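For intuition about this sort-and-threshold pattern, a sketch of the plain Euclidean projection onto the simplex (the SOPOPO-style baseline [20]; see also [12]) is given below; note this is the $O(d \log d)$ counterpart, not SOTOPO itself:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {theta : theta >= 0, sum(theta) = 1}
    via the classical sort-and-threshold procedure, O(d log d) overall."""
    u = np.sort(v)[::-1]               # sort in decreasing order
    css = np.cumsum(u) - 1.0           # partial sums minus the simplex budget
    # rho: last index (0-based) where the sorted value exceeds the running average
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    tau = css[rho] / (rho + 1.0)       # shared threshold (Lagrange multiplier)
    return np.maximum(v - tau, 0.0)
```

As in Alg. 1, a sort determines which coordinates survive, and a single scalar (here $\tau$, there $\tilde\lambda$) then fixes all surviving coordinates at once; SOTOPO replaces the Euclidean distance by the objective $J(\theta)$ in (9) and prunes the sort with Lemma 5.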
The dominant operations in Alg.\n1 are steps 1 and 2 with the total cost O(d + |Q| log |Q|). To show the effect of the complexity\nreduction by Lemma 5, we give the following fact.\n\n5\n\n\fProposition 1. For the optimization problem de\ufb01ned in (5)-(7), where is the regularization param-\neter of the original problem (1), we have that\n\n\u2318\n\n0 \uf8ff max\n\ni2[d](s2J0i(0)\n\nAssume vm is de\ufb01ned in the step 1 of Alg. 1. By Proposition 1, for all i 2 Q,\n\nj2[d]8<:\n\u2318 9=; \uf8ff 2.\ns2J0j(1)\n) max\nj2[d]8<:\n\u2318 9=;\ns2J0j(1)\n) \uf8ff max\n+ 2 =r 2vm\nTherefore at least the coordinates j\u2019s that satisfyq2J0j (0)\n>q 2vm\n\nk2[d](s2J0k(0)\n\ns2J0i(0)\n\n\u2318\n\n\uf8ff max\n\n\u2318\n\n\u2318\n\n\u2318\n\nQ. In practice, it can considerably reduce the sort complexity.\nRemark 1. SOTOPO can be viewed as an extension of the SOPOPO algorithm [20] by changing the\nobjective function from Euclidean distance to a more general function J(\u2713) in (9). It should be noted\nthat Lemma 5 does not have a counterpart in the case that the objective function is Euclidean distance\n[20]. In addition, some extension of randomized median \ufb01nding algorithm [12] with linear time in\nour setting is also deserved to research. Due to the limited space, it is left for further discussion.\n\n(15)\n\n+ 2,\n\n\u2318 + 2 will be not contained in\n\n3 The ASGCD algorithm\n\nNow we can come back to our motivation, i.e., accelerating GCD to obtain the optimal convergence\nrate O(1/p\u270f) by Nesterov\u2019s acceleration and reducing the complexity of greedy selection by stochas-\ntic optimization. The main idea is that although like any (block) coordinate descent algorithm, the\nproposed new rule, i.e., minimizing the problem in (3), performs update on one or several coordinates,\nit is a generalized proximal gradient descent problem based on `1-norm. 
Therefore this rule can be\napplied into the existing Nesterov\u2019s acceleration and stochastic optimization framework \u201cKatyusha\u201d\n[1] if it can be solved ef\ufb01ciently. The \ufb01nal result is the accelerated stochastic greedy coordinate\ndescent (ASGCD) algorithm, which is described in Alg. 2.\n\nAlgorithm 2 ASGCD\n\n = log(d) 1 p(log(d) 1)2 1;\n\n;\n\n2\n1+\n\n\np = 1 + , q = p\np1 , C = d\nz0 = y0 = \u02dcx0 = #0 = 0;\n\u23272 = 1\nb e,\u2318 =\nfor s = 0, 1, 2, . . . , S 1, do\ns+4 ,\u21b5 s = \u2318\n\n2 , m = d n\n\n(1+2 nb\n\n1\nb(n1) )L1\n\n;\n\n\u23271,sC ;\n\n1. \u23271,s = 2\n2. \u00b5s = rf (\u02dcxs);\n3. for l = 0, 1, . . . , m 1, do\n\n(a) k = (sm) + l;\n(b) randomly sample a mini batch B of size b from {1, 2, . . . , n} with equal probability;\n(c) xk+1 = \u23271,szk + \u23272 \u02dcxs + (1 \u23271,s \u23272)yk;\nbPj2B(rfj(xk+1) rfj(\u02dcxs));\n(d) \u02dcrk+1 = \u00b5s + 1\n(e) yk+1 =SOTOPO( \u02dcrk+1, xk+1,,\u2318 );\n(f) (zk+1,# k+1) = pCOMID( \u02dcrk+1,# k, q,,\u21b5 s);\nend for\n\n4. \u02dcxs+1 = 1\n\nl=1 ysm+l;\n\nmPm\n\nend for\nOutput: \u02dcxS\n\n6\n\n\fAlgorithm 3 (\u02dcx, \u02dc#) = pCOMID(g, #, q, , \u21b5 )\n\n1. 8i 2 [d], \u02dc#i = sign(#i \u21b5gi) \u00b7 max{0,|#i \u21b5gi| \u21b5};\n2. 8i 2 [d], \u02dcxi = sign( \u02dc#i)|\u02dc\u2713i|q1\n3. Output: \u02dcx, \u02dc#.\n\nk \u02dc#kq2\n\n;\n\nq\n\n4\n\n2m\n\np\u270f\n\n1\n\nS2\n\n2\n1+\n\n\n(1+2 nb\n\n1 + 2(b)\n\n1\nb(n1) )L1\n\nIn Alg. 2, the gradient descent step 3(e) is solved by the proposed SOTOPO algorithm, while the\nmirror descent step 3(f ) is solved by the COMID algorithm with p-norm divergence [13, Sec. 7.2].\nWe denote the mirror descent step as pCOMID in Alg. 3. All other parts are standard steps in the\nKatyusha framework except some parameter settings. For example, instead of the custom setting\np = 1 + 1/log(d) [21, 13], a particular choice p = 1 + ( is de\ufb01ned in Alg. 2) is used to minimize\n. 
C varies slowly over d and is upper bounded by log2(d). Meanwhile, \u21b5k+1 depends\nthe C = d\non the extra constant C. Furthermore, the step size \u2318 =\nis used, where L1 is de\ufb01ned\nin (2). Finally, unlike [1, Alg. 2], we let the batch size b as an algorithm parameter to cover both the\nstochastic case b < n and the deterministic case b = n. To the best of our knowledge, the existing\nGCD algorithms are deterministic, therefore by setting b = n, we can compare with the existing\nGCD algorithms better.\nBased on the ef\ufb01cient SOTOPO algorithm, ASGCD has nearly the same iteration complexity with\nthe standard form [1, Alg. 2] of Katyusha. Meanwhile we have the following convergence rate.\nTheorem 2. If each fj(x)(j 2 [n]) is convex, L1-smooth in (2) and x\u21e4 is an optimum of the\n`1-regularized problem (1), then ASGCD satis\ufb01es\n\u25c6 , (16)\nb(n1), S, b, m and C are given in Alg. 2. In other words, ASGCD achieves an\n\nC\u25c6 L1kx\u21e4k2\n\u270f-additive error (i.e., E[F (\u02dcxS)] F (x\u21e4) \uf8ff \u270f ) using at most O\u21e3pCL1kx\u21e4k1\n\n1 = O\u2713 CL1kx\u21e4k2\n\u2318 iterations.\n\n(S + 3)2\u27131 +\n\nE[F (\u02dcxS)] F (x\u21e4) \uf8ff\n\nwhere (b) = nb\n\npL1kxk1 log d\n\nO\u21e3pL2kx\u21e4k2p\u270f\n\nIn Table 1, we give the convergence rate of the existing algorithms and ASGCD to solve the `1-\nregularized problem (1).\nIn the \ufb01rst column, \u201cAcc\u201d and \u201cNon-Acc\u201d denote the corresponding\nalgorithms are Nesterov\u2019s accelerated or not respectively, \u201cPrimal\u201d and \u201cDual\u201d denote the corre-\nsponding algorithms solves the primal problem (1) and its regularized dual problem [22] respectively,\n`2-norm and `1-norm denote the theoretical guarantee is based on `2-norm and `1-norm respectively.\nIn terms of `2-norm based guarantee, Katyusha and APPROX give the state of the art convergence rate\n\n\u2318. 
In terms of the $\ell_1$-norm based guarantee, GCD gives the state-of-the-art convergence rate $O\big(L_1\|x^*\|_1^2/\epsilon\big)$, which is only applicable to the smooth case $\lambda = 0$ in (1). When $\lambda > 0$, the generalized GS-r, GS-s and GS-q rules generally have worse theoretical guarantees than GCD [19]. The bound of ASGCD in this paper is $O\big(\sqrt{L_1}\,\|x^*\|_1 \log d/\sqrt{\epsilon}\big)$, which can be viewed as an accelerated version of the $\ell_1$-norm based guarantee $O\big(L_1\|x^*\|_1^2/\epsilon\big)$. Meanwhile, because the bound depends on $\|x^*\|_1$ rather than $\|x^*\|_2$ and on $L_1$ rather than $L_2$ ($L_1$ and $L_2$ are defined in (2)), for the $\ell_1$-ERM problem, if the samples are high-dimensional and dense and the regularization parameter $\lambda$ is relatively large, then it is possible that $L_1 \ll L_2$ (in the extreme case, $L_2 = d L_1$ [11]) and $\|x^*\|_1 \approx \|x^*\|_2$. In this case, the $\ell_1$-norm based guarantee $O\big(\sqrt{L_1}\,\|x^*\|_1 \log d/\sqrt{\epsilon}\big)$ of ASGCD is better than the $\ell_2$-norm based guarantee $O\big(\sqrt{L_2}\,\|x^*\|_2/\sqrt{\epsilon}\big)$ of Katyusha and APPROX. Finally, whether the $\log d$ factor in the bound of ASGCD (which also appears in the COMID [13] analysis) is necessary deserves further research.

Remark 2. When the batch size $b = n$, ASGCD is a deterministic algorithm. In this case, we can use a better smooth constant $T_1$ that satisfies $\|\nabla f(x) - \nabla f(y)\|_\infty \le T_1 \|x - y\|_1$ rather than $L_1$ [1].

Remark 3. The necessity of computing the full gradient beforehand is the main bottleneck of GCD in applications [19]. There exists some work [11] that avoids the computation of the full gradient by performing approximate greedy selection. However, the method in [11] needs preprocessing,

Table 1: Convergence rate on $\ell_1$-regularized empirical risk minimization problems. (For GCD, the convergence rate applies for $\lambda = 0$.
)

ALGORITHM TYPE                    PAPER                                              CONVERGENCE RATE
NON-ACC, PRIMAL, $\ell_2$-NORM    SAGA [10]                                          $O(L_2\|x^*\|_2^2/\epsilon)$
ACC, PRIMAL, $\ell_2$-NORM        KATYUSHA [1]                                       $O(\sqrt{L_2}\,\|x^*\|_2/\sqrt{\epsilon})$
ACC, DUAL, $\ell_2$-NORM          ACC-SDCA [23], SPDC [26], APCG [16], APPROX [14]   $O(\sqrt{L_2}\,\|x^*\|_2/\sqrt{\epsilon}\cdot\log(1/\epsilon))$
NON-ACC, PRIMAL, $\ell_1$-NORM    GCD [3]                                            $O(L_1\|x^*\|_1^2/\epsilon)$
ACC, PRIMAL, $\ell_1$-NORM        ASGCD (THIS PAPER)                                 $O(\sqrt{L_1}\,\|x^*\|_1\log d/\sqrt{\epsilon})$

an incoherence condition on the dataset, and is somewhat complicated. Contrary to [11], the proposed ASGCD algorithm reduces the complexity of greedy selection by a factor of up to $n$ in terms of amortized cost, simply by applying an existing stochastic variance reduction framework.

4 Experiments

In this section, we use numerical experiments to demonstrate the theoretical results in Section 3 and to show the empirical performance of ASGCD with batch size $b = 1$ and of its deterministic version with $b = n$ (in Fig. 1 they are denoted as ASGCD ($b = 1$) and ASGCD ($b = n$), respectively). In addition, following the argument for measuring data access rather than CPU time [21] and the recent SGD and RCD literature [15, 16, 1], we use data access, i.e., the number of times the algorithm accesses the data matrix, to measure algorithm performance. To show the effect of Nesterov's acceleration, we compare ASGCD ($b = n$) with the non-accelerated greedy coordinate descent under the GS-q rule, i.e., coordinate gradient descent (CGD) [24]. To show the combined effect of the Nesterov's acceleration and stochastic optimization strategies, we compare ASGCD ($b = 1$) with Katyusha [1, Alg. 2]. To show the effect of the new rule proposed in Section 2, which is based on the $\ell_1$-norm square approximation, we compare ASGCD ($b = n$) with the $\ell_2$-norm based proximal accelerated full gradient (AFG) method implemented in the linear coupling framework [4].
Meanwhile, as a benchmark of stochastic optimization for problems with finite-sum structure, we also show the performance of the proximal stochastic variance reduced gradient (SVRG) method [25]. In addition, based on [1] and our experiments, we find that Katyusha [1, Alg. 2] has the best empirical performance in general for the $\ell_1$-regularized problem (1). Therefore other well-known state-of-the-art algorithms, such as APCG [16] and accelerated SDCA [23], are not included in the experiments.

The datasets are obtained from the LIBSVM data [9] and summarized in Table 2. All the algorithms are used to solve the following lasso problem

$$\min_{x\in\mathbb{R}^d}\Big\{F(x) = f(x) + \lambda\|x\|_1 = \frac{1}{2n}\|b - Ax\|_2^2 + \lambda\|x\|_1\Big\} \qquad (17)$$

on the 3 datasets, where $A = (a_1, a_2, \ldots, a_n)^T = (h_1, h_2, \ldots, h_d) \in \mathbb{R}^{n\times d}$ with each $a_j \in \mathbb{R}^d$ representing a sample vector and each $h_i \in \mathbb{R}^n$ representing a feature vector, and $b \in \mathbb{R}^n$ is the prediction vector.

Table 2: Characteristics of three real datasets.

DATASET NAME    # SAMPLES $n$    # FEATURES $d$
LEUKEMIA        38               7129
GISETTE         6000             5000
MNIST           60000            780

For ASGCD ($b = 1$) and Katyusha [1, Alg. 2], we can use the tight smooth constants $L_1 = \max_{j\in[n], i\in[d]} |a_{j,i}|^2$ and $L_2 = \max_{j\in[n]} \|a_j\|_2^2$ respectively in their implementation. While for ASGCD ($b = n$) and AFG, the better smooth constants $T_1 = \max_{i\in[d]} \|h_i\|_2^2/n$ and $T_2 = \|A\|_2^2/n$ are used respectively. The learning rates of CGD and SVRG are tuned in $\{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$.

Figure 1: Comparing ASGCD ($b = 1$) and ASGCD ($b = n$) with CGD, SVRG, AFG and Katyusha on lasso. (Panels: Leu, Gisette, Mnist for each of the two values of $\lambda$; axes: log loss vs. number of passes.)

Table 3: Factor rates $(r_1, r_2)$ for the 6 cases.

$\lambda$      LEU             GISETTE         MNIST
$10^{-2}$      (0.85, 1.33)    (0.88, 0.74)    (5.85, 3.02)
$10^{-6}$      (1.45, 2.27)    (3.51, 2.94)    (5.84, 3.02)

We use $\lambda = 10^{-6}$ and $\lambda = 10^{-2}$ in the experiments.
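For reference, the lasso objective (17) and the smooth constants used above can be computed directly from the data matrix; here is a small sketch on synthetic data (the real datasets come from LIBSVM):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 50, 30, 1e-2
A = rng.standard_normal((n, d))   # rows a_j: samples; columns h_i: features
b = rng.standard_normal(n)

def F(x):
    """Lasso objective (17): (1/2n)*||b - A x||_2^2 + lam*||x||_1."""
    return 0.5 * np.mean((b - A @ x) ** 2) + lam * np.abs(x).sum()

# Smooth constants used in the experiments:
L1 = np.max(A ** 2)                      # max_{j,i} a_{j,i}^2     (per-sample, l1-based)
L2 = np.max((A ** 2).sum(axis=1))        # max_j ||a_j||_2^2       (per-sample, l2-based)
T1 = np.max((A ** 2).sum(axis=0)) / n    # max_i ||h_i||_2^2 / n   (full-gradient, l1-based)
T2 = np.linalg.norm(A, 2) ** 2 / n       # ||A||_2^2 / n           (full-gradient, l2-based)
```

With these conventions $L_1 \le L_2$ and $T_1 \le T_2$ always hold; it is the ratios between the $\ell_1$- and $\ell_2$-based constants (together with $\|x^*\|_1$ versus $\|x^*\|_2$) that drive the comparison reported in Table 3.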
In addition, for each case (Dataset, $\lambda$), AFG is used to find an optimum $x^*$ with enough accuracy.

The performance of the 6 algorithms is plotted in Fig. 1. The y-axis shows the log loss $\log(F(x_k) - F(x^*))$; the x-axis denotes the number of times the algorithm accesses the data matrix $A$. For example, ASGCD ($b = n$) accesses $A$ once in each iteration, while ASGCD ($b = 1$) accesses $A$ twice in an entire outer iteration. For each case (Dataset, $\lambda$), we compute the rate

$$(r_1, r_2) = \left( \frac{\sqrt{C L_1}\, \|x^*\|_1}{\sqrt{L_2}\, \|x^*\|_2}, \; \frac{\sqrt{C T_1}\, \|x^*\|_1}{\sqrt{T_2}\, \|x^*\|_2} \right)$$

reported in Table 3. First, because of the acceleration effect, ASGCD ($b = n$) is always better than the non-accelerated CGD algorithm. Second, by comparing ASGCD ($b = 1$) with Katyusha and ASGCD ($b = n$) with AFG, we find that for the cases (Leu, $\lambda = 10^{-2}$), (Leu, $\lambda = 10^{-6}$) and (Gisette, $\lambda = 10^{-2}$), ASGCD ($b = 1$) dominates Katyusha [1, Alg. 2] and ASGCD ($b = n$) dominates AFG; this matches the theoretical analysis in Section 3, which shows that if $r_1$ is relatively small, such as around 1, then ASGCD ($b = 1$) will be better than [1, Alg. 2]. For the other 3 cases, [1, Alg. 2] and AFG are better. The consistency between Table 3 and Fig. 1 corroborates the theoretical analysis.

References

[1] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. ArXiv e-prints, abs/1603.05953, 2016.

[2] Zeyuan Allen-Zhu, Zhenyu Liao, and Lorenzo Orecchia. Spectral sparsification and regret minimization beyond matrix multiplicative updates. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 237–245. ACM, 2015.

[3] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. ArXiv e-prints, abs/1407.1537, July 2014.

[4] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent.
ArXiv e-prints, abs/1407.1537, July 2014.

[5] Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski, et al. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

[6] Keith Ball, Eric A. Carlen, and Elliott H. Lieb. Sharp uniform convexity and smoothness inequalities for trace norms. Inventiones mathematicae, 115(1):463–482, 1994.

[7] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[8] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[9] Chih-Chung Chang. LIBSVM: Introduction and benchmarks. http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2000.

[10] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[11] Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Nearest neighbor based greedy coordinate descent. In Advances in Neural Information Processing Systems, pages 2160–2168, 2011.

[12] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the $\ell_1$-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM, 2008.

[13] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, pages 14–26, 2010.

[14] Olivier Fercoq and Peter Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.

[15] Rie Johnson and Tong Zhang.
Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[16] Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pages 3059–3067, 2014.

[17] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[18] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

[19] Julie Nutini, Mark Schmidt, Issam H. Laradji, Michael Friedlander, and Hoyt Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1632–1641, 2015.

[20] Shai Shalev-Shwartz and Yoram Singer. Efficient learning of label ranking by soft projections onto polyhedra. Journal of Machine Learning Research, 7(Jul):1567–1599, 2006.

[21] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for $\ell_1$-regularized loss minimization. Journal of Machine Learning Research, 12(Jun):1865–1892, 2011.

[22] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.

[23] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In ICML, pages 64–72, 2014.

[24] Paul Tseng and Sangwoon Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1):387–423, 2009.

[25] Lin Xiao and Tong Zhang.
A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[26] Yuchen Zhang and Lin Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, volume 951, page 2015, 2015.

[27] Shuai Zheng and James T. Kwok. Fast-and-light stochastic ADMM. In The 25th International Joint Conference on Artificial Intelligence (IJCAI-16), New York City, NY, USA, 2016.