{"title": "Improved Iteration Complexity Bounds of Cyclic Block Coordinate Descent for Convex Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1306, "page_last": 1314, "abstract": "The iteration complexity of the block-coordinate descent (BCD) type algorithm has been under extensive investigation. It was recently shown that for convex problems the classical cyclic BCGD (block coordinate gradient descent) achieves an O(1/r) complexity (r is the number of passes of all blocks). However, such bounds are at least linearly depend on $K$ (the number of variable blocks), and are at least $K$ times worse than those of the gradient descent (GD) and proximal gradient (PG) methods.In this paper, we close such theoretical performance gap between cyclic BCD and GD/PG. First we show that for a family of quadratic nonsmooth problems, the complexity bounds for cyclic Block Coordinate Proximal Gradient (BCPG), a popular variant of BCD, can match those of the GD/PG in terms of dependency on $K$ (up to a \\log^2(K) factor). Second, we establish an improved complexity bound for Coordinate Gradient Descent (CGD) for general convex problems which can match that of GD in certain scenarios. Our bounds are sharper than the known bounds as they are always at least $K$ times worse than GD. {Our analyses do not depend on the update order of block variables inside each cycle, thus our results also apply to BCD methods with random permutation (random sampling without replacement, another popular variant).", "full_text": "Improved Iteration Complexity Bounds of Cyclic\nBlock Coordinate Descent for Convex Problems\n\nRuoyu Sun\u2217, Mingyi Hong\u2020\u2021\n\nAbstract\n\nThe iteration complexity of the block-coordinate descent (BCD) type algorithm\nhas been under extensive investigation. 
It was recently shown that for convex problems the classical cyclic BCGD (block coordinate gradient descent) achieves an O(1/r) complexity (r is the number of passes of all blocks). However, such bounds depend at least linearly on K (the number of variable blocks), and are at least K times worse than those of the gradient descent (GD) and proximal gradient (PG) methods. In this paper, we close such theoretical performance gap between cyclic BCD and GD/PG. First we show that for a family of quadratic nonsmooth problems, the complexity bounds for cyclic Block Coordinate Proximal Gradient (BCPG), a popular variant of BCD, can match those of the GD/PG in terms of dependency on K (up to a log²(K) factor). Second, we establish an improved complexity bound for Coordinate Gradient Descent (CGD) for general convex problems which can match that of GD in certain scenarios. Our bounds are sharper than the known bounds, which are always at least K times worse than GD. Our analyses do not depend on the update order of block variables inside each cycle, thus our results also apply to BCD methods with random permutation (random sampling without replacement, another popular variant).

1 Introduction

Consider the following convex optimization problem

    min f(x) = g(x_1, ..., x_K) + Σ_{k=1}^K h_k(x_k),   s.t.  x_k ∈ X_k,  ∀ k = 1, ..., K,   (1)

where g : X → R is a convex smooth function; each h_k : X_k → R is a convex lower semi-continuous possibly nonsmooth function; x_k ∈ X_k ⊆ R^N is a block variable. A very popular method for solving this problem is the so-called block coordinate descent (BCD) method [5], where each time a single block variable is optimized while the rest of the variables remain fixed. 
Using the classical cyclic block selection rule, the BCD method can be described below.

Algorithm 1: The Cyclic Block Coordinate Descent (BCD)

At each iteration r, update the variable blocks by:

    x_k^{(r)} ∈ arg min_{x_k ∈ X_k}  g(x_k, w_{−k}^{(r)}) + h_k(x_k),   k = 1, ..., K,   (2)

where we have used the following short-handed notations:

    w_k^{(r)} := (x_1^{(r)}, ..., x_{k−1}^{(r)}, x_k^{(r−1)}, x_{k+1}^{(r−1)}, ..., x_K^{(r−1)}),   k = 1, ..., K,
    w_{−k}^{(r)} := (x_1^{(r)}, ..., x_{k−1}^{(r)}, x_{k+1}^{(r−1)}, ..., x_K^{(r−1)}),   k = 1, ..., K,
    x_{−k} := [x_1, ..., x_{k−1}, x_{k+1}, ..., x_K].

∗ Department of Management Science and Engineering, Stanford University, Stanford, CA. ruoyu@stanford.edu
† Department of Industrial & Manufacturing Systems Engineering and Department of Electrical & Computer Engineering, Iowa State University, Ames, IA. mingyi@iastate.edu
‡ The authors contribute equally to this work.

The convergence analysis of the BCD has been extensively studied in the literature, see [5, 14, 19, 15, 4, 7, 6, 10, 20]. For example it is known that for smooth problems (i.e. f is continuously differentiable but possibly nonconvex, h = 0), if each subproblem has a unique solution and g is non-decreasing in the interval between the current iterate and the minimizer of the subproblem (one special case is per-block strict convexity), then every limit point of {x^{(r)}} is a stationary point [5, Proposition 2.7.1]. The authors of [6, 19] have derived relaxed conditions on the convergence of BCD. 
In particular, when problem (1) is convex and the level sets are compact, the convergence of the BCD is guaranteed without requiring the subproblems to have unique solutions [6]. Recently Razaviyayn et al [15] have shown that the BCD converges if each subproblem (2) is solved inexactly, by way of optimizing certain surrogate functions.

Luo and Tseng in [10] have shown that when problem (1) satisfies certain additional assumptions, such as having a smooth composite objective and a polyhedral feasible set, then BCD converges linearly without requiring the objective to be strongly convex. There are many recent works on showing iteration complexity for randomized BCGD (block coordinate gradient descent), see [17, 12, 8, 16, 9] and the references therein. However the results on the classical cyclic BCD are rather scant. Saha and Tewari [18] show that the cyclic BCD achieves sublinear convergence for a family of special LASSO problems. Nutini et al [13] show that when the problem is strongly convex, unconstrained and smooth, BCGD with certain Gauss-Southwell block selection rule could be faster than the randomized rule. Recently Beck and Tetruashvili [4] show that cyclic BCGD converges sublinearly if the objective is smooth. Subsequently Hong et al in [7] show that such sublinear rate not only can be extended to problems with nonsmooth objectives, but is true for a large family of BCD-type algorithms (with or without per-block exact minimization, which includes BCGD as a special case). When each block is minimized exactly and when there is no per-block strong convexity, Beck [2] proves sublinear convergence for certain 2-block convex problems (with only one block having Lipschitzian gradient). 
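For concreteness, Algorithm 1 with scalar blocks and exact per-block minimization can be sketched for a least-squares g; this is our own illustration (function name, sizes, and data are arbitrary choices, not from the paper):

```python
import numpy as np

def cyclic_bcd(A, b, x0, passes):
    """Cyclic BCD (Algorithm 1) for f(x) = 0.5*||Ax - b||^2 with scalar
    blocks: each subproblem over x_k is solved exactly in closed form
    while the other coordinates are held fixed."""
    x = x0.astype(float).copy()
    for _ in range(passes):                  # one pass = one cycle over all blocks
        for k in range(x.size):              # cyclic block selection rule
            a_k = A[:, k]
            r_k = b - A @ x + a_k * x[k]     # residual with block k removed
            x[k] = a_k @ r_k / (a_k @ a_k)   # exact minimizer of the k-th subproblem
    return x
```

On a well-conditioned least-squares instance this iterates to the minimizer; for nonsmooth h_k the per-block subproblem would additionally carry the h_k term, as in (2).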
It is worth mentioning that all the above results on cyclic BCD can be used to prove the complexity for a popular randomly permuted BCD in which the blocks are randomly sampled without replacement.

To illustrate the rates developed for the cyclic BCD algorithm, let us define X∗ to be the optimal solution set for problem (1), and define the constant

    R_0 := max_{x ∈ X} max_{x∗ ∈ X∗} { ‖x − x∗‖ : f(x) ≤ f(x^{(0)}) }.   (3)

Let us assume that h_k(x_k) ≡ 0, X_k = R^N, ∀ k for now, and assume that g(·) has Lipschitz continuous gradient:

    ‖∇g(x) − ∇g(z)‖ ≤ L ‖x − z‖,   ∀ x, z ∈ X.   (4)

Also assume that g(·, x_{−k}) has Lipschitz continuous gradient with respect to each x_k, i.e.,

    ‖∇_k g(x_k, x_{−k}) − ∇_k g(v_k, x_{−k})‖ ≤ L_k ‖x_k − v_k‖,   ∀ x, v ∈ X, ∀ k.   (5)

Let L_max := max_k L_k and L_min := min_k L_k. It is known that the cyclic BCPG has the following iteration complexity [4, 7]^1:

    Δ_BCD^{(r)} := f(x^{(r)}) − f∗ ≤ C L_max (1 + K L²/L_min²) R_0² · (1/r),   ∀ r ≥ 1,   (6)

where C > 0 is some constant independent of problem dimension. Similar bounds are provided for cyclic BCD in [7, Theorem 6.1]. 
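To make the constants in (4)-(5) concrete: for a quadratic g(x) = (1/2)‖Ax − b‖² one has L = λ_max(AᵀA) and L_k = λ_max(A_kᵀA_k), and these always satisfy L_k ≤ L ≤ Σ_k L_k (each A_kᵀA_k is a principal submatrix of AᵀA, and AAᵀ = Σ_k A_k A_kᵀ). A small numerical check, ours for illustration only (sizes and random data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
K, M, N = 4, 30, 3                       # number of blocks, rows, block width (arbitrary)
A_blocks = [rng.standard_normal((M, N)) for _ in range(K)]
A = np.hstack(A_blocks)                  # A = [A_1, ..., A_K]

# Global Lipschitz constant of grad g for g(x) = 0.5*||Ax - b||^2, cf. (4)
L = np.linalg.eigvalsh(A.T @ A).max()

# Per-block constants, cf. (5): L_k = lambda_max(A_k^T A_k)
Lk = np.array([np.linalg.eigvalsh(Ak.T @ Ak).max() for Ak in A_blocks])

# Sandwich relation used implicitly throughout: max_k L_k <= L <= sum_k L_k
relation_holds = Lk.max() <= L <= Lk.sum()
```

The ratio L/L_min appearing in (6) is exactly what can hide an extra factor of K, which is the gap this paper targets.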
In contrast, it is well known that when applying the classical gradient descent (GD) method to problem (1) with the constant stepsize 1/L, we have the following rate estimate [11, Corollary 2.1.2]:

    Δ_GD^{(r)} := f(x^{(r)}) − f(x∗) ≤ 2 ‖x^{(0)} − x∗‖² L / (r + 4) ≤ 2 R_0² L / (r + 4),   ∀ r ≥ 1, ∀ x∗ ∈ X∗.   (7)

^1 Note that the assumptions made in [4] and [7] are slightly different, but the rates derived in both cases have similar dependency on the problem dimension K.

Note that unlike (6), here the constant in front of the 1/(r + 4) term is independent of the problem dimension. In fact, the ratio of the bound given in (6) and (7) is

    (C L_max / L) (1 + K L²/L_min²) · (r + 4)/r,

which is at least in the order of K. For big data related problems with over millions of variables, a multiplicative constant in the order of K can be a serious issue. In a recent work by Saha and Tewari [18], the authors show that for a LASSO problem with special data matrix, the rate of cyclic BCD (with special initialization) is indeed K-independent. Unfortunately, such a result has not yet been extended to any other convex problems. An open question posed by a few authors [4, 3, 18] is: is such a K factor gap intrinsic to the cyclic BCD, or merely an artifact of the existing analysis?

2 Improved Bounds of Cyclic BCPG for Nonsmooth Quadratic Problem

In this section, we consider the following nonsmooth quadratic problem

    min f(x) := (1/2) ‖ Σ_{k=1}^K A_k x_k − b ‖² + Σ_{k=1}^K h_k(x_k),   s.t.  x_k ∈ X_k, ∀ k,   (8)

where A_k ∈ R^{M×N}; b ∈ R^M; x_k ∈ R^N is the kth block coordinate; h_k(·) is the same as in (1). Define A := [A_1, ..., A_K] ∈ R^{M×KN}. 
For simplicity, we have assumed that all the blocks have the same size. Problem (8) includes for example LASSO and group LASSO as special cases.

We consider the following cyclic BCPG algorithm.

Algorithm 2: The Cyclic Block Coordinate Proximal Gradient (BCPG)

At each iteration r + 1, update the variable blocks by:

    x_k^{(r+1)} = arg min_{x_k ∈ X_k}  g(w_k^{(r+1)}) + ⟨∇_k g(w_k^{(r+1)}), x_k − x_k^{(r)}⟩ + (P_k/2) ‖x_k − x_k^{(r)}‖² + h_k(x_k).   (9)

Here P_k is the inverse of the stepsize for x_k, which satisfies

    P_k ≥ λ_max(A_kᵀ A_k) = L_k,   ∀ k.   (10)

Define P_max := max_k P_k and P_min := min_k P_k. Note that for the least square problem (smooth quadratic minimization, i.e. h_k ≡ 0, ∀ k), BCPG reduces to the widely used BCGD method.

The optimality condition for the kth subproblem is given by

    ⟨∇_k g(w_k^{(r+1)}) + P_k (x_k^{(r+1)} − x_k^{(r)}), x_k − x_k^{(r+1)}⟩ + h_k(x_k) − h_k(x_k^{(r+1)}) ≥ 0,   ∀ x_k ∈ X_k.   (11)

In what follows we show that the cyclic BCPG for problem (8) achieves a complexity bound that only depends on log²(NK), and apart from such log factor it is at least K times better than those known in the literature. Our analysis consists of the following three main steps:

1. Estimate the descent of the objective after each BCPG iteration;
2. Estimate the cost yet to be minimized (cost-to-go) after each BCPG iteration;
3. Combine the above two estimates to obtain the final bound.

First we show that the BCPG achieves sufficient descent.

Lemma 2.1. We have the following estimate of the descent when using the BCPG:

    f(x^{(r)}) − f(x^{(r+1)}) ≥ Σ_{k=1}^K (P_k/2) ‖x_k^{(r+1)} − x_k^{(r)}‖².   (12)

Proof. We have the following series of inequalities:

    f(x^{(r)}) − f(x^{(r+1)}) = Σ_{k=1}^K [ f(w_k^{(r+1)}) − f(w_{k+1}^{(r+1)}) ]
    ≥ Σ_{k=1}^K [ −⟨∇_k g(w_k^{(r+1)}), x_k^{(r+1)} − x_k^{(r)}⟩ − (P_k/2) ‖x_k^{(r+1)} − x_k^{(r)}‖² + h_k(x_k^{(r)}) − h_k(x_k^{(r+1)}) ]
    ≥ Σ_{k=1}^K (P_k/2) ‖x_k^{(r+1)} − x_k^{(r)}‖²,

where the first inequality applies the descent lemma to the k-th block (using P_k ≥ L_k), and the second inequality uses the optimality condition (11) with x_k = x_k^{(r)}. Q.E.D.

To proceed, let us introduce two matrices P̃ and Ã given below, which have dimension K × K and MK × NK, respectively:

    P̃ := Diag(P_1, ..., P_K),   Ã := BlockDiag(A_1, ..., A_K).   (13)

By utilizing the definition of P_k in (10) we have the following inequalities (the second inequality comes from [12, Lemma 1]):

    P̃ ⊗ I_N ⪰ Ãᵀ Ã,   K Ãᵀ Ã ⪰ Aᵀ A,

where I_N is the N × N identity matrix and the notation "⊗" denotes the Kronecker 
product.

Next let us estimate the cost-to-go.

Lemma 2.2. We have the following estimate of the optimality gap when using the BCPG:

    Δ^{(r+1)} := f(x^{(r+1)}) − f(x∗) ≤ R_0 log(2NK) ( L/√P_min + √P_max ) ‖(P̃^{1/2} ⊗ I_N)(x^{(r+1)} − x^{(r)})‖.   (14)

Our third step combines the previous two steps and characterizes the iteration complexity. This is the main result of this section.

Theorem 2.1. The iteration complexity of using BCPG to solve (8) is given below.

1. When the stepsizes are chosen conservatively as P_k = L, ∀ k, we have

    Δ^{(r+1)} ≤ 3 max{ Δ_0, 4 log²(2NK) L R_0² } · 1/(r + 1).   (15)

2. When the stepsizes are chosen as P_k = λ_max(A_kᵀ A_k) = L_k, ∀ k, then we have

    Δ^{(r+1)} ≤ 3 max{ Δ_0, 2 log²(2NK) (L_max + L²/L_min) R_0² } · 1/(r + 1).   (16)

In particular, if the problem is smooth and unconstrained, i.e., when h ≡ 0 and X_k = R^N, ∀ k, then we have

    Δ^{(r+1)} ≤ 3 max{ L, 2 log²(2NK) (L_max + L²/L_min) } · R_0²/(r + 1).   (17)

We comment on the bounds derived in the above theorem. The bound for BCPG with uniform "conservative" stepsize 1/L has the same order as the GD method, except for the log²(2NK) factor (cf. (7)). In [4, Corollary 3.2], it is shown that the BCGD with the same "conservative" stepsize achieves a sublinear rate with a constant of 4L(1 + K)R_0², which is about K/(3 log²(2NK)) times worse than our bound. Further, our bound has the same dependency on L (i.e., 12L vs. 
L/2) as the one derived in [18] for BCPG with a "conservative" stepsize to solve an ℓ1 penalized quadratic problem with special data matrix, but our bound holds true for a much larger class of problems (i.e., all quadratic nonsmooth problems in the form of (8)). However, in practice such conservative stepsize is slow (compared with BCPG with P_k = L_k, for all k) hence is rarely used.

The rest of the bounds derived in Theorem 2.1 are again at least K/log²(2NK) times better than existing bounds of cyclic BCPG. For example, when the problem is smooth and unconstrained, the ratio between our bound (17) and the bound (6) is given by

    6 R_0² log²(2NK) (L²/L_min + L_max) / [ C L_max (1 + K L²/L_min²) R_0² ]
    ≤ 6 log²(2NK) (1 + L²/(L_min L_max)) / [ C (1 + K L²/L_min²) ] = O(log²(2NK)/K),   (18)

where in the last inequality we have used the fact that L_max/L_min ≥ 1.

For unconstrained smooth problems, let us compare the bound derived in the second part of Theorem 2.1 (stepsize P_k = L_k, ∀ k) with that of the GD (7). If L = K L_k for all k (problem badly conditioned), our bound is about K log²(2NK) times worse than that of the GD. This indicates a counter-intuitive phenomenon: by choosing the conservative stepsize P_k = L, ∀ k, the iteration complexity of BCGD is K times better compared with choosing a more aggressive stepsize P_k = L_k, ∀ k. It also indicates that the factor L/L_min may hide an additional factor of K.

3 Iteration Complexity for General Convex Problems

In this section, we consider improved iteration complexity bounds of BCD for general unconstrained smooth convex problems. We prove a general iteration complexity result, which includes a result of Beck et al. [4] as a special case. Our analysis for the general case also applies to smooth quadratic problems, but is very different from the analysis in previous sections for quadratic problems. 
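To connect back to Section 2: for an ℓ1-penalized least-squares instance of (8) with scalar blocks, the BCPG update (9) has a closed-form soft-thresholding solution. The following is our own minimal sketch (names, data, and the regularization weight are arbitrary choices, not from the paper):

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding: proximal operator of t*|.|"""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cyclic_bcpg_lasso(A, b, lam, x0, passes):
    """Cyclic BCPG (update (9)) for f(x) = 0.5*||Ax - b||^2 + lam*||x||_1
    with scalar blocks and stepsizes P_k = lambda_max(A_k^T A_k) = ||a_k||^2,
    cf. (10)."""
    x = x0.astype(float).copy()
    P = (A * A).sum(axis=0)                  # P_k = ||a_k||^2
    for _ in range(passes):
        for k in range(x.size):
            g_k = A[:, k] @ (A @ x - b)      # grad_k g at the current point w_k
            # quadratic upper bound + l1 prox  =>  coordinate soft-threshold
            x[k] = soft(x[k] - g_k / P[k], lam / P[k])
    return x
```

At a fixed point every coordinate satisfies the subgradient optimality condition |∇_k g(x)| ≤ lam (with equality on nonzero coordinates), which is an easy runtime sanity check.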
For simplicity, we only consider the case N = 1 (scalar blocks); the generalization to the case N > 1 is left as future work.

Let us assume that the smooth objective g has second order derivatives H_ij(x) := ∂²g/(∂x_i ∂x_j)(x). When each block is just a coordinate, we assume |H_ij(x)| ≤ L_ij, ∀ i, j. Then L_i = L_ii and L_ij ≤ √(L_i L_j). For unconstrained smooth convex problems with scalar block variables, the BCPG iteration reduces to the following coordinate gradient descent (CGD) iteration:

    x^{(r)} = w_1^{(r)} →^{d_1} w_2^{(r)} →^{d_2} w_3^{(r)} → ... →^{d_K} w_{K+1}^{(r)} = x^{(r+1)},   (19)

where d_k = ∇_k g(w_k^{(r)}) and w_k^{(r)} →^{d_k} w_{k+1}^{(r)} means that w_{k+1}^{(r)} is a linear combination of w_k^{(r)} and d_k e_k (e_k is the k-th block unit vector), namely w_{k+1}^{(r)} = w_k^{(r)} − (1/P_k) d_k e_k.

In the following theorem, we provide an iteration complexity bound for the general convex problem. The proof framework follows the standard three-step approach that combines sufficient descent and cost-to-go estimates; nevertheless, the analysis of the sufficient descent is very different from the methods used in the previous sections. The intuition is that CGD can be viewed as an inexact gradient descent method, thus the amount of descent can be bounded in terms of the norm of the full gradient. Having established the sufficient descent in terms of the full gradient ∇g(x^{(r)}), we can easily prove the iteration complexity result, following the standard analysis of GD (see, e.g. [11, Theorem 2.1.13]).

Theorem 3.1. 
For CGD with P_k ≥ L_max, ∀ k, we have

    g(x^{(r)}) − g(x∗) ≤ 2 ( P_max + min{ K L², (Σ_k L_k)² } / P_min ) R_0² / r,   ∀ r ≥ 1.   (20)

Proof. Since w_{k+1}^r and w_k^r only differ by the k-th block, and ∇_k g is Lipschitz continuous with Lipschitz constant L_k, we have²

    g(w_{k+1}^r) ≤ g(w_k^r) + ⟨∇_k g(w_k^r), w_{k+1}^r − w_k^r⟩ + (L_k/2) ‖w_{k+1}^r − w_k^r‖²
    = g(w_k^r) − ((2P_k − L_k)/(2P_k²)) ‖∇_k g(w_k^r)‖²
    ≤ g(w_k^r) − (1/(2P_k)) ‖∇_k g(w_k^r)‖²,   (21)

where the last inequality is due to P_k ≥ L_k. The amount of decrease can be estimated as

    g(x^r) − g(x^{r+1}) = Σ_{k=1}^K [ g(w_k^r) − g(w_{k+1}^r) ] ≥ Σ_{k=1}^K (1/(2P_k)) ‖∇_k g(w_k^r)‖².   (22)

Since

    w_k^r = x^r − [ (1/P_1) d_1, ..., (1/P_{k−1}) d_{k−1}, 0, ..., 0 ]ᵀ,   (23)

by the mean-value theorem, there must exist ξ_k such that

    ∇_k g(x^r) − ∇_k g(w_k^r) = ∇(∇_k g)(ξ_k) · (x^r − w_k^r)
    = [ H_{k1}(ξ_k), ..., H_{k,k−1}(ξ_k), 0, ..., 0 ] [ (1/P_1) d_1, ..., (1/P_{k−1}) d_{k−1}, 0, ..., 0 ]ᵀ,

where H_{ij}(x) = ∂²g/(∂x_i ∂x_j)(x) is the second order derivative of g. Then

    ∇_k g(x^r) = (∇_k g(x^r) − ∇_k g(w_k^r)) + ∇_k g(w_k^r)
    = [ (1/√P_1) H_{k1}(ξ_k), ..., (1/√P_{k−1}) H_{k,k−1}(ξ_k), √P_k, 0, ..., 0 ] [ (1/√P_1) d_1, ..., (1/√P_K) d_K ]ᵀ = v_kᵀ d,   (24)

where we have defined

    d := [ (1/√P_1) d_1, ..., (1/√P_K) d_K ]ᵀ,
    v_k := [ (1/√P_1) H_{k1}(ξ_k), ..., (1/√P_{k−1}) H_{k,k−1}(ξ_k), √P_k, 0, ..., 0 ]ᵀ.   (25)

Let V be the K × K matrix whose k-th row is v_kᵀ:

    V := [ v_1ᵀ; ...; v_Kᵀ ] =
    [ √P_1
      (1/√P_1) H_21(ξ_2)    √P_2
      (1/√P_1) H_31(ξ_3)    (1/√P_2) H_32(ξ_3)    √P_3
      ...                   ...                   ...    ...
      (1/√P_1) H_K1(ξ_K)    (1/√P_2) H_K2(ξ_K)    ...    (1/√P_{K−1}) H_{K,K−1}(ξ_K)    √P_K ],   (26)

with zeros above the diagonal. Therefore, we have

    ‖∇g(x^r)‖² = Σ_k ‖∇_k g(x^r)‖² = Σ_k ‖v_kᵀ d‖² = ‖V d‖² ≤ ‖V‖² ‖d‖² = ‖V‖² Σ_k (1/P_k) ‖∇_k g(w_k^r)‖².

Combining with (22), we get

    g(x^r) − g(x^{r+1}) ≥ Σ_k (1/(2P_k)) ‖∇_k g(w_k^r)‖² ≥ (1/(2‖V‖²)) ‖∇g(x^r)‖².   (27)

Let D := Diag(P_1, ..., P_K) and let H(ξ) be defined as the strictly lower triangular matrix

    H(ξ) := [ 0; H_21(ξ_2) 0; H_31(ξ_3) H_32(ξ_3) 0; ...; H_K1(ξ_K) H_K2(ξ_K) ... H_{K,K−1}(ξ_K) 0 ].   (28)

Then V = D^{1/2} + H(ξ) D^{−1/2}, which implies

    ‖V‖² = ‖D^{1/2} + H(ξ) D^{−1/2}‖² ≤ 2 ( ‖D^{1/2}‖² + ‖H(ξ) D^{−1/2}‖² ) ≤ 2 ( P_max + ‖H(ξ)‖²/P_min ).   (29)

Plugging into (27), we obtain

    g(x^{(r)}) − g(x^{(r+1)}) ≥ (1/2) · 1/( P_max + ‖H(ξ)‖²/P_min ) · ‖∇g(x^{(r)})‖².   (30)

From the fact that H_{kj}(ξ_k) is a scalar bounded above by |H_{kj}(ξ_k)| ≤ L_{kj} ≤ √(L_k L_j), we have

    ‖H(ξ)‖² ≤ ‖H(ξ)‖_F² = Σ_{k,j} |H_{kj}(ξ_k)|² ≤ Σ_{k,j} L_k L_j ≤ (Σ_k L_k)².

We provide the second bound on ‖H(ξ)‖ below. Let H_k denote the k-th row of H(ξ); then ‖H_k‖ ≤ L, and therefore

    ‖H(ξ)‖² ≤ Σ_k ‖H_k‖² ≤ K L².

Combining the two bounds gives ‖H(ξ)‖² ≤ min{K L², (Σ_k L_k)²}; plugging this into (30) and following the standard analysis of GD [11, Theorem 2.1.13] yields (20). Q.E.D.

² A stronger bound is g(w_{k+1}^r) ≤ g(w_k^r) − (1/(2P̂_k)) ‖∇_k g(w_k^r)‖², where P̂_k = P_k²/(2P_k − L_k) ≤ P_k; but since P_k ≤ 2P_k − L_k ≤ 2P_k, the improvement ratio of using this stronger bound is no more than a factor of 2.

When L_2 = ··· = L_K = 0, the new bound (Σ_k L_k)² is K times smaller than the existing bound K L². In fact, when L = K L_k, ∀ k, our new bound is K times better than the bound in [4] for either P_k = L_k or P_k = L. For example, when P_k = L, ∀ k, the bound in [4] becomes O(KL/r), while (20) gives O(L/r), which matches GD (listed in Table 1 below). Another advantage of the new bound (Σ_k L_k)² is that it does not increase if we add an artificial block x_{K+1} and perform CGD for the function g̃(x, x_{K+1}) = g(x); in contrast, the existing bound K L² will increase to (K + 1)L², even though the algorithm does not change at all.

We have demonstrated that our bound can match GD in some cases, but can possibly be K times worse than GD. An interesting question is: for general convex problems, can we obtain an O(L/r) bound for cyclic BCGD, matching the bound of GD? Removing the K-factor in (32) would lead to an O(L/r) bound for the conservative stepsize P_k = L no matter how large L_k and L are. We conjecture that an O(L/r) bound for cyclic BCGD cannot be achieved for general convex problems. That being said, we point out that the iteration complexity of cyclic BCGD may depend on other intrinsic parameters of the problem such as {L_k}_k and, possibly, third order derivatives of g. 
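The comparison between GD and cyclic CGD with the conservative stepsize P_k = L can also be probed numerically. The toy experiment below is our own illustration (sizes, data, and iteration counts are arbitrary), not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 10))
b = rng.standard_normal(200)

def f(v):
    """Objective g(x) = 0.5*||Ax - b||^2."""
    return 0.5 * np.sum((A @ v - b) ** 2)

fstar = f(np.linalg.lstsq(A, b, rcond=None)[0])
L = np.linalg.eigvalsh(A.T @ A).max()    # global Lipschitz constant of the gradient

# gradient descent with stepsize 1/L
x = np.zeros(10)
for _ in range(100):
    x -= A.T @ (A @ x - b) / L

# cyclic CGD with the conservative stepsize P_k = L (so P_k >= L_max)
y = np.zeros(10)
for _ in range(400):                     # 400 passes over all coordinates
    for k in range(10):
        y[k] -= A[:, k] @ (A @ y - b) / L

gap_gd, gap_cgd = f(x) - fstar, f(y) - fstar
```

On well-conditioned instances like this one both methods drive the optimality gap down quickly; distinguishing the K-dependence of the two bounds would require badly conditioned instances with L close to K·L_k.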
Thus the question of finding the best iteration complexity bound of the form O(h(K) L/r), where h(K) is a function of K, may not be the right question to ask for BCD type algorithms.

4 Conclusion

In this paper, we provide new analysis and improved complexity bounds for cyclic BCD-type methods. For convex quadratic problems, we show that the bounds are O(L/r), which is independent of K (except for a mild log²(2K) factor) and is about L_max/L + L/L_min times worse than those for GD/PG. By a simple example we show that it is not possible to obtain an iteration complexity O(L/(Kr)) for cyclic BCPG. For illustration, the main results of this paper in several simple settings are summarized in the table below. Note that different ratios of L over L_k can lead to quite different comparisons.

Table 1: Comparison of Various Iteration Complexity Results

Lip-constant / 1/Stepsize | Diagonal Hessian, L_i = L, P_i = L | Full Hessian, L_i = L/K, large stepsize P_i = L/K | Full Hessian, L_i = L/K, small stepsize P_i = L
GD                 | L/r           | N/A            | L/r
Random BCGD        | L/r           | L/(Kr)         | L/r
Cyclic BCGD [4]    | KL/r          | K²L/r          | KL/r
Cyclic CGD, Cor 3.1| KL/r          | KL/r           | L/r
Cyclic BCGD (QP)   | log²(2K)L/r   | log²(2K)KL/r   | log²(2K)L/r

References

[1] J. R. Angelos, C. C. Cowen, and S. K. Narayan. Triangular truncation and finding the norm of a Hadamard multiplier. Linear Algebra and its Applications, 170:117-135, 1992.

[2] A. Beck. On the convergence of alternating minimization with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1):185-209, 2015.

[3] A. Beck, E. Pauwels, and S. Sabach. The cyclic block coordinate gradient method for convex optimization problems. 2015. Preprint, available on arXiv:1502.03716v1.

[4] A. Beck and L. Tetruashvili. 
On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037-2060, 2013.

[5] D. P. Bertsekas. Nonlinear Programming, 2nd ed. Athena Scientific, Belmont, MA, 1999.

[6] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Operations Research Letters, 26:127-136, 2000.

[7] M. Hong, X. Wang, M. Razaviyayn, and Z.-Q. Luo. Iteration complexity analysis of block coordinate descent methods. 2013. Preprint, available online arXiv:1310.6957.

[8] Z. Lu and L. Xiao. On the complexity analysis of randomized block-coordinate descent methods. 2013. Accepted by Mathematical Programming.

[9] Z. Lu and L. Xiao. Randomized block coordinate non-monotone gradient method for a class of nonlinear programming. 2013. Preprint.

[10] Z.-Q. Luo and P. Tseng. On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7-35, 1992.

[11] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.

[12] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.

[13] J. Nutini, M. Schmidt, I. H. Laradji, M. Friedlander, and H. Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

[14] M. J. D. Powell. On search directions for minimization algorithms. Mathematical Programming, 4:193-201, 1973.

[15] M. Razaviyayn, M. Hong, and Z.-Q. Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126-1153, 2013.

[16] M. Razaviyayn, M. Hong, Z.-Q. Luo, and J. S. Pang. 
Parallel successive convex approximation for nonsmooth nonconvex optimization. In Proceedings of Neural Information Processing Systems (NIPS), 2014.

[17] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144:1-38, 2014.

[18] A. Saha and A. Tewari. On the nonasymptotic convergence of cyclic coordinate descent methods. SIAM Journal on Optimization, 23(1):576-601, 2013.

[19] P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 103(9):475-494, 2001.

[20] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758-1789, 2013.
", "award": [], "sourceid": 814, "authors": [{"given_name": "Ruoyu", "family_name": "Sun", "institution": "Stanford university"}, {"given_name": "Mingyi", "family_name": "Hong", "institution": null}]}