{"title": "Accelerated Mini-batch Randomized Block Coordinate Descent Method", "book": "Advances in Neural Information Processing Systems", "page_first": 3329, "page_last": 3337, "abstract": "We consider regularized empirical risk minimization problems. In particular, we minimize the sum of a smooth empirical risk function and a nonsmooth regularization function. When the regularization function is block separable, we can solve the minimization problems in a randomized block coordinate descent (RBCD) manner. Existing RBCD methods usually decrease the objective value by exploiting the partial gradient of a randomly selected block of coordinates in each iteration. Thus they need all data to be accessible so that the partial gradient of the selected block can be exactly obtained. However, such a \u201cbatch\u201d setting may be computationally expensive in practice. In this paper, we propose a mini-batch randomized block coordinate descent (MRBCD) method, which estimates the partial gradient of the selected block based on a mini-batch of randomly sampled data in each iteration. We further accelerate the MRBCD method by exploiting the semi-stochastic optimization scheme, which effectively reduces the variance of the partial gradient estimators. Theoretically, we show that for strongly convex functions, the MRBCD method attains lower overall iteration complexity than existing RBCD methods. As an application, we further tailor the MRBCD method to solve regularized sparse learning problems. 
Our numerical experiments show that the MRBCD method naturally exploits the sparsity structure and achieves better computational performance than existing methods.", "full_text": "Accelerated Mini-batch Randomized Block Coordinate Descent Method\n\nTuo Zhao\u2020\u00a7* Mo Yu\u2021* Yiming Wang\u2020 Raman Arora\u2020 Han Liu\u00a7\n\n\u2020Johns Hopkins University \u2021Harbin Institute of Technology \u00a7Princeton University\n\n{tzhao5,myu25,freewym}@jhu.edu,arora@cs.jhu.edu,hanliu@princeton.edu\n\nAbstract\n\nWe consider regularized empirical risk minimization problems. In particular, we minimize the sum of a smooth empirical risk function and a nonsmooth regularization function. When the regularization function is block separable, we can solve the minimization problems in a randomized block coordinate descent (RBCD) manner. Existing RBCD methods usually decrease the objective value by exploiting the partial gradient of a randomly selected block of coordinates in each iteration. Thus they need all data to be accessible so that the partial gradient of the selected block can be exactly obtained. However, such a \u201cbatch\u201d setting may be computationally expensive in practice. In this paper, we propose a mini-batch randomized block coordinate descent (MRBCD) method, which estimates the partial gradient of the selected block based on a mini-batch of randomly sampled data in each iteration. We further accelerate the MRBCD method by exploiting the semi-stochastic optimization scheme, which effectively reduces the variance of the partial gradient estimators. Theoretically, we show that for strongly convex functions, the MRBCD method attains lower overall iteration complexity than existing RBCD methods. As an application, we further tailor the MRBCD method to solve regularized sparse learning problems. 
Our numerical experiments show that the MRBCD method naturally exploits the sparsity structure and achieves better computational performance than existing methods.\n\n*Both authors contributed equally.\n\n1 Introduction\n\nBig data analysis challenges both statistics and computation. In the past decade, researchers have developed a large family of sparse regularized M-estimators, such as Sparse Linear Regression [16, 23], Group Sparse Linear Regression [21], Sparse Logistic Regression [9], Sparse Support Vector Machine [22, 18], etc. These estimators are usually formulated as regularized empirical risk minimization problems in the following generic form [10],\n\n$$\\widehat{\\theta} = \\mathop{\\mathrm{argmin}}_{\\theta} \\mathcal{P}(\\theta) = \\mathop{\\mathrm{argmin}}_{\\theta} \\mathcal{F}(\\theta) + \\mathcal{R}(\\theta), \\qquad (1.1)$$\n\nwhere $\\theta$ is the parameter of the working model. Here we assume the empirical risk function $\\mathcal{F}(\\theta)$ is smooth, and the regularization function $\\mathcal{R}(\\theta)$ is non-differentiable. Some first order algorithms, mostly variants of proximal gradient methods [11], have been proposed for solving (1.1). For strongly convex $\\mathcal{P}(\\theta)$, these methods achieve linear rates of convergence [1].\n\nThe proximal gradient methods, though simple, are not necessarily efficient for large problems. Note that the empirical risk function $\\mathcal{F}(\\theta)$ is usually composed of many smooth component functions:\n\n$$\\mathcal{F}(\\theta) = \\frac{1}{n} \\sum_{i=1}^{n} f_i(\\theta) \\quad \\text{and} \\quad \\nabla \\mathcal{F}(\\theta) = \\frac{1}{n} \\sum_{i=1}^{n} \\nabla f_i(\\theta),$$\n\nwhere each $f_i$ is associated with a few samples of the whole data set. Since the proximal gradient methods need to calculate the gradient of $\\mathcal{F}$ in every iteration, the computational complexity scales linearly with the sample size (or the number of component functions). 
Thus the overall computation can be expensive, especially when the sample size is very large in such a \u201cbatch\u201d setting [15].\n\nTo overcome the above drawback, recent work has focused on stochastic proximal gradient (SPG) methods, which exploit the additive nature of the empirical risk function $\\mathcal{F}(\\theta)$. In particular, the SPG methods randomly sample only a few $f_i$\u2019s to estimate the gradient $\\nabla \\mathcal{F}(\\theta)$, i.e., given an index set $\\mathcal{B}$, also known as a mini-batch [15], where all elements are independently sampled from $\\{1, ..., n\\}$ with replacement, we consider a gradient estimator $\\frac{1}{|\\mathcal{B}|} \\sum_{i \\in \\mathcal{B}} \\nabla f_i(\\theta)$. Thus calculating such a \u201cstochastic\u201d gradient can be far less expensive than calculating the exact gradient required by the proximal gradient methods within each iteration. Existing literature has established global convergence results for the stochastic proximal gradient methods [3, 7] based on the unbiasedness of the gradient estimator, i.e.,\n\n$$\\mathbb{E}_{\\mathcal{B}} \\left[ \\frac{1}{|\\mathcal{B}|} \\sum_{i \\in \\mathcal{B}} \\nabla f_i(\\theta) \\right] = \\nabla \\mathcal{F}(\\theta) \\quad \\text{for all } \\theta \\in \\mathbb{R}^d.$$\n\nHowever, owing to the variance of the gradient estimator introduced by the stochastic sampling, SPG methods only achieve sublinear rates of convergence even when $\\mathcal{P}(\\theta)$ is strongly convex [3, 7].\n\nA second line of research has focused on randomized block coordinate descent (RBCD) methods. These methods exploit the block separability of the regularization function $\\mathcal{R}$, i.e., given a partition $\\{\\mathcal{G}_1, ..., \\mathcal{G}_k\\}$ of $d$ coordinates, we use $v_{\\mathcal{G}_j}$ to denote the subvector of $v$ with all indices in $\\mathcal{G}_j$, and then we can write\n\n$$\\mathcal{R}(\\theta) = \\sum_{j=1}^{k} r_j(\\theta_{\\mathcal{G}_j}) \\quad \\text{with} \\quad \\theta = (\\theta_{\\mathcal{G}_1}^T, ..., \\theta_{\\mathcal{G}_k}^T)^T.$$\n\nAccordingly, the RBCD methods randomly select a block of coordinates in each iteration, and then only calculate the gradient of $\\mathcal{F}$ with respect to the selected block [14, 12]. 
Since the variance introduced by the block selection asymptotically goes to zero, the RBCD methods also attain linear rates of convergence when $\\mathcal{P}(\\theta)$ is strongly convex. For sparse learning problems, the RBCD methods have a natural advantage over the proximal gradient methods. Because many blocks of coordinates stay at zero throughout most iterations, we can integrate the active set strategy into the computation. The active set strategy maintains and only iterates over a small subset of all blocks [2], which greatly boosts the computational performance. Recent work has corroborated the empirical advantage of RBCD methods over the proximal gradient method [4, 19, 8]. The RBCD methods, however, still require that all component functions be accessible within every iteration so that the partial gradient can be exactly obtained.\n\nTo address this issue, we propose a stochastic variant of the RBCD methods, which shares the advantages of both the SPG and RBCD methods. More specifically, we randomly select a block of coordinates in each iteration, and estimate the corresponding partial gradient based on a mini-batch of $f_i$\u2019s sampled from all component functions. To address the variance introduced by stochastic sampling, we exploit the semi-stochastic optimization scheme proposed in [5, 6]. The semi-stochastic optimization scheme contains two nested loops: for each iteration of the outer loop, we calculate an exact gradient; then in the follow-up inner loop, we adjust all estimated partial gradients by the obtained exact gradient. Such a modification, though simple, has a profound impact: the amortized computational complexity in each iteration is similar to that of stochastic optimization, but the rate of convergence is not compromised. Theoretically, we show that when $\\mathcal{P}(\\theta)$ is strongly convex, the MRBCD method attains better overall iteration complexity than existing RBCD methods. 
We then apply the MRBCD method combined with the active set strategy to solve regularized sparse learning problems. Our numerical experiments show that the MRBCD method achieves much better computational performance than existing methods.\n\nA closely related method is the stochastic proximal variance reduced gradient method proposed in [20]. Their method is a variant of the stochastic proximal gradient methods using the same semi-stochastic optimization scheme as ours, but it inherits the same drawback as the proximal gradient method, and does not fully exploit the underlying sparsity structure for large sparse learning problems. We will compare its computational performance with the MRBCD method in numerical experiments. Note that their method can be viewed as a special case of the MRBCD method with one single block.\n\nWhile this paper was under review, we learnt that a similar method was independently proposed by [17]. They also apply the variance reduction technique to the randomized block coordinate descent method, and obtain theoretical results similar to ours.\n\n2 Notations and Assumptions\n\nGiven a vector $v = (v_1, ..., v_d)^T \\in \\mathbb{R}^d$, we define vector norms: $||v||_1 = \\sum_j |v_j|$, $||v||^2 = \\sum_j v_j^2$, and $||v||_\\infty = \\max_j |v_j|$. Let $\\{\\mathcal{G}_1, ..., \\mathcal{G}_k\\}$ be a partition of all $d$ coordinates with $|\\mathcal{G}_j| = p_j$ and $\\sum_{j=1}^{k} p_j = d$. We use $v_{\\mathcal{G}_j}$ to denote the subvector of $v$ with all indices in $\\mathcal{G}_j$, and $v_{\\setminus \\mathcal{G}_j}$ to denote the subvector of $v$ with all indices in $\\mathcal{G}_j$ removed.\n\nThroughout the rest of the paper, if not specified, we make the following assumptions on $\\mathcal{P}(\\theta)$.\n\nAssumption 2.1. Each $f_i(\\theta)$ is convex and differentiable. 
Given the partition $\\{\\mathcal{G}_1, ..., \\mathcal{G}_k\\}$, all $\\nabla_{\\mathcal{G}_j} f_i(\\theta) = [\\nabla f_i(\\theta)]_{\\mathcal{G}_j}$\u2019s are Lipschitz continuous, i.e., there exists a positive constant $L_{\\max}$ such that for all $\\theta, \\theta' \\in \\mathbb{R}^d$ with $\\theta_{\\mathcal{G}_j} \\neq \\theta'_{\\mathcal{G}_j}$, we have\n\n$$||\\nabla_{\\mathcal{G}_j} f_i(\\theta) - \\nabla_{\\mathcal{G}_j} f_i(\\theta')|| \\leq L_{\\max} ||\\theta_{\\mathcal{G}_j} - \\theta'_{\\mathcal{G}_j}||.$$\n\nMoreover, $\\nabla f_i(\\theta)$ is Lipschitz continuous, i.e., there exists a positive constant $T_{\\max}$ such that for all $\\theta, \\theta' \\in \\mathbb{R}^d$ with $\\theta \\neq \\theta'$, we have\n\n$$||\\nabla f_i(\\theta) - \\nabla f_i(\\theta')|| \\leq T_{\\max} ||\\theta - \\theta'||.$$\n\nAssumption 2.1 also implies that $\\nabla \\mathcal{F}(\\theta)$ is Lipschitz continuous, and given the tightest $T_{\\max}$ and $L_{\\max}$ in Assumption 2.1, we have $T_{\\max} \\leq k L_{\\max}$.\n\nAssumption 2.2. $\\mathcal{F}(\\theta)$ is strongly convex, i.e., there exists a positive constant $\\mu$ such that for all $\\theta$ and $\\theta'$,\n\n$$\\mathcal{F}(\\theta') \\geq \\mathcal{F}(\\theta) + \\nabla \\mathcal{F}(\\theta)^T (\\theta' - \\theta) + \\frac{\\mu}{2} ||\\theta' - \\theta||^2.$$\n\nNote that Assumption 2.2 also implies that $\\mathcal{P}(\\theta)$ is strongly convex.\n\nAssumption 2.3. $\\mathcal{R}(\\theta)$ is a simple convex nonsmooth function such that given some positive constant $\\eta$, we can obtain a closed form solution to the following optimization problem,\n\n$$\\mathcal{T}^j_{\\eta}(\\theta'_{\\mathcal{G}_j}) = \\mathop{\\mathrm{argmin}}_{\\theta_{\\mathcal{G}_j} \\in \\mathbb{R}^{p_j}} \\frac{1}{2\\eta} ||\\theta_{\\mathcal{G}_j} - \\theta'_{\\mathcal{G}_j}||^2 + r_j(\\theta_{\\mathcal{G}_j}).$$\n\nAssumptions 2.1-2.3 are satisfied by many popular regularized empirical risk minimization problems. We give some examples in the experiments section.\n\n3 Method\n\nThe MRBCD method is doubly stochastic, in the sense that we not only randomly select a block of coordinates, but also randomly sample a mini-batch of component functions from all $f_i$\u2019s. The partial gradient of the selected block is estimated based on the selected component functions, which yields a much lower computational complexity than existing RBCD methods in each iteration.\n\nA naive implementation of the MRBCD method is summarized in Algorithm 1. 
Since the variance introduced by stochastic sampling over component functions does not go to zero as the number of iterations increases, we have to choose a sequence of diminishing step sizes (e.g., $\\eta_t = \\mu^{-1} t^{-1}$) to ensure convergence. When $t$ is large, we only gain very limited descent in each iteration. Thus the MRBCD-I method can only attain a sublinear rate of convergence.\n\nAlgorithm 1 Mini-batch Randomized Block Coordinate Descent Method-I: A Naive Implementation. The stochastic sampling over component functions introduces variance to the partial gradient estimator. To ensure convergence, we adopt a sequence of diminishing step sizes, which eventually leads to sublinear rates of convergence.\n\nParameter: Step size $\\eta_t$\nInitialize: $\\theta^{(0)}$\nFor $t = 1, 2, ...$\n    Randomly sample a mini-batch $\\mathcal{B}$ from $\\{1, ..., n\\}$ with equal probability\n    Randomly sample $j$ from $\\{1, ..., k\\}$ with equal probability\n    $\\theta^{(t)}_{\\mathcal{G}_j} \\leftarrow \\mathcal{T}^j_{\\eta_t}\\left(\\theta^{(t-1)}_{\\mathcal{G}_j} - \\eta_t \\nabla_{\\mathcal{G}_j} f_{\\mathcal{B}}(\\theta^{(t-1)})\\right)$, $\\theta^{(t)}_{\\setminus \\mathcal{G}_j} \\leftarrow \\theta^{(t-1)}_{\\setminus \\mathcal{G}_j}$\nEnd for\n\n3.1 MRBCD with Variance Reduction\n\nA recent line of work shows how to reduce the variance in the gradient estimation without deteriorating rates of convergence using a semi-stochastic optimization scheme [5, 6]. The semi-stochastic optimization contains two nested loops: in each iteration of the outer loop, we calculate an exact gradient; then within the follow-up inner loop, we use the obtained exact gradient to adjust all estimated partial gradients. These adjustments can guarantee that the variance introduced by stochastic sampling over component functions asymptotically goes to zero (see [5]).\n\nAlgorithm 2 Mini-batch Randomized Block Coordinate Descent Method-II: MRBCD + Variance Reduction. We periodically calculate the exact gradient at the beginning of each outer loop, and then use the obtained exact gradient to adjust all follow-up estimated partial gradients. 
These adjustments guarantee that the variance introduced by stochastic sampling over component functions asymptotically goes to zero, and help the MRBCD-II method attain linear rates of convergence.\n\nParameter: update frequency $m$ and step size $\\eta$\n\nThe MRBCD method with variance reduction is summarized in Algorithm 2. In the next section, we will show that the MRBCD-II method attains linear rates of convergence, and the amortized computational complexity within each iteration is almost the same as that of the MRBCD-I method.\n\nRemark 3.1. Another option for variance reduction is the stochastic averaging scheme proposed in [13], which stores the gradients of the most recently subsampled component functions. But the MRBCD method iterates randomly over different blocks of coordinates, which makes the stochastic averaging scheme inapplicable.\n\n3.2 MRBCD with Variance Reduction and Active Set Strategy\n\nWhen applying the MRBCD-II method to regularized sparse learning problems, we further incorporate the active set strategy to boost the empirical performance. Different from existing RBCD methods, which usually identify the active set by cyclic search, we exploit a proximal gradient pilot to identify the active set. More specifically, within each iteration of the outer loop, we conduct a proximal gradient descent step, and select the support of the resulting solution as the active set. This is very natural to the MRBCD-II method. 
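Because at the beginning of each outer loop we always calculate an exact gradient, delivering a proximal gradient pilot will not introduce much additional computational cost. Once the active set is identified, all randomized block coordinate descent steps within the follow-up inner loop only iterate over blocks of coordinates in the active set.\n\nSteps of Algorithm 2 (MRBCD-II):\n\nInitialize: $\\widetilde{\\theta}^{(0)}$\nFor $s = 1, 2, ...$\n    $\\widetilde{\\theta} \\leftarrow \\widetilde{\\theta}^{(s-1)}$, $\\widetilde{\\mu} \\leftarrow \\nabla \\mathcal{F}(\\widetilde{\\theta}^{(s-1)})$, $\\theta^{(0)} \\leftarrow \\widetilde{\\theta}^{(s-1)}$\n    For $t = 1, 2, ..., m$\n        Randomly sample a mini-batch $\\mathcal{B}$ from $\\{1, ..., n\\}$ with equal probability\n        Randomly sample $j$ from $\\{1, ..., k\\}$ with equal probability\n        $\\theta^{(t)}_{\\mathcal{G}_j} \\leftarrow \\mathcal{T}^j_{\\eta}\\left(\\theta^{(t-1)}_{\\mathcal{G}_j} - \\eta \\left[\\nabla_{\\mathcal{G}_j} f_{\\mathcal{B}}(\\theta^{(t-1)}) - \\nabla_{\\mathcal{G}_j} f_{\\mathcal{B}}(\\widetilde{\\theta}) + \\widetilde{\\mu}_{\\mathcal{G}_j}\\right]\\right)$, $\\theta^{(t)}_{\\setminus \\mathcal{G}_j} \\leftarrow \\theta^{(t-1)}_{\\setminus \\mathcal{G}_j}$\n    End for\n    $\\widetilde{\\theta}^{(s)} \\leftarrow \\frac{1}{m} \\sum_{l=1}^{m} \\theta^{(l)}$\nEnd for\n\nSteps of Algorithm 3 (MRBCD-III):\n\nInitialize: $\\widetilde{\\theta}^{(0)}$\nFor $s = 1, 2, ...$\n    $\\widetilde{\\theta} \\leftarrow \\widetilde{\\theta}^{(s-1)}$, $\\widetilde{\\mu} \\leftarrow \\nabla \\mathcal{F}(\\widetilde{\\theta}^{(s-1)})$\n    For $j = 1, 2, ..., k$\n        $\\theta^{(0)}_{\\mathcal{G}_j} \\leftarrow \\mathcal{T}^j_{\\eta/k}\\left(\\widetilde{\\theta}_{\\mathcal{G}_j} - \\eta \\widetilde{\\mu}_{\\mathcal{G}_j}/k\\right)$\n    End for\n    $\\mathcal{A} \\leftarrow \\{ j \\mid \\theta^{(0)}_{\\mathcal{G}_j} \\neq 0 \\}$, $|\\mathcal{B}| = |\\mathcal{A}|$\n    For $t = 1, 2, ..., m|\\mathcal{A}|/k$\n        Randomly sample a mini-batch $\\mathcal{B}$ from $\\{1, ..., n\\}$ with equal probability\n        Randomly sample $j$ from $\\mathcal{A}$ with equal probability\n        $\\theta^{(t)}_{\\mathcal{G}_j} \\leftarrow \\mathcal{T}^j_{\\eta}\\left(\\theta^{(t-1)}_{\\mathcal{G}_j} - \\eta \\left[\\nabla_{\\mathcal{G}_j} f_{\\mathcal{B}}(\\theta^{(t-1)}) - \\nabla_{\\mathcal{G}_j} f_{\\mathcal{B}}(\\widetilde{\\theta}) + \\widetilde{\\mu}_{\\mathcal{G}_j}\\right]\\right)$, $\\theta^{(t)}_{\\setminus \\mathcal{G}_j} \\leftarrow \\theta^{(t-1)}_{\\setminus \\mathcal{G}_j}$\n    End for\n    $\\widetilde{\\theta}^{(s)} \\leftarrow \\frac{1}{m} \\sum_{l=1}^{m} \\theta^{(l)}$\nEnd for\n\nAlgorithm 3 Mini-batch Randomized Block Coordinate Descent Method-III: MRBCD with Variance Reduction and Active Set. To fully take advantage of the obtained exact gradient, we adopt a proximal gradient pilot $\\theta^{(0)}$ to identify the active set at each iteration of the outer loop. 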
Then all randomized coordinate descent steps within the follow-up inner loop only iterate over blocks of coordinates in the active set.\n\nParameter: update frequency $m$ and step size $\\eta$\n\nThe MRBCD method with variance reduction and active set strategy is summarized in Algorithm 3. Since we integrate the active set into the computation, $|\\mathcal{A}|$ successive coordinate descent iterations in MRBCD-III will have similar performance to $k$ iterations in MRBCD-II. Therefore we change the maximum number of iterations within each inner loop to $|\\mathcal{A}| m / k$. Moreover, since the support is only $|\\mathcal{A}|$ blocks of coordinates, we only need to take $|\\mathcal{B}| = |\\mathcal{A}|$ to guarantee sufficient variance reduction. These modifications further boost the computational performance of MRBCD-III.\n\nRemark 3.2. The exact gradient can also be used to determine the convergence of the MRBCD-III method. We terminate the iteration when the approximate KKT condition is satisfied, $\\min_{\\xi \\in \\partial \\mathcal{R}(\\widetilde{\\theta})} ||\\widetilde{\\mu} + \\xi|| \\leq \\varepsilon$, where $\\varepsilon$ is a positive preset convergence parameter. Since evaluating whether the approximate KKT condition holds is based on the exact gradient obtained at each iteration of the outer loop, it does not introduce much additional computational cost, either.\n\n4 Theory\n\nBefore we proceed with our main results for the MRBCD-II method, we first introduce an important lemma for controlling the variance introduced by stochastic sampling.\n\nLemma 4.1. Let $\\mathcal{B}$ be a mini-batch sampled from $\\{1, ..., n\\}$. Define $v_{\\mathcal{B}} = \\frac{1}{|\\mathcal{B}|} \\sum_{i \\in \\mathcal{B}} \\nabla f_i(\\theta^{(t-1)}) - \\frac{1}{|\\mathcal{B}|} \\sum_{i \\in \\mathcal{B}} \\nabla f_i(\\widetilde{\\theta}) + \\widetilde{\\mu}$. Conditioning on $\\theta^{(t-1)}$, we have $\\mathbb{E}_{\\mathcal{B}} v_{\\mathcal{B}} = \\nabla \\mathcal{F}(\\theta^{(t-1)})$ and\n\n$$\\mathbb{E}_{\\mathcal{B}} ||v_{\\mathcal{B}} - \\nabla \\mathcal{F}(\\theta^{(t-1)})||^2 \\leq \\frac{4 T_{\\max}}{|\\mathcal{B}|} \\left[ \\mathcal{P}(\\theta^{(t-1)}) - \\mathcal{P}(\\widehat{\\theta}) + \\mathcal{P}(\\widetilde{\\theta}) - \\mathcal{P}(\\widehat{\\theta}) \\right].$$\n\nThe proof of Lemma 4.1 is provided in Appendix A. Lemma 4.1 guarantees that $v_{\\mathcal{B}}$ is an unbiased estimator of $\\nabla \\mathcal{F}(\\theta^{(t-1)})$, and its variance is bounded by the objective value gap. 
Therefore we do not need to choose a sequence of diminishing step sizes to reduce the variance.\n\n4.1 Strongly Convex Functions\n\nWe then present the concrete rates of convergence of MRBCD-II when $\\mathcal{P}$ is strongly convex.\n\nTheorem 4.2. Suppose that Assumptions 2.1-2.3 hold. Let $\\widetilde{\\theta}^{(s)}$ be a random point generated by the MRBCD-II method in Algorithm 2. Given a large enough batch $\\mathcal{B}$ and a small enough learning rate $\\eta$ such that $|\\mathcal{B}| \\geq T_{\\max}/L_{\\max}$ and $\\eta < L_{\\max}^{-1}/4$, we have\n\n$$\\mathbb{E} \\mathcal{P}(\\widetilde{\\theta}^{(s)}) - \\mathcal{P}(\\widehat{\\theta}) \\leq \\left( \\frac{k}{\\mu \\eta (1 - 4 \\eta L_{\\max}) m} + \\frac{4 \\eta L_{\\max} (m + 1)}{(1 - 4 \\eta L_{\\max}) m} \\right)^s [\\mathcal{P}(\\widetilde{\\theta}^{(0)}) - \\mathcal{P}(\\widehat{\\theta})].$$\n\nHere we only present a sketch. The detailed proof is provided in Appendix B. The expected successive descent of the objective value is composed of two terms: the first one is the same as the expected successive descent of the \u201cbatch\u201d RBCD methods; the second one is the variance introduced by the stochastic sampling. The descent term can be bounded by taking the average of the successive descent over all blocks of coordinates. The variance term can be bounded using Lemma 4.1. The mini-batch sampling and adjustments of $\\widetilde{\\mu}$\u2019s guarantee that the variance asymptotically goes to zero at a proper scale. By taking expectation over the randomness of component functions and blocks of coordinates throughout all iterations, we derive a geometric rate of convergence.\n\nThe next corollary presents the concrete iteration complexity of the MRBCD-II method.\n\nCorollary 4.3. Suppose that Assumptions 2.1-2.3 hold. Let $|\\mathcal{B}| = T_{\\max}/L_{\\max}$, $m = 65 k L_{\\max}/\\mu$, and $\\eta = L_{\\max}^{-1}/16$. Given the target accuracy $\\epsilon$ and some $\\rho \\in (0, 1)$, for any\n\n$$s \\geq 3 \\log([\\mathcal{P}(\\widetilde{\\theta}^{(0)}) - \\mathcal{P}(\\widehat{\\theta})]/\\rho) + 3 \\log(1/\\epsilon),$$\n\nwe have $\\mathcal{P}(\\widetilde{\\theta}^{(s)}) - \\mathcal{P}(\\widehat{\\theta}) \\leq \\epsilon$ with probability at least $1 - \\rho$.\n\nCorollary 4.3 is a direct result of Theorem 4.2 and the Markov inequality. 
The detailed proof is provided in Appendix C.\n\nTo characterize the overall iteration complexity, we count the number of partial gradients we estimate. In each iteration of the outer loop, we calculate an exact gradient. Thus the number of estimated partial gradients is $O(nk)$. Within each iteration of the inner loop ($m$ in total), we estimate the partial gradients based on a mini-batch $\\mathcal{B}$. Thus the number of estimated partial gradients is $O(m|\\mathcal{B}|)$. If we choose $\\eta$, $m$, and $\\mathcal{B}$ as in Corollary 4.3 and consider $\\rho$ as a constant, then the iteration complexity of the MRBCD-II method with respect to the number of estimated partial gradients is\n\n$$O\\left((nk + k T_{\\max}/\\mu) \\cdot \\log(1/\\epsilon)\\right),$$\n\nwhich is much lower than that of existing \u201cbatch\u201d RBCD methods, $O(nk L_{\\max}/\\mu \\cdot \\log(1/\\epsilon))$.\n\nRemark 4.4 (Connection to the MRBCD-III method). There still exists a gap between the theory and the empirical success of the active set strategy in existing literature, even for the \u201cbatch\u201d RBCD methods. When incorporating the active set strategy into RBCD-style methods, it is known that the empirical performance can be greatly boosted. How to exactly characterize the theoretical speed-up is still largely unknown. Therefore Theorem 4.2 and Corollary 4.3 can only serve as an imprecise characterization of the MRBCD-III method. A rough understanding is that if the solution has at most $q$ nonzero entries throughout all iterations, then the MRBCD-III method should have an approximate overall iteration complexity\n\n$$O\\left((nk + q T_{\\max}/\\mu) \\cdot \\log(1/\\epsilon)\\right).$$\n\n4.2 Nonstrongly Convex Functions\n\nWhen $\\mathcal{P}(\\theta)$ is not strongly convex, we can adopt a perturbation approach. Instead of solving (1.1), we consider the following minimization problem,\n\n$$\\tilde{\\theta} = \\mathop{\\mathrm{argmin}}_{\\theta \\in \\mathbb{R}^d} \\mathcal{F}(\\theta) + \\gamma ||\\theta^{(0)} - \\theta||^2 + \\mathcal{R}(\\theta), \\qquad (4.1)$$\n\nwhere $\\gamma$ is some positive perturbation parameter, and $\\theta^{(0)}$ is the initial value. 
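If we consider $\\widetilde{\\mathcal{F}}(\\theta) = \\mathcal{F}(\\theta) + \\gamma ||\\theta^{(0)} - \\theta||^2$ in (4.1) as the smooth empirical risk function, then $\\widetilde{\\mathcal{F}}(\\theta)$ is a strongly convex function. Thus Corollary 4.3 can be applied to (4.1): when $\\mathcal{B}$, $m$, $\\eta$, and $\\rho$ are suitably chosen, given\n\n$$s \\geq 3 \\log([\\mathcal{P}(\\theta^{(0)}) - \\mathcal{P}(\\tilde{\\theta}) - \\gamma ||\\theta^{(0)} - \\tilde{\\theta}||^2]/\\rho) + 3 \\log(2/\\epsilon),$$\n\nwe have $\\mathcal{P}(\\widetilde{\\theta}^{(s)}) - \\mathcal{P}(\\tilde{\\theta}) - \\gamma ||\\theta^{(0)} - \\tilde{\\theta}||^2 \\leq \\epsilon/2$ with probability at least $1 - \\rho$. We then have\n\n$$\\begin{aligned} \\mathcal{P}(\\widetilde{\\theta}^{(s)}) - \\mathcal{P}(\\widehat{\\theta}) &\\leq \\mathcal{P}(\\widetilde{\\theta}^{(s)}) - \\mathcal{P}(\\widehat{\\theta}) - \\gamma ||\\theta^{(0)} - \\widehat{\\theta}||^2 + \\gamma ||\\theta^{(0)} - \\widehat{\\theta}||^2 \\\\ &\\leq \\mathcal{P}(\\widetilde{\\theta}^{(s)}) - \\mathcal{P}(\\tilde{\\theta}) - \\gamma ||\\theta^{(0)} - \\tilde{\\theta}||^2 + \\gamma ||\\theta^{(0)} - \\widehat{\\theta}||^2 \\leq \\epsilon/2 + \\gamma ||\\theta^{(0)} - \\widehat{\\theta}||^2, \\end{aligned}$$\n\nwhere the second inequality comes from the fact that $\\mathcal{P}(\\tilde{\\theta}) + \\gamma ||\\theta^{(0)} - \\tilde{\\theta}||^2 \\leq \\mathcal{P}(\\widehat{\\theta}) + \\gamma ||\\theta^{(0)} - \\widehat{\\theta}||^2$, because $\\tilde{\\theta}$ is the minimizer of (4.1). If we choose $\\gamma = \\epsilon/(2 ||\\theta^{(0)} - \\widehat{\\theta}||^2)$, we have $\\mathcal{P}(\\widetilde{\\theta}^{(s)}) - \\mathcal{P}(\\widehat{\\theta}) \\leq \\epsilon$. Since $\\gamma$ depends on the desired accuracy $\\epsilon$, the number of estimated partial gradients also depends on $\\epsilon$. Thus if we consider $||\\theta^{(0)} - \\widehat{\\theta}||^2$ as a constant, then the overall iteration complexity of the perturbation approach becomes $O\\left((nk + k T_{\\max}/\\epsilon) \\cdot \\log(1/\\epsilon)\\right)$.\n\n5 Numerical Simulations\n\nThe first sparse learning problem of our interest is Lasso, which solves\n\n$$\\widehat{\\theta} = \\mathop{\\mathrm{argmin}}_{\\theta \\in \\mathbb{R}^d} \\frac{1}{n} \\sum_{i=1}^{n} f_i(\\theta) + \\lambda ||\\theta||_1 \\quad \\text{with} \\quad f_i = \\frac{1}{2} (y_i - x_i^T \\theta)^2. \\qquad (5.1)$$\n\nWe set $n = 2000$ and $d = 1000$, and all covariate vectors $x_i$\u2019s are independently sampled from a 1000-dimensional Gaussian distribution with mean 0 and covariance matrix $\\Sigma$, where $\\Sigma_{jj} = 1$ and $\\Sigma_{jk} = 0.5$ for all $k \\neq j$. The first 50 entries of the regression coefficient vector $\\theta$ are independently sampled from a uniform distribution over the support $(-2, -1) \\cup (+1, +2)$. 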
The responses $y_i$\u2019s are generated by the linear model $y_i = x_i^T \\theta + \\epsilon_i$, where all $\\epsilon_i$\u2019s are independently sampled from a standard Gaussian distribution $N(0, 1)$.\n\nWe choose $\\lambda = \\sqrt{\\log d / n}$, and compare the proposed MRBCD-I and MRBCD-II methods with the \u201cbatch\u201d proximal gradient (BPG) method [11], the stochastic proximal variance reduced gradient method (SPVRG) [20], and the \u201cbatch\u201d randomized block coordinate descent (BRBCD) method [12]. We set $k = 100$. All blocks are of the same size (10 coordinates). For BPG, the step size is $1/T$, where $T$ is the largest singular value of $\\frac{1}{n} \\sum_{i=1}^{n} x_i x_i^T$. For BRBCD, the step size is $1/L$, where $L$ is the maximum over the largest singular values of $\\frac{1}{n} \\sum_{i=1}^{n} [x_i]_{\\mathcal{G}_j}$ of all blocks. For SPVRG, we choose $m = n$, and the step size is $1/(4T)$. For MRBCD-I, the step size is $1/(L \\lceil t/8000 \\rceil)$, where $t$ is the iteration index. For MRBCD-II, we choose $m = n$, and the step size is $1/(4L)$. Note that the step size and number of iterations $m$ within each inner loop for MRBCD-II and SPVRG are tuned over a refined grid such that the best computational performance is obtained.\n\n[Figure 5.1, panel (a): objective value gap (log scale) versus number of partial gradient estimates ($\\times 10^6$) for MRBCD-II, SPVRG, BRBCD, BPG, and MRBCD-I. Panel (b): number of partial gradient estimates (log scale) versus regularization index for MRBCD-III, SPVRG, and BRBCD.]\n\n(a) Comparison between different methods for a single regularization parameter.\n\n(b) Comparison between different methods for a sequence of regularization parameters.\n\nFigure 5.1: [a] The vertical 
axis corresponds to objective value gaps $\\mathcal{P}(\\theta) - \\mathcal{P}(\\widehat{\\theta})$ in log scale. The horizontal axis corresponds to numbers of partial gradient estimates. [b] The horizontal axis corresponds to indices of regularization parameters. The vertical axis corresponds to numbers of partial gradient estimates in log scale. We see that MRBCD attains the best performance among all methods for both settings.\n\nWe evaluate the computational performance by the number of estimated partial gradients, and the results averaged over 100 replications are presented in Figure 5.1 [a]. As can be seen, MRBCD-II outperforms SPVRG, and attains the best performance among all methods. The BRBCD and BPG perform worse than MRBCD-II and SPVRG due to high computational complexity within each iteration. MRBCD-I is actually the fastest among all methods in the first few iterations, and then falls behind SPG and SPVRG due to its sublinear rate of convergence.\n\nWe then compare the proposed MRBCD-III method with SPVRG and BRBCD for a sequence of regularization parameters. The sequence contains 21 regularization parameters $\\{\\lambda_0, ..., \\lambda_{20}\\}$. We set $\\lambda_0 = ||\\frac{1}{n} \\sum_i y_i x_i||_\\infty$, which yields a null solution (all entries are zero), and $\\lambda_{20} = \\sqrt{\\log d / n}$. For $K = 1, ..., 19$, we set $\\lambda_K = \\alpha \\lambda_{K-1}$, where $\\alpha = (\\lambda_{20}/\\lambda_0)^{1/20}$. When solving (5.1) with respect to $\\lambda_K$, we use the output solution for $\\lambda_{K-1}$ as the initial solution. The above setting is often referred to as the warm start scheme in existing literature, and it is very natural to sparse learning problems, since we always need to tune the regularization parameter to secure good finite sample performance. For each regularization parameter, the algorithm terminates the iteration when the approximate KKT condition is satisfied with $\\epsilon = 10^{-10}$.\n\nThe results over 50 replications are presented in Figure 5.1 [b]. As can be seen, MRBCD-III outperforms SPVRG and BRBCD, and attains the best performance among all methods. 
Since BRBCD is also combined with the active set strategy, it attains better performance than SPVRG. See more detailed results in Table E.1 in Appendix E.\n\n6 Real Data Example\n\nThe second sparse learning problem is the elastic-net regularized logistic regression, which solves\n\n$$\\widehat{\\theta} = \\mathop{\\mathrm{argmin}}_{\\theta \\in \\mathbb{R}^d} \\frac{1}{n} \\sum_{i=1}^{n} f_i(\\theta) + \\lambda_1 ||\\theta||_1 \\quad \\text{with} \\quad f_i = \\log(1 + \\exp(-y_i x_i^T \\theta)) + \\frac{\\lambda_2}{2} ||\\theta||^2.$$\n\nWe adopt the rcv1 dataset with $n = 20242$ and $d = 47236$. We set $k = 200$, and each block contains approximately 237 coordinates.\n\nWe choose $\\lambda_2 = 10^{-4}$ and $\\lambda_1 = 10^{-4}$, and compare MRBCD-II with SPVRG and BRBCD. For BRBCD, the step size is $1/(4L)$, where $L$ is the maximum of the largest singular values of $\\frac{1}{n} \\sum_{i=1}^{n} [x_i]_{\\mathcal{G}_j}$ over all blocks. For SPVRG, $m = n$ and the step size is $1/(16T)$, where $T$ is the largest singular value of $\\frac{1}{n} \\sum_{i=1}^{n} x_i x_i^T$. For MRBCD-II, $m = n$ and the step size is $1/(16T)$. For BRBCD, the step size is $1/(4L)$, where $L = \\frac{1}{n} \\max_j \\sum_{i=1}^{n} [x_i]_j^2$. Note that the step size and number of iterations $m$ within each inner loop for MRBCD-II and SPVRG are tuned over a refined grid such that the best computational performance is obtained.\n\nThe results averaged over 30 replications are presented in Figure F.1 [a] of Appendix F. As can be seen, MRBCD-II outperforms SPVRG, and attains the best performance among all methods. The BRBCD performs worse than MRBCD-II and SPVRG due to high computational complexity within each iteration.\n\nWe then compare the proposed MRBCD-III method with SPVRG and BRBCD for a sequence of regularization parameters. The sequence contains 11 regularization parameters $\\{\\lambda_0, ..., \\lambda_{10}\\}$. We set $\\lambda_0 = ||\\frac{1}{n} \\sum_i \\nabla f_i(0)||_\\infty$, which yields a null solution (all entries are zero), and $\\lambda_{10} = 10^{-4}$. For $K = 1, ..., 9$, we set $\\lambda_K = \\alpha \\lambda_{K-1}$, where $\\alpha = (\\lambda_{10}/\\lambda_0)^{1/10}$. 
For each regularization parameter, we set $\\epsilon = 10^{-7}$ for the approximate KKT condition.\n\nThe results over 30 replications are presented in Figure F.1 [b] of Appendix F. As can be seen, MRBCD-III outperforms SPVRG and BRBCD, and attains the best performance among all methods. Since BRBCD is also combined with the active set strategy, it attains better performance than SPVRG.\n\nAcknowledgements This work is partially supported by the grants NSF IIS1408910, NSF IIS1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841. Yu is supported by China Scholarship Council and by NSFC 61173073.\n\nReferences\n\n[1] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n\n[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2009.\n\n[3] John Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. The Journal of Machine Learning Research, 10:2899\u20132934, 2009.\n\n[4] Jerome Friedman, Trevor Hastie, Holger H\u00f6fling, and Robert Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302\u2013332, 2007.\n\n[5] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315\u2013323, 2013.\n\n[6] Jakub Kone\u010dn\u00fd and Peter Richt\u00e1rik. Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666, 2013.\n\n[7] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777\u2013801, 2009.\n\n[8] Han Liu, Mark Palatucci, and Jian Zhang. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. 
In Proceedings of the 26th Annual International Conference on Machine Learning, pages 649\u2013656, 2009.\n\n[9] L. Meier, S. Van De Geer, and P. B\u00fchlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B, 70(1):53\u201371, 2008.\n\n[10] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science, 27(4):538\u2013557, 2012.\n\n[11] Yu Nesterov. Gradient methods for minimizing composite objective function. Technical report, Universit\u00e9 catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2007.\n\n[12] Peter Richt\u00e1rik and Martin Tak\u00e1\u010d. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, pages 1\u201338, 2012.\n\n[13] Nicolas Le Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2672\u20132680, 2012.\n\n[14] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for \u21131-regularized loss minimization. The Journal of Machine Learning Research, 12:1865\u20131892, 2011.\n\n[15] Suvrit Sra, Sebastian Nowozin, and Stephen J Wright. Optimization for machine learning. MIT Press, 2012.\n\n[16] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267\u2013288, 1996.\n\n[17] Huahua Wang and Arindam Banerjee. Randomized block coordinate descent for online and stochastic optimization. CoRR, abs/1407.0107, 2014.\n\n[18] Li Wang, Ji Zhu, and Hui Zou. The doubly regularized support vector machine. Statistica Sinica, 16(2):589, 2006.\n\n[19] Tong Tong Wu and Kenneth Lange. Coordinate descent algorithms for lasso penalized regression. 
The Annals of Applied Statistics, 2:224\u2013244, 2008.\n\n[20] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. arXiv preprint arXiv:1403.4699, 2014.\n\n[21] Ming Yuan and Yi Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19\u201335, 2007.\n\n[22] Ji Zhu, Saharon Rosset, Trevor Hastie, and Robert Tibshirani. 1-norm support vector machines. In NIPS, volume 15, pages 49\u201356, 2003.\n\n[23] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301\u2013320, 2005.", "award": [], "sourceid": 1705, "authors": [{"given_name": "Tuo", "family_name": "Zhao", "institution": "Princeton University and Johns Hopkins University"}, {"given_name": "Mo", "family_name": "Yu", "institution": "Johns Hopkins University"}, {"given_name": "Yiming", "family_name": "Wang", "institution": "Johns Hopkins University"}, {"given_name": "Raman", "family_name": "Arora", "institution": "Johns Hopkins University"}, {"given_name": "Han", "family_name": "Liu", "institution": "Princeton University"}]}