{"title": "Sparse Linear Programming via Primal and Dual Augmented Coordinate Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 2368, "page_last": 2376, "abstract": "Over the past decades, Linear Programming (LP) has been widely used in different areas and considered as one of the mature technologies in numerical optimization. However, the complexity offered by state-of-the-art algorithms (i.e. interior-point method and primal, dual simplex methods) is still unsatisfactory for problems in machine learning with huge number of variables and constraints. In this paper, we investigate a general LP algorithm based on the combination of Augmented Lagrangian and Coordinate Descent (AL-CD), giving an iteration complexity of $O((\\log(1/\\epsilon))^2)$ with $O(nnz(A))$ cost per iteration, where $nnz(A)$ is the number of non-zeros in the $m\\times n$ constraint matrix $A$, and in practice, one can further reduce cost per iteration to the order of non-zeros in columns (rows) corresponding to the active primal (dual) variables through an active-set strategy. The algorithm thus yields a tractable alternative to standard LP methods for large-scale problems of sparse solutions and $nnz(A)\\ll mn$. We conduct experiments on large-scale LP instances from $\\ell_1$-regularized multi-class SVM, Sparse Inverse Covariance Estimation, and Nonnegative Matrix Factorization, where the proposed approach finds solutions of $10^{-3}$ precision orders of magnitude faster than state-of-the-art implementations of interior-point and simplex methods.", "full_text": "Sparse Linear Programming via Primal and Dual Augmented Coordinate Descent

Ian E.H. Yen*, Kai Zhong*, Cho-Jui Hsieh†, Pradeep Ravikumar*, Inderjit S. Dhillon*
* University of Texas at Austin; {ianyen,pradeepr,inderjit}@cs.utexas.edu, zhongkai@ices.utexas.edu
† University of California at Davis; chohsieh@ucdavis.edu

1 Introduction

Linear Programming (LP) has been studied since the early 19th century and has become one of the representative tools of numerical optimization, with wide applications in machine learning such as ℓ1-regularized SVM [1], MAP inference [2], nonnegative matrix factorization [3], exemplar-based clustering [4, 5], sparse inverse covariance estimation [6], and Markov Decision Processes [7]. However, as the demand for scalability keeps increasing, the scalability of existing LP solvers has become unsatisfactory.
In particular, most algorithms in machine learning targeting large-scale data have complexity linear in the data size [8, 9, 10], while the complexity of state-of-the-art LP solvers (i.e. the interior-point method and the primal and dual simplex methods) is still at least quadratic in the number of variables or constraints [11].
The quadratic complexity comes from the need to solve each linear system exactly in both the simplex and interior-point methods. In particular, the simplex method, when traversing from one corner point to another, requires the solution of a linear system whose dimension is linear in the number of variables or constraints, while in an interior-point method, finding the Newton direction requires solving a linear system of similar size. While there are sparse variants of LU and Cholesky decomposition that can exploit the sparsity pattern of the matrix in a linear system, the worst-case complexity of solving such a system is at least quadratic in the dimension except for very special cases such as tri-diagonal or band-structured matrices.

1 Our solver has been released here: http://www.cs.utexas.edu/~ianyen/LPsparse/

For the interior-point method (IPM), one remedy for the high complexity is to employ an iterative method such as Conjugate Gradient (CG) to solve each linear system inexactly. However, this can hardly tackle the ill-conditioned linear systems produced by IPM when the iterates approach the boundary of the constraints [12]. Though substantial research has been devoted to preconditioners that help iterative methods mitigate the effect of ill-conditioning [12, 13], creating a preconditioner of tractable size is a challenging problem in itself [13].
Most commercial LP software thus still relies on exact methods to solve the linear system.
On the other hand, some dual or primal (stochastic) sub-gradient descent methods have cheap per-iteration cost, but require O(1/ε²) iterations to find a solution of ε precision, and in practice can hardly even find a feasible solution satisfying all constraints [14].
The Augmented Lagrangian Method (ALM) was invented as early as 1969, and since then several works have developed Linear Programming solvers based on ALM [15, 16, 17]. However, the challenge for ALM is that it produces a series of bound-constrained quadratic problems that, in the traditional sense, are harder to solve than the linear systems produced by IPM or simplex methods [17]. Specifically, in a Projected-CG approach [18], one needs to solve several linear systems via CG to find a solution to the bound-constrained quadratic program, with no guarantee on how many iterations this requires. On the other hand, the Projected Gradient Method (PGM), despite its guaranteed iteration complexity, converges very slowly in practice. More recently, multi-block ADMM [19, 20] was proposed as a variant of ALM that, in each iteration, updates only one pass (or even less) over the blocks of primal variables before each dual update; however, it requires a much smaller step size in the dual update to ensure convergence [20, 21] and thus a large number of iterations to reach even moderate precision. To our knowledge, there is still no report of a significant improvement of ALM-based methods over IPM or the simplex method for Linear Programming.
In recent years, the Coordinate Descent (CD) method has demonstrated efficiency on many machine learning problems with bound constraints or other non-smooth terms [9, 10, 22, 23, 24, 25], and has solid analysis of its iteration complexity [26, 27].
In this work, we show that the CD algorithm can be naturally combined with ALM to solve Linear Programs more efficiently than existing methods on large-scale problems. We provide an O((log(1/ε))²) iteration complexity for the Augmented Lagrangian with Coordinate Descent (AL-CD) algorithm that bounds the total number of CD updates required for an ε-precise solution, and describe an implementation of AL-CD with cost O(nnz(A)) for each pass of CD. In practice, an active-set strategy further reduces the cost of each iteration to the active set of variables and constraints for primal-sparse and dual-sparse LPs respectively, where a primal-sparse LP has most of its variables equal to zero, and a dual-sparse LP has few binding constraints at the optimal solution. Note that, unlike in IPM, the conditioning of each subproblem in ALM does not worsen over iterations [15, 16]. The AL-CD framework thus provides an alternative to interior-point and simplex methods when it is infeasible to solve an n × n (or m × m) linear system exactly.

2 Sparse Linear Program

We are interested in solving linear programs of the form

    min_{x ∈ R^n}  f(x) = c^T x
    s.t.  A_I x ≤ b_I ,  A_E x = b_E ,
          x_j ≥ 0, j ∈ [n_b],                                          (1)

where A_I is an m_I × n matrix of coefficients and A_E is m_E × n. Without loss of generality, we assume non-negativity constraints are imposed on the first n_b variables, denoted x_b, such that x = [x_b; x_f] and c = [c_b; c_f]. The inequality and equality coefficient matrices can then be partitioned as A_I = [A_I,b  A_I,f] and A_E = [A_E,b  A_E,f]. The dual problem of (1) then takes the form

    min_{y ∈ R^m}  g(y) = b^T y
    s.t.  −A_b^T y ≤ c_b ,  −A_f^T y = c_f ,
          y_i ≥ 0, i ∈ [m_I],                                          (2)

where m = m_I + m_E, b = [b_I; b_E], A_b = [A_I,b; A_E,b], A_f = [A_I,f; A_E,f], and y = [y_I; y_E].
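To make the primal–dual pairing in (1)–(2) concrete, here is a minimal numeric check on a toy instance of our own (not from the paper): a two-variable equality-constrained LP and its dual in the form (2), whose hand-solved optima satisfy c^T x* = −b^T y* under this sign convention.

```python
# Toy instance of primal (1): min x1 + 2*x2  s.t.  x1 + x2 = 1, x >= 0.
# Its dual in the form (2): min b*y  s.t.  -A^T y <= c, i.e. y >= -1.
c = [1.0, 2.0]          # costs (all variables nonnegative, so n_b = n)
b = [1.0]               # single equality right-hand side
x_star = [1.0, 0.0]     # primal optimum: all mass on the cheaper variable
y_star = [-1.0]         # dual optimum of min y s.t. y >= -1

primal_obj = sum(ci * xi for ci, xi in zip(c, x_star))   # c^T x*
dual_obj = sum(bi * yi for bi, yi in zip(b, y_star))     # b^T y*

# Under the sign convention of (2), the optimal values are negatives of
# each other, and the dual constraint is tight exactly on the support of x*.
assert abs(primal_obj + dual_obj) < 1e-12
assert abs(-y_star[0] - c[0]) < 1e-12 and -y_star[0] < c[1]
```

The tight constraint/support correspondence in the last assertion is exactly the complementary-slackness structure that primal and dual sparsity refer to later in the section.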
In most LPs arising in machine learning, m and n are both on the order of 10^5–10^6, for which an algorithm of cost O(mn), O(n²) or O(m²) is unacceptable. Fortunately, various types of sparsity are usually present in the problem and can be exploited to lower the complexity.
First, the constraint matrix A = [A_I; A_E] is usually quite sparse, in the sense that nnz(A) ≪ mn, and one can compute the matrix-vector product Ax in O(nnz(A)). However, most current LP solvers must not only form matrix-vector products but also solve a linear system involving A, which in general costs much more than O(nnz(A)) and can be up to O(min(n³, m³)) in the worst case. In particular, simplex-type methods, when moving from one corner to another, require solving a linear system that involves a sub-matrix of A with columns corresponding to the basic variables [11], while in an interior point method (IPM), one needs to solve a normal-equation system with matrix A D_t A^T to obtain the Newton direction, where D_t is a diagonal matrix that gradually enforces complementary slackness as the IPM iteration t grows [11]. While one remedy for the high complexity is to employ an iterative method such as Conjugate Gradient (CG) to solve the system inexactly within IPM, this approach can hardly handle the ill-conditioning that occurs as the IPM iterates approach the boundary [12]. On the other hand, the Augmented Lagrangian approach does not suffer such asymptotic ill-conditioning, so an iterative method with complexity linear in nnz(A) can be used to produce a sufficiently accurate solution for each sub-problem.
Besides sparsity in the constraint matrix A, two other types of structure, which we term primal and dual sparsity, are also prevalent in machine learning.
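The O(nnz(A)) matrix-vector product underpins everything that follows. A minimal sketch of column-compressed storage is below; the dictionary layout and names are ours, chosen for brevity rather than taken from the paper's implementation. Each stored entry is touched exactly once, and column j of A (needed by coordinate updates later) is available in O(nnz(a_j)).

```python
# Minimal CSC-style storage: for each column j, keep the (row, value) pairs
# of its non-zeros, so a matrix-vector product costs O(nnz(A)).
cols = {                       # A = [[2, 0, 0],
    0: [(0, 2.0)],             #      [0, 3, 1]]
    1: [(1, 3.0)],
    2: [(1, 1.0)],
}
m = 2                          # number of rows

def matvec(cols, m, x):
    """Compute A @ x touching each stored non-zero exactly once."""
    y = [0.0] * m
    for j, entries in cols.items():
        xj = x[j]
        if xj != 0.0:          # skipping zero entries of x exploits primal sparsity too
            for i, aij in entries:
                y[i] += aij * xj
    return y

assert matvec(cols, m, [1.0, 1.0, 1.0]) == [2.0, 4.0]
```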
A primal-sparse LP refers to an LP whose optimal solution x* has only a few non-zero elements, while a dual-sparse LP refers to an LP with few binding constraints at the optimum, which correspond to the non-zero dual variables. In the following, we give two examples of sparse LPs.

L1-Regularized Support Vector Machine  The problem of ℓ1-regularized multi-class Support Vector Machine [1] is

    min_{w_m, ξ_i}  λ Σ_{m=1}^{k} ‖w_m‖_1 + Σ_{i=1}^{l} ξ_i
    s.t.  w_{y_i}^T x_i − w_m^T x_i ≥ e_i^m − ξ_i, ∀(i, m),            (3)

where e_i^m = 0 if y_i = m, and e_i^m = 1 otherwise. The task is dual-sparse since, among all samples i and classes m, only those that lead to misclassification become binding constraints. Problem (3) is also primal-sparse since it performs feature selection through the ℓ1-penalty. Note that the constraint matrix in (3) is also sparse, since each constraint involves only two weight vectors, and the patterns x_i can themselves be sparse.

Sparse Inverse Covariance Estimation  Sparse Inverse Covariance Estimation aims to find a sparse matrix Ω that approximates the inverse of a covariance matrix. One of the most popular approaches solves a program of the form [6]

    min_{Ω ∈ R^{d×d}}  ‖Ω‖_1
    s.t.  ‖SΩ − I_d‖_max ≤ λ,                                          (4)

which is primal-sparse due to the ‖·‖_1 penalty. The problem has a dense constraint matrix, which however has special structure: the coefficient matrix S can be decomposed into the product of two low-rank and (possibly) sparse n × d matrices, S = Z^T Z. In case Z is sparse or n ≪ d, this decomposition can be exploited to solve the Linear Program much more efficiently.
We will discuss how to utilize such structure in section 4.3.

3 Primal and Dual Augmented Coordinate Descent

In this section, we describe an Augmented Lagrangian method (ALM) that carefully exploits the sparsity in an LP. The choice between primal and dual ALM depends on the type of sparsity present in the LP. In particular, a primal AL method solves a problem with few non-zero variables more efficiently, while dual ALM is more efficient for a problem with few binding constraints. In the following, we describe the algorithm only from the primal point of view; the dual version can be obtained by exchanging the roles of the primal (1) and the dual (2).

Algorithm 1 (Primal) Augmented Lagrangian Method
  Initialization: y^0 ∈ R^m and η_0 > 0.
  repeat
    1. Solve (6) to obtain (x^{t+1}, ξ^{t+1}) from y^t.
    2. Update y^{t+1} = y^t + η_t [ A_I x^{t+1} − b_I + ξ^{t+1} ; A_E x^{t+1} − b_E ].
    3. t = t + 1.
    4. Increase η_t by a constant factor if necessary.
  until ‖[A_I x^t − b_I]_+‖_∞ ≤ ε_p and ‖A_E x^t − b_E‖_∞ ≤ ε.

3.1 Augmented Lagrangian Method (Dual Proximal Method)

Let g(y) be the dual objective function (2), taking the value ∞ if y is infeasible. The primal AL algorithm can be interpreted as a dual proximal point algorithm [16] that at each iteration t solves

    y^{t+1} = argmin_y  g(y) + (1/(2η_t)) ‖y − y^t‖².                  (5)

Since g(y) is nonsmooth, (5) is not easier to solve than the original dual problem. However, the dual of (5) takes the form

    min_{x, ξ}  F(x, ξ) = c^T x + (η_t/2) ‖ [ A_I x − b_I + ξ ; A_E x − b_E ] + (1/η_t) [ y_I^t ; y_E^t ] ‖²
    s.t.  x_b ≥ 0,  ξ ≥ 0,                                             (6)

which is a bound-constrained quadratic problem.
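Algorithm 1 can be exercised end to end on a toy equality-constrained LP. In the sketch below (our own, not from the paper's released solver), the inner solver for (6) is a few projected-gradient steps rather than the paper's coordinate descent, purely to keep the example short; the value of η, the step size, and the iteration counts are all illustrative.

```python
import numpy as np

# Toy LP: min c^T x  s.t.  A x = b, x >= 0  (no inequality block, so xi is absent).
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

eta = 10.0                                        # AL parameter, held fixed here
y = np.zeros(1)                                   # dual multiplier
x = np.zeros(2)
step = 1.0 / (eta * np.linalg.norm(A, 2) ** 2)    # 1/L for the quadratic subproblem

for t in range(30):                               # outer ALM iterations
    for k in range(200):                          # inner solver: projected gradient
        grad = c + eta * A.T @ (A @ x - b + y / eta)
        x = np.maximum(x - step * grad, 0.0)      # keep x >= 0
    y = y + eta * (A @ x - b)                     # multiplier update of Algorithm 1

assert abs(x[0] - 1.0) < 1e-2 and x[1] < 1e-2    # optimum is x* = (1, 0)
assert abs(y[0] + 1.0) < 1e-2                     # dual optimum is y* = -1
assert np.linalg.norm(A @ x - b) < 1e-3           # primal feasibility
```

Note how the multiplier update uses the constraint residual of the (approximate) subproblem minimizer, exactly as in step 2 of Algorithm 1; the inner solver can be swapped for the RCD of section 3.2 without changing the outer loop.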
Note that, given (x, ξ) as the Lagrange multipliers of (5), the corresponding y minimizing the Lagrangian L(x, ξ, y) is

    y(x, ξ) = η_t [ A_I x − b_I + ξ ; A_E x − b_E ] + [ y_I^t ; y_E^t ],          (7)

and thus one can solve for (x*, ξ*) in (6) and find y^{t+1} through (7). The resulting algorithm is sketched in Algorithm 1. For problems of medium scale, (6) is not easier to solve than a linear system due to the non-negativity constraints, and thus ALM is not preferred to IPM in the traditional sense. However, for large-scale problems with m × n ≫ nnz(A), ALM becomes advantageous since: (i) the conditioning of (6) does not worsen over iterations, which allows iterative methods to solve it approximately in time proportional to O(nnz(A)); (ii) for a primal-sparse (dual-sparse) problem, most of the primal (dual) variables become binding at zero as the iterates approach the optimal solution, which yields a potentially much smaller subproblem.

3.2 Solving the Subproblem via Coordinate Descent

Given a dual solution y^t, we employ a variant of the Randomized Coordinate Descent (RCD) method to solve subproblem (6). First, we note that, given x, the variables ξ can be minimized in closed form as

    ξ(x) = [ b_I − A_I x − y_I^t/η_t ]_+,                                          (8)

where the function [v]_+ truncates each element of the vector v to be non-negative, [v]_{+,i} = max{v_i, 0}. Then (6) can be re-written as

    min_x  F̂(x) = c^T x + (η_t/2) ‖ [ [A_I x − b_I + y_I^t/η_t]_+ ; A_E x − b_E + y_E^t/η_t ] ‖²
    s.t.  x_b ≥ 0.                                                                 (9)

Algorithm 2 RCD for subproblem (6)
  INPUT: η_t > 0 and (x^{t,0}, w^{t,0}, v^{t,0}) satisfying relations (11), (12).
  OUTPUT: (x^{t,k}, w^{t,k}, v^{t,k})
  repeat
    1. Pick a coordinate j uniformly at random.
    2. Compute ∇_j F̂(x), ∇²_j F̂(x).
    3. Obtain the Newton direction d*_j.
    4. Do line search (15) to find the step size.
    5. Update x^{t,k+1} ← x^{t,k} + β^r d*_j e_j.
    6. Maintain relations (11), (12).
    7. k ← k + 1.
  until ‖d*(x)‖_∞ ≤ ε_t

Algorithm 3 PN-CG for subproblem (6)
  INPUT: η_t > 0 and (x^{t,0}, w^{t,0}, v^{t,0}) satisfying relations (11), (12).
  OUTPUT: (x^{t,k}, w^{t,k}, v^{t,k})
  repeat
    1. Identify the active variables A_{t,k}.
    2. Compute [∇_j F(x)]_{A_{t,k}} and set D_{t,k}.
    3. Find the Newton direction d*_{A_{t,k}} with CG.
    4. Find the step size via projected line search.
    5. Update x^{t,k+1} ← (x^{t,k} + β^r d*_{A_{t,k}})_+.
    6. Maintain relations (11), (12).
    7. k ← k + 1.
  until ‖d*_{A_{t,k}}‖_∞ ≤ ε_t

Denote the objective function of (9) as F̂(x). Its gradient can be expressed as

    ∇F̂(x) = c + η_t A_I^T [w]_+ + η_t A_E^T v,                                    (10)

where

    w = A_I x − b_I + y_I^t/η_t,                                                   (11)
    v = A_E x − b_E + y_E^t/η_t,                                                   (12)

and the (generalized) Hessian of (9) is

    ∇²F̂(x) = η_t A_I^T D(w) A_I + η_t A_E^T A_E,                                  (13)

where D(w) is an m_I × m_I diagonal matrix with D_ii(w) = 1 if w_i > 0 and D_ii(w) = 0 otherwise.
The RCD algorithm then proceeds as follows. In each iteration k, it picks a coordinate j ∈ {1, .., n} uniformly at random and minimizes with respect to that coordinate.
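A quick numpy check (ours; dense arrays and random data purely for illustration) that the maintained vectors of relations (11)–(12) stay consistent under single-coordinate updates, each of which touches only one column of the constraint matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
m_I, m_E, n = 3, 2, 4
A_I, A_E = rng.standard_normal((m_I, n)), rng.standard_normal((m_E, n))
b_I, b_E = rng.standard_normal(m_I), rng.standard_normal(m_E)
y = rng.standard_normal(m_I + m_E)                # current dual iterate y^t
eta = 2.0

x = np.zeros(n)
# Relations (11)-(12), initialized once at cost O(nnz(A)).
w = A_I @ x - b_I + y[:m_I] / eta
v = A_E @ x - b_E + y[m_I:] / eta

for _ in range(10):                               # a few single-coordinate updates
    j = int(rng.integers(n))
    delta = rng.standard_normal()                 # stand-in for the step beta^r * d_j
    x[j] += delta
    w += delta * A_I[:, j]                        # O(nnz of column j), as in the text
    v += delta * A_E[:, j]

# The maintained vectors match a from-scratch O(nnz(A)) recomputation.
assert np.allclose(w, A_I @ x - b_I + y[:m_I] / eta)
assert np.allclose(v, A_E @ x - b_E + y[m_I:] / eta)
```

With w and v maintained this way, the coordinate-wise gradient and second derivative needed by the Newton step reduce to inner products against a single column, which is the O(nnz(a_j)) cost claimed in the text.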
The minimization is conducted by a single-variable Newton step, which first finds the Newton direction d*_j by minimizing a quadratic approximation

    d*_j = argmin_d  ∇_j F̂(x^{t,k}) d + (1/2) ∇²_j F̂(x^{t,k}) d²
           s.t.  x_j^{t,k} + d ≥ 0,                                                (14)

and then conducts a line search to find the smallest r ∈ {0, 1, 2, ...} satisfying

    F̂(x^{t,k} + β^r d*_j e_j) − F̂(x^{t,k}) ≤ σ β^r ∇_j F̂(x^{t,k}) d*_j           (15)

for line-search parameters σ ∈ (0, 1/2], β ∈ (0, 1), where e_j denotes the vector with jth element equal to 1 and all others equal to 0. Note that the single-variable problem (14) has the closed-form solution

    d*_j = [ x_j^{t,k} − ∇_j F̂(x^{t,k}) / ∇²_j F̂(x^{t,k}) ]_+ − x_j^{t,k},        (16)

which in a naive implementation takes O(nnz(A)) time due to the computation of (11) and (12). However, in a more careful implementation, one can maintain the relations (11), (12) as follows whenever a coordinate x_j is updated by β^r d*_j:

    [ w^{t,k+1} ; v^{t,k+1} ] = [ w^{t,k} ; v^{t,k} ] + β^r d*_j [ a_j^I ; a_j^E ],   (17)

where a_j = [a_j^I ; a_j^E] denotes the jth column of [A_I ; A_E]. Then the gradient and the (generalized) second derivative of the jth coordinate,

    ∇_j F̂(x) = c_j + η_t ⟨a_j^I, [w]_+⟩ + η_t ⟨a_j^E, v⟩,
    ∇²_j F̂(x) = η_t ( Σ_{i: w_i > 0} (a_{i,j}^I)² + Σ_i (a_{i,j}^E)² ),            (18)

can be computed in O(nnz(a_j)) time. Similarly, for each coordinate update, one can evaluate the difference of function values F̂(x^{t,k} + d*_j e_j) − F̂(x^{t,k}) in O(nnz(a_j)) by computing only the terms related to the jth variable.
The overall procedure for solving the subproblem is summarized in Algorithm 2. In practice, a random permutation is used instead of uniform sampling to ensure that every coordinate is updated once before proceeding to the next round, which speeds up convergence and eases the checking of the stopping condition ‖d*(x)‖_∞ ≤ ε_t, and an active-set strategy is employed to avoid updating variables with d*_j = 0. We describe the details in section 4.

3.3 Convergence Analysis

In this section, we prove the iteration complexity of the AL-CD method. Existing analysis [26, 27] shows that Randomized Coordinate Descent can be up to n times faster than gradient-based methods under certain conditions. However, proving a global linear rate of convergence requires the objective function to be strongly convex, which is not true for our sub-problem (6). Here we follow the approach of [28, 29] to show global linear convergence of Algorithm 2 by exploiting the fact that, when restricted to a constant subspace, (6) is strongly convex. All proofs are included in the appendix.
Theorem 1 (Linear Convergence). Denote by F* the optimum of (6) and let x̄ = [x; ξ].
The iterates {x̄^k}_{k=0}^∞ of the RCD Algorithm 2 satisfy

    E[F(x̄^{k+1})] − F* ≤ (1 − 1/(γ n̄)) ( E[F(x̄^k)] − F* ),                       (19)

where
    γ = max{ 16 η_t M θ (F^0 − F*) , 2 M θ (1 + 4 L_g²) , 6 },
M = max_{j ∈ [n̄]} ‖ā_j‖² is an upper bound on the coordinate-wise second derivative, L_g is the local Lipschitz constant of the function g(z) = η_t ‖z − b + y^t/η_t‖², and θ is the constant of Hoffman's bound, which depends on the polyhedron formed by the set of optimal solutions.

The following theorem then bounds the number of iterations required to find an ε_0-precise solution in terms of the proximal minimization (5).
Theorem 2 (Inner Iteration Complexity). Denote by y(x̄^k) the dual solution (7) corresponding to the primal iterate x̄^k. To guarantee

    ‖y(x̄^k) − y^{t+1}‖ ≤ ε_0                                                      (20)

with probability 1 − p, it suffices to run the RCD Algorithm 2 for a number of iterations

    k ≥ 2 γ n̄ log( √( 2(F(x̄^0) − F*) / (η_t p) ) · (1/ε_0) ).                     (21)

Now we prove the overall iteration complexity of AL-CD. Note that the existing linear-convergence analysis of ALM on Linear Programs [16] assumes exact solutions of the subproblem (6), which is not possible in practice. Our next theorem extends the linear-convergence result to the case where subproblems are solved inexactly and, in particular, bounds the total number of coordinate descent updates required to find an ε-accurate solution.
Theorem 3 (Iteration Complexity). Denote by {ŷ^t}_{t=1}^∞ the sequence of iterates obtained from inexact dual proximal updates, by {y^t}_{t=1}^∞ that generated by exact updates, and by y_{S*} the projection of y onto the set of optimal dual solutions. To guarantee ‖ŷ^t − ŷ^t_{S*}‖ ≤ 2ε with probability 1 − p, it suffices to run Algorithm 1 for

    T = (1 + 1/α) log(LR/ε)                                                        (22)

outer iterations with η_t = (1 + α)L, and to solve each sub-problem (6) by running Algorithm 2 for

    k ≥ 2 γ n̄ ( log(ω/ε) + (3/2) log( (1 + 1/α) log(LR/ε) ) )

inner iterations, where L is a constant depending on the polyhedral set of optimal solutions, ω = √( 2(1+α)L(F^0 − F*) / p ), R = ‖prox_{η_t g}(y^0) − y^0‖, and F^0, F* are upper and lower bounds on the initial and optimal function values of the subproblem, respectively.

3.4 Fast Asymptotic Convergence via Projected Newton-CG

The RCD algorithm converges efficiently to a solution of moderate precision, but for some problems a higher precision may be required. In that case, we switch the subproblem solver from RCD to a Projected Newton-CG (PN-CG) method once the iterates are close enough to the optimum. Note that the Projected Newton method does not have a global iteration complexity, but converges fast for iterates very close to the optimum.
Denote by F(x) the objective in (9). Each iterate of PN-CG begins by finding the set of active variables, defined as

    A_{t,k} = { j | x_j^{t,k} > 0 ∨ ∇_j F(x^{t,k}) < 0 }.                           (23)

The algorithm then fixes x_j^{t,k} = 0 for all j ∉ A_{t,k} and solves a Newton linear system with respect to j ∈ A_{t,k},

    [∇²_{A_{t,k}} F(x^{t,k})] d = −[∇_{A_{t,k}} F(x^{t,k})],                        (24)

to obtain a direction d* for the current active variables.
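The active-set restriction and the diagonal mask D(w) are what make the Newton system (24) cheap to apply. The dense-numpy sanity check below (ours, with random illustrative data) verifies a masked Hessian-vector product against the explicitly formed restricted Hessian of (13); a real implementation would of course exploit sparsity instead of forming anything dense.

```python
import numpy as np

rng = np.random.default_rng(1)
m_I, m_E, n = 4, 2, 5
A_I, A_E = rng.standard_normal((m_I, n)), rng.standard_normal((m_E, n))
eta = 3.0
x = np.maximum(rng.standard_normal(n), 0.0)     # current iterate, x >= 0
x[0] = 1.0                                      # ensure a non-trivial active set
w = rng.standard_normal(m_I)                    # stand-in for A_I x - b_I + y_I/eta
grad = rng.standard_normal(n)                   # stand-in for grad F(x)

# Active set (23): free variables, plus bound variables with a descent direction.
active = np.where((x > 0) | (grad < 0))[0]
D = (w > 0).astype(float)                       # diagonal of D(w)

def hvp(s):
    """Hessian-vector product restricted to the active set: columns of A_I, A_E
    limited to `active`, rows of A_I masked by D(w) as in the binding set."""
    z = eta * A_I[:, active].T @ (D * (A_I[:, active] @ s))
    z += eta * A_E[:, active].T @ (A_E[:, active] @ s)
    return z

# Check against the explicitly formed restricted Hessian of (13).
H = eta * (A_I.T @ np.diag(D) @ A_I + A_E.T @ A_E)
s = rng.standard_normal(len(active))
assert np.allclose(hvp(s), H[np.ix_(active, active)] @ s)
```

Only rows with D_ii > 0 contribute to the first term, so in a sparse implementation the product costs only the non-zeros of the active columns and binding rows, which is the point of the complexity expression that follows.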
Let d_{A_{t,k}} denote the size-n vector taking the values of d* for j ∈ A_{t,k} and the value 0 for j ∉ A_{t,k}. The algorithm then conducts a projected line search to find the smallest r ∈ {0, 1, 2, ...} satisfying

    F([x^{t,k} + β^r d_{A_{t,k}}]_+) − F(x^{t,k}) ≤ σ β^r ⟨∇F(x^{t,k}), d_{A_{t,k}}⟩,   (25)

and updates x by x^{t,k+1} ← (x^{t,k} + β^r d_{A_{t,k}})_+. Compared to an interior-point method, one key to the tractability of this approach lies in the conditioning of the linear system (24), which does not worsen as the outer iteration t increases, so an iterative Conjugate Gradient (CG) method can be used to obtain an accurate solution without factorizing the Hessian matrix. The only operation required within CG is the Hessian-vector product

    [∇²_{A_{t,k}} F(x^{t,k})] s = η_t [ A_I^T D(w^{t,k}) A_I + A_E^T A_E ]_{A_{t,k}} s,   (26)

where the operator [·]_{A_{t,k}} takes the sub-matrix with row and column indices belonging to A_{t,k}. For a primal- or dual-sparse LP, the product (26) can be evaluated very efficiently, since it involves only the non-zero elements in the columns of A_I, A_E belonging to the active set, and the rows of A_I corresponding to the binding constraints, for which D_ii(w^{t,k}) > 0. The overall cost of the product (26) is only

    O( nnz([A_I]_{D_{t,k}, A_{t,k}}) + nnz([A_E]_{:, A_{t,k}}) ),

where D_{t,k} = { i | w_i^{t,k} > 0 } is the set of currently binding constraints. Considering that the computational bottleneck of PN-CG is the CG iterations for solving the linear system (24), the efficient computation of the product (26) reduces the overall complexity of PN-CG significantly. The whole procedure is summarized in Algorithm 3.

4 Practical Issues

4.1 Precision of Subproblem Minimization

In practice, it is unnecessary to solve subproblem (6) to high precision, especially in the early iterations of ALM.
In our implementation, we employ a two-phase strategy: in the first phase we limit the cost spent on each sub-problem (6) to a constant multiple of nnz(A), while in the second phase we dynamically increase the AL parameter η_t and the inner precision ε_t to ensure sufficient decrease in the primal and dual infeasibility, respectively. The two-phase strategy is particularly useful for primal- or dual-sparse problems, where the sub-problems in the latter phase have smaller active sets, resulting in less computation even when they are solved to high precision.

4.2 Active-Set Strategy

Our implementation of Algorithm 2 maintains an active set of variables A, which initially contains all variables; during the RCD iterates, any variable x_j binding at 0 with gradient ∇_j F greater than a threshold δ is excluded from A until the end of the current subproblem solve. A is re-initialized after each dual proximal update (7). Note that in the initial phase the cost spent on each subproblem is a constant multiple of nnz(A), so if |A| is small one spends more iterations on the active variables, achieving faster convergence.

4.3 Dealing with a Decomposable Constraint Matrix

When the m × n constraint matrix A = U V^T can be decomposed into the product of an m × r matrix U and an r × n matrix V^T, with r ≪ min{m, n} or nnz(U) + nnz(V) ≪ nnz(A), we can re-formulate the constraint Ax ≤ b as U z ≤ b, V^T x = z with auxiliary variables z ∈ R^r. This new representation reduces the cost of the Hessian-vector product in Algorithm 3 and the cost of each pass of CD in Algorithm 2 from O(nnz(A)) to O(nnz(U) + nnz(V)).

5 Numerical Experiments

Table 1: Timing Results (in sec. unless specified o.w.)
on Multiclass L1-regularized SVM

Data        | n_b        | m_I        | P-Simp. | D-Simp. | Barrier | D-ALCD | P-ALCD
rcv1        | 4,833,738  | 778,200    | > 48hr  | > 48hr  | > 48hr  | 3,155  | 3,452
news        | 2,498,415  | 302,765    | 37,912  | > 48hr  | > 48hr  | 395    | 148
sector      | 11,597,992 | 666,848    | 9,282   | > 48hr  | > 48hr  | 2,029  | 1,419
mnist       | 75,620     | 540,000    | 2,556   | 6,454   | 73,036  | 7,207  | 146
cod-rna.rf  | 69,537     | 59,535     | 5,738   | 86,130  | > 48hr  | 2,676  | 3,130
vehicle     | 79,429     | 157,646    | 3,296   | 143.33  | 8,858   | 598    | 31
real-sim    | 114,227    | 72,309     | 49,405  | > 48hr  | 89,476  | 297    | 179

Table 2: Timing Results (in sec. unless specified o.w.) on Sparse Inverse Covariance Estimation

Data     | n_b    | m_I    | m_E    | n_f    | P-Simp | D-Simp | Barrier | D-ALCD | P-ALCD
textmine | 60,876 | 60,876 | 43,038 | 43,038 | > 48hr | > 48hr | > 48hr  | 18,507 | 103
E2006    | 55,834 | 55,834 | 32,174 | 32,174 | > 48hr | > 48hr | 43,096  | 4,207  | 47
dorothea | 47,232 | 47,232 | 1,600  | 1,600  | 94,623 | 3,980  | > 48hr  | 38     | 82

Table 3: Timing Results (in sec. unless specified o.w.) for Nonnegative Matrix Factorization

Data      | n_b       | m_I        | P-Simp. | D-Simp. | Barrier | D-ALCD | P-ALCD
micromass | 2,896,770 | 4,107,438  | > 96hr  | > 96hr  | 280,230 | 12,119 | 12,966
ocr       | 6,639,433 | 13,262,864 | > 96hr  | > 96hr  | 284,530 | 40,242 | > 96hr

In this section, we compare the AL-CD algorithm with the state-of-the-art implementations of interior-point and primal, dual simplex methods in the commercial LP solver CPLEX, which is of top efficiency among the many LP solvers investigated in [30]. For all experiments, the stopping criterion requires both the primal and dual infeasibility (in the ℓ∞-norm) to be smaller than 10⁻³, and we set the initial sub-problem tolerance ε_t = 10⁻² and η_t = 1. The LP instances are generated from L1-SVM (3), Sparse Inverse Covariance Estimation (4), and Nonnegative Matrix Factorization [3].
For the Sparse Inverse\nCovariance Estimation problem, we use technique introduced in section 4.3 to decompose the low-\nrank matrix S, and since (4) results in d independent problems for each column of the estimated\nmatrix, we report result on only one of them. The data source and statistics are included in the\nappendix.\nAmong all experiments, we observe that the proposed primal, dual AL-CD methods become partic-\nularly advantageous when the matrix A is sparse. For example, for text data set rcv1, real-sim and\nnews in Table 1, the matrix A is particularly sparse and AL-CD can be orders of magnitude faster\nthan other approaches by avoiding solving n \u00d7 n linear system exactly. In addition, the dual-ALCD\n(also dual simplex) is more ef\ufb01cient in L1-SVM problem due to the problem\u2019s strong dual sparsity,\nwhile the primal-ALCD is more ef\ufb01cient on the primal-sparse Inverse Covariance estimation prob-\nlem. For the Nonnegative Matrix Factorization problem, both the dual and primal LP solutions are\nnot particularly sparse due to the choice of matrix approximation tolerance (1% of #samples), but\nthe AL-CD approach is still comparably more ef\ufb01cient.\nAcknowledgement We acknowledge the support of ARO via W911NF-12-1-0390, and the support\nof NSF via grants CCF-1320746, CCF-1117055, IIS-1149803, IIS-1320894, IIS-1447574, DMS-\n1264033, and NIH via R01 GM117594-01 as part of the Joint DMS/NIGMS Initiative to Support\nResearch at the Interface of the Biological and Mathematical Sciences.\n\n8\n\n\fReferences\n[1] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. NIPS, 2004.\n[2] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.\n[3] N. Gillis and R. Luce. Robust near-separable nonnegative matrix factorization using linear optimization.\n\nJMLR, 2014.\n\n[4] A. Nellore and R. Ward. Recovery guarantees for exemplar-based clustering. arXiv., 2013.\n[5] I. Yen, X. 
Lin, K. Zhong, P. Ravikumar, and I. Dhillon. A convex exemplar-based approach to MAD-Bayes Dirichlet process mixture models. In ICML, 2015.
[6] M. Yuan. High dimensional inverse covariance matrix estimation via linear programming. JMLR, 2010.
[7] D. Bello and G. Riano. Linear programming solvers for Markov decision processes. In Systems and Information Engineering Design Symposium, pages 90-95, 2006.
[8] T. Joachims. Training linear SVMs in linear time. In KDD. ACM, 2006.
[9] C. Hsieh, K. Chang, C. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, volume 307. ACM, 2008.
[10] G. Yuan, K. Chang, C. Hsieh, and C. Lin. A comparison of optimization methods and software for large-scale l1-regularized linear classification. JMLR, 11, 2010.
[11] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
[12] J. Gondzio. Interior point methods 25 years later. EJOR, 2012.
[13] J. Gondzio. Matrix-free interior point method. Computational Optimization and Applications, 2012.
[14] V. Eleuterio and D. Lucia. Finding approximate solutions for large scale linear programs. Thesis, 2009.
[15] Y. Evtushenko, A. Golikov, and N. Mollaverdy. Augmented Lagrangian method for large-scale linear programming problems. Optimization Methods and Software, 20(4-5):515-524, 2005.
[16] F. Delbos and J. C. Gilbert. Global linear convergence of an augmented Lagrangian algorithm for solving convex quadratic optimization problems. 2003.
[17] O. Güler. Augmented Lagrangian algorithms for linear programming. Journal of Optimization Theory and Applications, 75(3):445-470, 1992.
[18] J. Moré and G. Toraldo. On the solution of large quadratic programming problems with bound constraints. SIAM Journal on Optimization, 1(1):93-113, 1991.
[19] M. Hong and Z. Luo. On linear convergence of alternating direction method of multipliers.
arXiv, 2012.
[20] H. Wang, A. Banerjee, and Z. Luo. Parallel direction method of multipliers. In NIPS, 2014.
[21] C. Chen, B. He, Y. Ye, and X. Yuan. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming, 2014.
[22] I. Dhillon, P. Ravikumar, and A. Tewari. Nearest neighbor based greedy coordinate descent. In NIPS, 2011.
[23] I. Yen, C. Chang, T. Lin, S. Lin, and S. Lin. Indexed block coordinate descent for large-scale linear classification with limited memory. In KDD. ACM, 2013.
[24] I. Yen, S. Lin, and S. Lin. A dual-augmented block minimization framework for learning with limited memory. In NIPS, 2015.
[25] K. Zhong, I. Yen, I. Dhillon, and P. Ravikumar. Proximal quasi-Newton for computationally intensive l1-regularized m-estimators. In NIPS, 2014.
[26] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1-38, 2014.
[27] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
[28] P. Wang and C. Lin. Iteration complexity of feasible descent methods for convex optimization. JMLR, 15(1):1523-1548, 2014.
[29] I. Yen, C. Hsieh, P. Ravikumar, and I. S. Dhillon. Constant nullspace strong convexity and fast convergence of proximal methods under high-dimensional settings. In NIPS, 2014.
[30] B. Meindl and M. Templ. Analysis of commercial and free and open source solvers for linear optimization problems. Eurostat and Statistics Netherlands, 2012.
[31] A. J. Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 49(4):263-265, 1952.
[32] A. Rahimi and B. Recht.
Random features for large-scale kernel machines. In NIPS, 2007.
[33] I. Yen, T. Lin, S. Lin, P. Ravikumar, and I. Dhillon. Sparse random feature algorithm as coordinate descent in Hilbert space. In NIPS, 2014.