{"title": "Stochastic Primal-Dual Method for Empirical Risk Minimization with O(1) Per-Iteration Complexity", "book": "Advances in Neural Information Processing Systems", "page_first": 8366, "page_last": 8375, "abstract": "Regularized empirical risk minimization problems with linear predictors appear frequently in machine learning. In this paper, we propose a new stochastic primal-dual method to solve this class of problems. Unlike existing methods, the proposed method requires only O(1) operations per iteration. We also develop a variance-reduction variant of the algorithm that converges linearly. Numerical experiments suggest that our methods are faster than existing ones such as proximal SGD, SVRG and SAGA on high-dimensional problems.", "full_text": "Stochastic Primal-Dual Method for Empirical Risk Minimization with O(1) Per-Iteration Complexity

Conghui Tan∗
The Chinese University of Hong Kong
chtan@se.cuhk.edu.hk

Tong Zhang
Tencent AI Lab
tongzhang@tongzhang-ml.org

Shiqian Ma
University of California, Davis
sqma@math.ucdavis.edu

Ji Liu
Tencent AI Lab, University of Rochester
ji.liu.uwisc@gmail.com

Abstract

Regularized empirical risk minimization problems with linear predictors appear frequently in machine learning. In this paper, we propose a new stochastic primal-dual method to solve this class of problems. Unlike existing methods, the proposed method requires only O(1) operations per iteration. We also develop a variance-reduction variant of the algorithm that converges linearly.
Numerical experiments suggest that our methods are faster than existing ones such as proximal SGD, SVRG and SAGA on high-dimensional problems.

1 Introduction

In this paper, we consider convex regularized empirical risk minimization with linear predictors:

    min_{x∈X} { P(x) := (1/n) Σ_{i=1}^n φ_i(a_i^⊤ x) + g(x) },    (1)

where X ⊂ R^d is a closed convex feasible set, a_i ∈ R^d is the i-th data sample, φ_i is its corresponding closed convex loss function, and g(x): X → R is a closed convex regularizer for the model parameter x. Here we assume that the feasible set X and the regularizer g(x) are both separable, i.e.,

    X = X_1 × ··· × X_d  and  g(x) = Σ_{j=1}^d g_j(x_j).    (2)

Problem (1) with structure (2) generalizes many well-known classification and regression problems. For example, the support vector machine takes this form with φ_i(u) = max{0, 1 − b_i u} and g(x) = (λ/2)‖x‖². Other examples include ℓ1 logistic regression, ℓ2 logistic regression, and LASSO.

One popular method for solving (1) is proximal stochastic gradient descent (PSGD). In each iteration of PSGD, an index i is randomly sampled from {1, 2, . . . , n}, and the iterates are then updated using only the information in a_i and φ_i. As a result, the per-iteration cost of PSGD is O(d), independent of n.

It is well known that PSGD converges at a sub-linear rate [9] even for strongly convex problems, due to the non-diminishing variance of the stochastic gradients. One line of research is dedicated to improving the convergence rate of PSGD by utilizing the finite-sum structure in (1).
Some representative works include SVRG [5, 16], SDCA [13, 14], SAGA [4] and SPDC [18]. All these accelerated variants enjoy linear convergence when g(x) is strongly convex and all φ_i's are smooth.

∗This work was done while Conghui Tan was a research intern at Tencent AI Lab.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Since all of these algorithms need to sample at least one full data vector a_i in each iteration (so their per-iteration cost is at least O(d)), they have two potential drawbacks: 1) they are not suitable for distributed learning when the features are distributed; 2) they may incur heavy per-iteration computation in the high-dimensional case, i.e., when d is very large.

Our contributions. In this paper, we explore the possibility of accelerating PSGD by making each iteration more lightweight: only one coordinate of one data sample, i.e., one entry a_ij, is sampled in each iteration of the algorithm. This leads to a new algorithm, named SPD1 (stochastic primal-dual method with O(1) per-iteration complexity), whose per-iteration cost is only O(1). We prove that the convergence rate of the new method is O(1/√t) for convex problems and O(ln t / t) for strongly convex and smooth problems, where t is the iteration counter. Moreover, the overall computational cost is the same as that of PSGD in high-dimensional settings. Therefore, we manage to reduce the per-iteration complexity from O(d) to O(1) while keeping the total computational cost of the same order. Furthermore, by incorporating the variance reduction technique, we develop a variant of SPD1, named SPD1-VR, that converges linearly for strongly convex problems. Compared with existing methods, our SPD1 and SPD1-VR algorithms are more suitable for distributed systems, as they allow the flexibility of distributing either features or data.
An additional advantage of our O(1) per-iteration algorithms is that they are more amenable to asynchronous parallelization and bring more speedup, since they tolerate much better the staleness caused by asynchrony. Our numerical tests indicate that our methods are faster than both PSGD and SVRG on high-dimensional problems, even in the single-machine setting.

We note that [6] and [17] used similar ideas, in the sense that in each iteration only one coordinate of the iterate is updated using one sampled data point. However, their algorithms still need to sample the full vector a_i to compute the directional gradient, so their per-iteration cost is still O(d).

1.1 Notation

We use A ∈ R^{n×d} to denote the data matrix, whose rows are denoted by a_i, i = 1, 2, . . . , n. We use a_ij to denote the j-th entry of a_i. [n] denotes the set {1, 2, . . . , n}. x^t denotes the iterate in the t-th iteration and x_j^t is its j-th entry. We always use ‖w‖ to denote the ℓ2 norm of w unless otherwise specified.

For a function f(z) with domain Z, its proximal mapping is defined as

    prox_f(z) = argmin_{z'∈Z} { f(z') + (1/2)‖z' − z‖² }.    (3)

We use ∂f(z) to denote the subdifferential of f at the point z.
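To make the proximal mapping (3) concrete (an illustration added here, not part of the original notation), the two regularizers used later in the paper admit closed-form proximal mappings. A minimal Python sketch, with helper names of our own choosing:

```python
import numpy as np

def prox_l1(z, lam):
    """Proximal mapping of f(z) = lam * ||z||_1 (soft thresholding):
    argmin_{z'} lam*||z'||_1 + 0.5*||z' - z||^2, applied entrywise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def prox_sq(z, lam):
    """Proximal mapping of f(z) = (lam/2)*||z||^2: solving the first-order
    condition lam*z' + (z' - z) = 0 gives the shrinkage z / (1 + lam)."""
    return z / (1.0 + lam)
```

For example, `prox_l1(np.array([3.0, -0.5]), 1.0)` shrinks the first entry to 2.0 and zeroes out the second, matching the soft-thresholding operator mentioned in Section 2.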
f is said to be µ-strongly convex (µ > 0) if

    f(z') ≥ f(z) + s^⊤(z' − z) + (µ/2)‖z' − z‖²,  ∀s ∈ ∂f(z), ∀z, z' ∈ Z.

The conjugate function of φ_i : R → R is defined as

    φ_i*(y) = sup_{x∈R} { y·x − φ_i(x) }.    (4)

The function φ_i is L-Lipschitz continuous if

    |φ_i(x) − φ_i(y)| ≤ L|x − y|,  ∀x, y ∈ R,

which is equivalent to

    |s| ≤ L,  ∀s ∈ ∂φ_i(x), ∀x.

φ_i is said to be (1/γ)-smooth if it is differentiable and its derivative is (1/γ)-Lipschitz continuous. We use R := max_{i∈[n]} ‖a_i‖ to denote the maximum row norm of A, and R' := max_{j∈[d]} ‖a'_j‖ to denote the maximum column norm of A, where a'_1, . . . , a'_d are the columns of the matrix A.

2 Stochastic Primal-Dual Method with O(1) Per-Iteration Cost

Our algorithm solves the following equivalent primal-dual reformulation of (1):

    min_{x∈X} max_{y∈Y} { F(x, y) := (1/n) y^⊤Ax − (1/n) Σ_{i=1}^n φ_i*(y_i) + g(x) },    (5)

where Y = Y_1 × ··· × Y_n and Y_i = {y_i ∈ R | φ_i*(y_i) < ∞} is the dual feasible set resulting from the conjugate function φ_i*. For example, when φ_i is L-Lipschitz continuous, we have Y_i ⊂ [−L, L] (see Lemma 4 in the Supplementary Materials).

Our SPD1 algorithm for solving (5) is presented in Algorithm 1. In each iteration of the algorithm, only one coordinate of x and one coordinate of y are updated, using a single randomly sampled data entry a_{i_t j_t}. Therefore, SPD1 only requires O(1) time per iteration.
Because of this, SPD1 can be seen as a kind of randomized coordinate descent method.

Algorithm 1 Stochastic Primal-Dual Method with O(1) Per-Iteration Cost (SPD1)

    Parameters: primal step sizes {η_t}, dual step sizes {τ_t}
    Initialize x^0 = argmin_{x∈X} g(x) and y_i^0 = argmin_{y_i∈Y_i} φ_i*(y_i) for all i ∈ [n]
    for t = 0, 1, . . . , T − 1 do
        Randomly sample i_t ∈ [n] and j_t ∈ [d] independently and uniformly
        x_j^{t+1} = prox_{η_t g_j}( x_j^t − η_t · a_{i_t j} y_{i_t}^t )  if j = j_t;  x_j^t  if j ≠ j_t
        y_i^{t+1} = prox_{(τ_t/d) φ_i*}( y_i^t + τ_t · a_{i j_t} x_{j_t}^t )  if i = i_t;  y_i^t  if i ≠ i_t
    end for
    Output: x̂_T = (1/T) Σ_{t=0}^{T−1} x^t and ŷ_T = (1/T) Σ_{t=0}^{T−1} y^t

The intuition behind this algorithm is as follows. One can rewrite (5) as

    min_{x∈X} max_{y∈Y} { F(x, y) = (1/n) Σ_{i=1}^n Σ_{j=1}^d [ a_ij y_i x_j − (1/d) φ_i*(y_i) + g_j(x_j) ] },

which has a two-layer finite-sum structure. Algorithm 1 can then be viewed as a primal-dual version of stochastic gradient descent (SGD) on this finite-sum problem, which samples a pair of indices (i_t, j_t) and uses only the corresponding summand in each update. Hence, one can view SPD1 as a combination of the randomized coordinate descent method and the stochastic gradient descent method applied to the primal-dual reformulation (5).

Note that in the initialization stage, we need to minimize g(x) and φ_i*(y_i). Since we assume the proximal mappings of g(x) and φ_i* are easy to compute, these two direct minimization problems should be even easier, and thus cause no trouble in the implementation of this algorithm.
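As an illustration of the O(1) update (a sketch of ours, not the paper's pseudocode), one SPD1 iteration for the special case g(x) = (λ/2)‖x‖², whose coordinate-wise proximal step has the closed form v/(1 + ηλ), can be written as follows; the step-size schedule and the user-supplied `prox_phi_conj` are placeholders:

```python
import numpy as np

def spd1_step(A, x, y, t, prox_phi_conj, lam, rng):
    """One SPD1 iteration (cf. Algorithm 1) for g(x) = (lam/2)*||x||^2.
    Only the single entry A[i_t, j_t] is read, so the cost is O(1).
    prox_phi_conj(i, v, step) should return prox_{step * phi_i^*}(v)."""
    n, d = A.shape
    i_t, j_t = rng.integers(n), rng.integers(d)
    eta = tau = 1.0 / np.sqrt(t + 1)  # illustrative O(1/sqrt(t)) decay

    x_old = x[j_t]  # the dual step must use the pre-update x_{j_t}^t
    # Primal step: prox of eta*g_j is the shrinkage v / (1 + eta*lam).
    x[j_t] = (x_old - eta * A[i_t, j_t] * y[i_t]) / (1.0 + eta * lam)
    # Dual step: ascent on y_{i_t} using the same sampled entry.
    y[i_t] = prox_phi_conj(i_t, y[i_t] + tau * A[i_t, j_t] * x_old, tau / d)
    return x, y
```

For an L-Lipschitz loss such as φ_i(u) = L|u|, whose conjugate is the indicator of [−L, L], `prox_phi_conj` reduces to clipping the dual variable into [−L, L].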
For example, when g(x) = λ‖x‖_1, it is well known that its proximal mapping is the soft-thresholding operator, while its direct minimizer, namely x^0 = argmin_x g(x), is simply x^0 = 0.

As a final remark, we point out that the primal-dual reformulation (5) is a convex-concave bilinear saddle point problem (SPP). This problem has drawn a lot of research attention recently. For example, Chambolle and Pock developed an efficient primal-dual gradient method for solving bilinear SPPs in [2], which attains an accelerated rate in certain circumstances. Besides, in [3], Dang and Lan proposed a randomized algorithm for solving (5). However, their algorithm needs to use the full-dimensional gradient in each iteration, so its per-iteration cost is much higher than that of SPD1.

3 SPD1 with Variance Reduction

As discussed above, SPD1 is closely connected to SGD. Hence, we can incorporate the variance reduction technique [5] to reduce the variance of the stochastic gradients and thereby improve the convergence rate of SPD1. This new algorithm, named SPD1-VR, is presented in Algorithm 2. Similar to SVRG [5], SPD1-VR has a two-loop structure. In the outer loop, snapshots of the full gradients are computed for both x̃^k and ỹ^k. In the inner loop, the updates are similar to those of Algorithm 1, but the stochastic gradient is replaced by its variance-reduced version.
That is, a_{i_t j_t} y_{i_t}^t is replaced by a_{i_t j_t}(y_{i_t}^t − ỹ_{i_t}^k) + G_{x,j_t}^k, where G_{x,j_t}^k is the j_t-th coordinate of the latest snapshot of the full gradient. This variance-reduced stochastic gradient is still an unbiased estimator of the full gradient along the direction x_{j_t}, i.e.,

    E_{i_t}[ a_{i_t j_t}(y_{i_t}^t − ỹ_{i_t}^k) + G_{x,j_t}^k ] = E_{i_t}[ a_{i_t j_t} y_{i_t}^t ] = (1/n) Σ_{i=1}^n a_{i j_t} y_i^t.

Because of the variance reduction technique, fixed step sizes η and τ can be used instead of diminishing ones.

Algorithm 2 SPD1 with Variance Reduction (SPD1-VR)

    Parameters: primal step size η, dual step size τ
    Initialize x̃^0 ∈ X and ỹ^0 ∈ Y
    for k = 0, 1, . . . , K − 1 do
        Compute the full gradients G_x^k = (1/n) A^⊤ ỹ^k and G_y^k = (1/d) A x̃^k
        Let (x^0, y^0) = (x̃^k, ỹ^k)
        for t = 0, 1, . . . , T − 1 do
            Randomly sample i_t, i'_t ∈ [n] and j_t, j'_t ∈ [d] independently
            x̄_j^t = prox_{η g_j}[ x_j^t − η·( a_{i'_t j}(y_{i'_t}^t − ỹ_{i'_t}^k) + G_{x,j}^k ) ]  if j = j_t;  x_j^t  if j ≠ j_t
            ȳ_i^t = prox_{(τ/d) φ_i*}[ y_i^t + τ·( a_{i j'_t}(x_{j'_t}^t − x̃_{j'_t}^k) + G_{y,i}^k ) ]  if i = i_t;  y_i^t  if i ≠ i_t
            x_j^{t+1} = prox_{η g_j}[ x_j^t − η·( a_{i_t j}(ȳ_{i_t}^t − ỹ_{i_t}^k) + G_{x,j}^k ) ]  if j = j_t;  x_j^t  if j ≠ j_t
            y_i^{t+1} = prox_{(τ/d) φ_i*}[ y_i^t + τ·( a_{i j_t}(x̄_{j_t}^t − x̃_{j_t}^k) + G_{y,i}^k ) ]  if i = i_t;  y_i^t  if i ≠ i_t
        end for
        Set (x̃^{k+1},
ỹ^{k+1}) = (x^T, y^T)
    end for
    Output: x̃^K and ỹ^K

Besides the variance reduction technique, another crucial difference between SPD1 and SPD1-VR is that the latter is in fact an extragradient method [7]. Note that each iteration of the inner loop of SPD1-VR consists of two gradient steps: the first step is a normal gradient descent/ascent step, while the second step starts from x^t and y^t but uses the gradient estimates at (x̄^t, ȳ^t). For saddle point problems, the extragradient method has stronger convergence guarantees than simple gradient methods [8]. Moreover, in each iteration of SPD1-VR, two independent pairs of random indices (i_t, j_t) and (i'_t, j'_t) are drawn, because two stochastic gradients are needed for the extragradient framework. As in the classical analysis of stochastic algorithms, we need the stochastic gradients to be independent. However, when updating x̄^t and x^{t+1}, we choose the same coordinate j_t, so the independence property is only required for the two directional stochastic gradients along coordinate j_t.

We note that every iteration of the inner loop of SPD1-VR involves only O(1) operations. Full gradients are computed in each outer loop, at a computational cost of O(nd).

Finally, we mention that [12] also developed a variance-reduction method for solving convex-concave saddle point problems, which is related to Algorithm 2. However, apart from the common variance reduction idea, the method in [12] and SPD1-VR are quite different. First, there is no coordinate descent counterpart in their method, so its per-iteration cost is much higher than ours. Second, their method is a gradient method rather than an extragradient method like SPD1-VR.
Third, their method has a quadratic dependence on the problem condition number unless an extra acceleration technique is incorporated, while our method depends only linearly on the condition number, as shown in Section 4.

4 Iteration Complexity Analysis

4.1 Iteration Complexity of SPD1

In this subsection, we analyze the convergence rate of SPD1 (Algorithm 1). We measure the optimality of a solution by the primal-dual gap, defined as

    G(x̂_T, ŷ_T) := sup_{y∈Y} F(x̂_T, y) − inf_{x∈X} F(x, ŷ_T).

Note that the primal-dual gap equals 0 if and only if (x̂_T, ŷ_T) is a pair of primal-dual optimal solutions to problem (5). Besides, the primal-dual gap is always an upper bound on the primal sub-optimality:

    G(x̂_T, ŷ_T) ≥ sup_{y∈Y} F(x̂_T, y) − sup_{y∈Y} inf_{x∈X} F(x, y)
               ≥ sup_{y∈Y} F(x̂_T, y) − inf_{x∈X} sup_{y∈Y} F(x, y)
               = P(x̂_T) − inf_{x∈X} P(x).

Our main result on the iteration complexity of SPD1 is summarized in Theorem 1.

Theorem 1. Assume each φ_i is L-Lipschitz continuous, and the primal feasible set X is bounded, i.e.,

    D := sup_{x∈X} ‖x‖ < ∞.

If we choose the step sizes in SPD1 as

    η_t = √(2d) D / (L R √(t+1))  and  τ_t = √(2d) L n / (D R' √(t+1)),

then we have the following convergence rate for SPD1:

    E[G(x̂_T, ŷ_T)] ≤ √(2d) L D (R + R') / √T.    (6)

Note that when the problem is high-dimensional, i.e., d ≥ n, it usually holds that R ≥ R'.
In this case, Theorem 1 implies that SPD1 requires

    O( d L² D² R² / ε² )    (7)

iterations to ensure that the primal-dual gap is smaller than ε.

Under the same assumptions, if we directly apply the classical result of Nemirovski et al. [9] for PSGD to the primal problem (1), the number of iterations needed by PSGD to reduce the primal sub-optimality below ε is

    O( L² D² R² / ε² ).

Considering that each iteration of PSGD costs O(d) computation, its overall complexity is actually the same as (7).

If we further impose strong convexity and smoothness assumptions, we obtain the improved iteration complexity shown in Theorem 2.

Theorem 2. Suppose the assumptions of Theorem 1 hold. Moreover, assume that g(x) is µ-strongly convex (µ > 0) and all φ_i are (1/γ)-smooth (γ > 0). If the step sizes in SPD1 are chosen as

    η_t = 2 / (µ(t + 4))  and  τ_t = 2nd / (γ(t + 4)),

we have the following convergence rate for SPD1:

    E[G(x̂_T, ŷ_T)] ≤ [ 4dD²µ + 4L²γ + (2L²R²/µ)·ln(T + 4) + (2dD²R'²/γ)·ln(T + 4) ] / T.    (8)

Compared with the classical convergence rate of PSGD, the convergence rate of SPD1 depends not only on µ but also on the dual strong convexity parameter γ. This is reasonable because SPD1 has stochastic updates for both the primal and dual variables, and it is why γ > 0 is necessary for ensuring the O(ln T / T) convergence rate. Furthermore, we believe that the factor ln T in (8) is removable, so that an O(1/T) convergence rate can be obtained by applying more sophisticated analysis techniques such as optimal averaging [15].
We do not pursue this here, to keep the paper succinct.

4.2 Iteration Complexity of SPD1-VR

In this subsection, we analyze the iteration complexity of SPD1-VR (Algorithm 2). Before stating the main result, we first introduce the notion of condition number. When each φ_i is (1/γ)-smooth and g(x) is µ-strongly convex, the condition number of the primal problem (1) defined in the literature is (see, e.g., [18]):

    κ = R² / (µγ).

Here we also define another condition number:

    κ' = d R'² / (n µ γ).

Since R is the maximum row norm and R' is the maximum column norm of the data matrix A ∈ R^{n×d}, R² and (d/n)R'² are usually of the same order, which means κ' ≈ κ. Without loss of generality, we assume that κ ≥ 1 and κ' ≥ 1.

Theorem 3. Assume each φ_i is (1/γ)-smooth (γ > 0) and g(x) is µ-strongly convex (µ > 0). If we choose the step sizes in SPD1-VR as

    η = (γ / (128 R²)) · min{ dκ/(nκ'), 1 }  and  τ = (nµ / (128 R'²)) · min{ nκ'/(dκ), 1 },    (9)

and let T ≥ c · max{dκ, nκ'} for some uniform constant c > 0 independent of the problem setting, then SPD1-VR converges linearly in expectation:

    E[∆_K] ≤ (3/5)^K · ∆_0,

where

    ∆_k = ‖x̃^k − x*‖²/η + ‖ỹ^k − y*‖²/τ

and (x*, y*) is a pair of primal-dual optimal solutions to (5).

This theorem implies that we need O(log(1/ε)) outer loops, or O(max{dκ, nκ'} log(1/ε)) inner loops, to ensure E[∆_k] ≤ ε.
Considering that there is Θ(nd) extra computational cost for computing the full gradients in every outer loop, the total complexity of this algorithm is

    O( (nd + max{dκ, nκ'}) log(1/ε) ).    (10)

As a comparison, the complexity of SVRG in the same setting is [16]:

    O( d(n + κ) log(1/ε) ).    (11)

Since it usually holds that κ' ≈ κ, dκ dominates nκ' when d > n. In this case, the two complexity bounds (10) and (11) are the same.

Although the theoretical complexity of SPD1-VR is the same as that of SVRG when d ≥ n, we empirically found that SPD1-VR is significantly faster than SVRG on high-dimensional problems (see Section 5), as it admits much larger step sizes than the ones suggested by the theory in (9). We conjecture that this is due to the power of coordinate descent: Nesterov's seminal work [11] rigorously proved that coordinate descent can reduce the Lipschitz constant of the problem, and thus allows larger step sizes than gradient descent. However, due to the sophisticated coupling of the primal and dual variable updates in our algorithm, our analysis is currently unable to reflect this property.

Figure 1: Experimental results on synthetic data. n is set to 1000 in all figures, while d varies from 100 to 10000. λ is fixed as λ = 10⁻³. The y-axis is the primal sub-optimality, namely P(x^t) − P(x*).

We point out that existing accelerated algorithms such as SPDC [18] and Katyusha [1] have the better complexity

    O( d(n + √(nκ)) log(1/ε) ).    (12)

These accelerated algorithms employ Nesterov's extrapolation techniques [10] to accelerate convergence.
We believe that it is also possible to incorporate the same technique to further accelerate SPD1-VR, but we leave this as future work.

5 Experiments

In this section, we conduct numerical experiments with our proposed algorithms. Due to space limitations, we only present part of the results here; more experiments can be found in the supplementary materials. We consider solving a classification problem with the logistic loss function

    φ_i(u) = log(1 + exp{−b_i u}),

where b_i ∈ {±1} is the class label. Note that this loss function is smooth. Although φ_i* does not admit a closed-form proximal mapping, following [13] we apply Newton's method to compute it, which achieves very high accuracy within very few (say, 5) steps. Since the proximal sub-problem for φ_i* is a 1-dimensional optimization problem, using Newton's method here is actually quite cheap. Besides, we use g(x) = (λ/2)‖x‖² as the regularizer.

We compare SPD1 and SPD1-VR with some standard stochastic algorithms for solving (1), including PSGD, SVRG and SAGA. We always set T = nd for SPD1-VR and T = n for SVRG, where T is the number of inner iterations in each outer loop.

5.1 Results on Synthetic Data

Our theory in Section 4 suggests that the performance of the proposed methods depends on the relationship between n and d, so we test them on synthetic datasets with different n and d to observe this effect. To generate the data, we first randomly sample the data matrix A and a vector x̄ ∈ R^d with entries drawn i.i.d. from N(0, 1), and then generate the labels as

    b_i = sign(a_i^⊤ x̄ + ε_i),  ε_i ∼ N(0, σ²),

for some constant σ² > 0.
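The data-generation procedure above can be sketched as follows (a minimal helper; the function name and default values are ours, not from the paper):

```python
import numpy as np

def make_synthetic(n=1000, d=100, sigma=1.0, seed=0):
    """Synthetic data as in Section 5.1: A and x_bar have i.i.d. N(0,1)
    entries, and b_i = sign(a_i^T x_bar + eps_i) with eps_i ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    x_bar = rng.standard_normal(d)
    eps = sigma * rng.standard_normal(n)
    b = np.sign(A @ x_bar + eps)
    b[b == 0] = 1.0  # sign(0) = 0 occurs with probability zero; map it to +1
    return A, b, x_bar
```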
Since the focus here is the relationship between n and d, in order to simplify the experiments we fix n = 1000 but vary d over {100, 1000, 10000}.

The results are presented in Figure 1. When d = 100 < n, SPD1 is clearly slower than PSGD, and SPD1-VR is also inferior to both SVRG and SAGA. When d = 1000 = n, even though SPD1 falls behind PSGD at the beginning, their final performance is quite close, and SPD1-VR begins to beat both SVRG and SAGA. Finally, when d > n, SPD1 becomes clearly faster than PSGD, and SPD1-VR is also significantly better than SVRG and SAGA. This indicates that our algorithms SPD1 and SPD1-VR are preferable in practice for high-dimensional problems.

5.2 Results on Real Datasets

In this part, we demonstrate the efficiency of the proposed methods on real datasets. Here we only focus on the high-dimensional case, where d > n or d ≈ n. We test all the algorithms on three real datasets, colon-cancer, gisette and rcv1.binary, downloaded from the LIBSVM website². The attributes of these datasets and the λ used for each are summarized in Table 1.

Table 1: Summary of datasets

    Dataset        n       d       λ
    colon-cancer   62      2,000   1
    gisette        6,000   5,000   10⁻²
    rcv1.binary    20,242  47,236  10⁻³

Figure 2: Numerical results on three real datasets. The y-axis is also the primal sub-optimality.

The experimental results on these real datasets are shown in Figure 2. For the colon-cancer dataset, where d is much larger than n, the performance of SPD1-VR dominates that of all other methods, and SPD1 also performs better than PSGD.
For the gisette dataset, where n is slightly larger than d, SPD1-VR still outperforms all other competitors, but this time SPD1 is slower than PSGD. Besides, on rcv1.binary, SPD1 and SPD1-VR are better than PSGD and SVRG/SAGA, respectively. These results on real datasets further confirm that our proposed methods, especially SPD1-VR, are faster than existing algorithms on high-dimensional problems.

6 Conclusion

In this paper, we developed two stochastic primal-dual algorithms, named SPD1 and SPD1-VR, for solving regularized empirical risk minimization problems. Different from existing methods, our proposed algorithms have a brand-new updating style, which uses only one coordinate of one sampled data point in each iteration. As a result, the per-iteration cost is very low and the algorithms are well suited to distributed systems. We proved that the overall convergence properties of SPD1 and SPD1-VR resemble those of PSGD and SVRG, respectively, under certain conditions, and empirically showed that they are faster than existing methods such as PSGD, SVRG and SAGA in high-dimensional settings. We believe that our new methods have great potential for large-scale distributed optimization applications.

²www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

7 Acknowledgement

S. Ma is partly supported by a startup package from the Department of Mathematics at UC Davis and by the National Natural Science Foundation of China under Grant 11631013. J. Liu is in part supported by NSF CCF1718513, an IBM faculty award, and an NEC fellowship.

References

[1] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization.
In International Conference on Machine Learning, pages 699–707, 2016.

[2] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.

[3] Cong Dang and Guanghui Lan. Randomized first-order methods for saddle point optimization. https://arxiv.org/abs/1409.8625, 2014.

[4] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[5] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[6] Jakub Konečný, Zheng Qu, and Peter Richtárik. Semi-stochastic coordinate descent. Optimization Methods and Software, 32(5):993–1005, 2017.

[7] G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.

[8] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

[9] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[10] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004.

[11] Yurii Nesterov.
Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[12] Balamurugan Palaniappan and Francis Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pages 1416–1424, 2016.

[13] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.

[14] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In International Conference on Machine Learning, pages 64–72, 2014.

[15] Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.

[16] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[17] Adams Wei Yu, Qihang Lin, and Tianbao Yang. Doubly stochastic primal-dual coordinate method for empirical risk minimization and bilinear saddle-point problem. arXiv preprint arXiv:1508.03390, 2015.

[18] Yuchen Zhang and Lin Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. The Journal of Machine Learning Research, 18(1):2939–2980, 2017.
", "award": [], "sourceid": 5067, "authors": [{"given_name": "Conghui", "family_name": "Tan", "institution": "The Chinese University of Hong Kong"}, {"given_name": "Tong", "family_name": "Zhang", "institution": "Tencent AI Lab"}, {"given_name": "Shiqian", "family_name": "Ma", "institution": null}, {"given_name": "Ji", "family_name": "Liu", "institution": "University of Rochester, Tencent AI lab"}]}