{"title": "Stochastic Proximal Gradient Descent with Acceleration Techniques", "book": "Advances in Neural Information Processing Systems", "page_first": 1574, "page_last": 1582, "abstract": "Proximal gradient descent (PGD) and stochastic proximal gradient descent (SPGD) are popular methods for solving regularized risk minimization problems in machine learning and statistics. In this paper, we propose and analyze an accelerated variant of these methods in the mini-batch setting. This method incorporates two acceleration techniques: one is Nesterov's acceleration method, and the other is a variance reduction for the stochastic gradient. Accelerated proximal gradient descent (APG) and proximal stochastic variance reduction gradient (Prox-SVRG) are in a trade-off relationship. We show that our method, with the appropriate mini-batch size, achieves lower overall complexity than both APG and Prox-SVRG.", "full_text": "Stochastic Proximal Gradient Descent with\n\nAcceleration Techniques\n\nAtsushi Nitanda\n\nNTT DATA Mathematical Systems Inc.\n\n1F Shinanomachi Rengakan, 35,\n\nShinanomachi, Shinjuku-ku, Tokyo,\n\n160-0016, Japan\n\nnitanda@msi.co.jp\n\nAbstract\n\nProximal gradient descent (PGD) and stochastic proximal gradient descent\n(SPGD) are popular methods for solving regularized risk minimization problems\nin machine learning and statistics. In this paper, we propose and analyze an ac-\ncelerated variant of these methods in the mini-batch setting. This method incor-\nporates two acceleration techniques: one is Nesterov\u2019s acceleration method, and\nthe other is a variance reduction for the stochastic gradient. Accelerated proxi-\nmal gradient descent (APG) and proximal stochastic variance reduction gradient\n(Prox-SVRG) are in a trade-off relationship. 
We show that our method, with the\nappropriate mini-batch size, achieves lower overall complexity than both APG and\nProx-SVRG.\n\n1 Introduction\n\nThis paper considers the following optimization problem:\n\nminimize_{x \u2208 R^d} f(x) := g(x) + h(x),   (1)\n\nwhere g is the average of the smooth convex functions g_1, . . . , g_n from R^d to R, i.e.,\ng(x) = (1/n) \u2211_{i=1}^{n} g_i(x), and h : R^d \u2192 R is a relatively simple convex function that can be non-differentiable.\nIn machine learning, we often encounter optimization problems of this form. For example, given\na sequence of training examples (a_1, b_1), . . . , (a_n, b_n), where a_i \u2208 R^d and b_i \u2208 R, if we set\ng_i(x) = (1/2)(a_i^T x \u2212 b_i)^2, then we obtain ridge regression by setting h(x) = (\u03bb/2)\u2016x\u2016^2, or we obtain\nLasso by setting h(x) = \u03bb|x|. If we set g_i(x) = log(1 + exp(\u2212b_i x^T a_i)), then we obtain regularized\nlogistic regression.\n\nTo solve the optimization problem (1), one popular method is proximal gradient descent (PGD),\nwhich can be described by the following update rule for k = 1, 2, . . .:\n\nx_{k+1} = prox_{\u03b7_k h}(x_k \u2212 \u03b7_k \u2207g(x_k)),\n\nwhere prox is the proximity operator,\n\nprox_{\u03b7h}(y) = arg min_{x \u2208 R^d} { (1/2)\u2016x \u2212 y\u2016^2 + \u03b7h(x) }.\n\nA stochastic variant of PGD is stochastic proximal gradient descent (SPGD), where at each iteration\nk = 1, 2, . . ., we pick i_k randomly from {1, 2, . . . , n}, and take the following update:\n\nx_{k+1} = prox_{\u03b7_k h}(x_k \u2212 \u03b7_k \u2207g_{i_k}(x_k)).\n\nThe advantage of SPGD over PGD is that at each iteration, SPGD only requires the computation\nof a single gradient \u2207g_{i_k}(x_k). In contrast, each iteration of PGD evaluates all n gradients. Thus\nthe computational cost of SPGD per iteration is 1/n that of PGD. However, due to the variance\nintroduced by random sampling, SPGD obtains a slower convergence rate than PGD. 
In this paper\nwe consider problem (1) under the following assumptions.\nAssumption 1. Each convex function g_i(x) is L-Lipschitz smooth, i.e., there exists L > 0 such that\nfor all x, y \u2208 R^d,\n\n\u2016\u2207g_i(x) \u2212 \u2207g_i(y)\u2016 \u2264 L\u2016x \u2212 y\u2016.   (2)\n\nFrom (2), one can derive the following inequality,\n\ng_i(x) \u2264 g_i(y) + (\u2207g_i(y), x \u2212 y) + (L/2)\u2016x \u2212 y\u2016^2.   (3)\n\nAssumption 2. g(x) is \u00b5-strongly convex; i.e., there exists \u00b5 > 0 such that for all x, y \u2208 R^d,\n\ng(x) \u2265 g(y) + (\u2207g(y), x \u2212 y) + (\u00b5/2)\u2016x \u2212 y\u2016^2.   (4)\n\nNote that L \u2265 \u00b5 necessarily holds.\nAssumption 3. The regularization function h(x) is a lower semi-continuous proper convex function;\nhowever, it can be non-differentiable or non-continuous.\n\nUnder Assumptions 1 and 2 with h(x) \u2261 0, PGD (which is equivalent to gradient descent in this\ncase) with a constant learning rate \u03b7_k = 1/L achieves a linear convergence rate. On the other hand, for\nstochastic (proximal) gradient descent, because of the variance introduced by random sampling, we\nneed to choose a diminishing learning rate \u03b7_k = O(1/k), and thus stochastic (proximal) gradient\ndescent converges at a sub-linear rate.\n\nTo improve stochastic (proximal) gradient descent, we need a variance reduction technique,\nwhich allows us to take a larger learning rate. Recently, several papers proposed such variance\nreduction methods for various special cases of (1). We refer the reader to [1\u201313]. In the case\nwhere g_i(x) is Lipschitz smooth and h(x) is strongly convex, Shalev-Shwartz and Zhang [1, 2]\nproposed a proximal stochastic dual coordinate ascent (Prox-SDCA); the same authors developed\naccelerated variants of SDCA [3, 4]. Le Roux et al. [5] proposed a stochastic average gradient\n(SAG) for the case where g_i(x) is Lipschitz smooth, g(x) is strongly convex, and h(x) \u2261 0. These\nmethods achieve a linear convergence rate. 
However, SDCA and SAG need to store all gradients (or\ndual variables), so that O(nd) storage is required in general problems. Although this can be reduced\nto O(n) for linear prediction problems, these methods may be unsuitable for more complex and\nlarge-scale problems. More recently, Johnson and Zhang [6] proposed stochastic variance reduction\ngradients (SVRG) for the case where g_i(x) is L-Lipschitz smooth, g(x) is \u00b5-strongly convex, and\nh(x) \u2261 0. SVRG achieves the following overall complexity (total number of component gradient\nevaluations to find an \u03b5-accurate solution),\n\nO((n + \u03ba) log(1/\u03b5)),   (5)\n\nwhere \u03ba is the condition number L/\u00b5. Furthermore, this method need not store all gradients. Xiao\nand Zhang [7] proposed a proximal variant of SVRG, called Prox-SVRG, which also achieves the\nsame complexity.\n\nAnother effective method for solving (1) is accelerated proximal gradient descent (APG), proposed\nby Nesterov [14, 15]. APG [14] is an accelerated variant of deterministic gradient descent and\nachieves the following overall complexity to find an \u03b5-accurate solution,\n\nO(n\u221a\u03ba log(1/\u03b5)).   (6)\n\nComplexities (5) and (6) are in a trade-off relationship. For example, if \u03ba = n, then the complexity\n(5) is less than (6). On the other hand, the complexity of APG has a better dependence on the\ncondition number \u03ba.\n\nIn this paper, we propose and analyze a new method called the Accelerated Mini-Batch Prox-SVRG\n(Acc-Prox-SVRG) for solving (1). Acc-Prox-SVRG incorporates two acceleration techniques in\nthe mini-batch setting: (1) Nesterov\u2019s acceleration method of APG and (2) a variance reduction\ntechnique of SVRG. 
We show that the overall complexity of this method, with an appropriate mini-batch\nsize, is lower than that of both Prox-SVRG and APG; even when the mini-batch size is not\nappropriate, our method is still comparable to APG or Prox-SVRG.\n\n2 Accelerated Mini-Batch Prox-SVRG\n\nAs mentioned above, to ensure convergence of SPGD, the learning rate \u03b7_k has to decay to zero\nto reduce the variance effect of the stochastic gradient. This slows down the convergence. As\na remedy to this issue, we use the variance reduction technique of SVRG [6] (see also [7]), which\nallows us to take a larger learning rate. Acc-Prox-SVRG is a multi-stage scheme. During each stage,\nthis method performs m APG-like iterations and employs the following mini-batch direction\ninstead of the gradient,\n\nv_k = \u2207g_{I_k}(y_k) \u2212 \u2207g_{I_k}(x\u0303) + \u2207g(x\u0303),   (7)\n\nwhere I_k = {i_1, . . . , i_b} is a randomly chosen size-b subset of {1, 2, . . . , n} and g_{I_k} = (1/b) \u2211_{j=1}^{b} g_{i_j}.\nAt the beginning of each stage, the initial point x_1 is set to be x\u0303, and at the end of the stage, x\u0303 is updated.\nConditioned on y_k, we can take expectation with respect to I_k and obtain E_{I_k}[v_k] = \u2207g(y_k), so\nthat v_k is an unbiased estimator. As described in the next section, the conditional variance E_{I_k}\u2016v_k \u2212\n\u2207g(y_k)\u2016^2 can be much smaller than E_i\u2016\u2207g_i(y_k) \u2212 \u2207g(y_k)\u2016^2 near the optimal solution. The pseudocode\nof our Acc-Prox-SVRG is given in Figure 1.\n\nParameters: update frequency m, learning rate \u03b7, mini-batch size b,\n  and non-negative sequence \u03b2_1, . . . , \u03b2_m\nInitialize x\u0303_1\nIterate: for s = 1, 2, . . .\n  x\u0303 = x\u0303_s\n  v\u0303 = (1/n) \u2211_{i=1}^{n} \u2207g_i(x\u0303)\n  x_1 = y_1 = x\u0303\n  Iterate: for k = 1, 2, . . . , m\n    Randomly pick subset I_k \u2282 {1, 2, . . . 
, n} of size b\n    v_k = \u2207g_{I_k}(y_k) \u2212 \u2207g_{I_k}(x\u0303) + v\u0303\n    x_{k+1} = prox_{\u03b7h}(y_k \u2212 \u03b7v_k)\n    y_{k+1} = x_{k+1} + \u03b2_k(x_{k+1} \u2212 x_k)\n  end\n  set x\u0303_{s+1} = x_{m+1}\nend\n\nIn our analysis, we focus on a basic variant of the algorithm (Figure 1) with \u03b2_k = (1 \u2212 \u221a(\u00b5\u03b7))/(1 + \u221a(\u00b5\u03b7)).\n\nFigure 1: Acc-Prox-SVRG\n\n3 Analysis\n\nIn this section, we present our analysis of the convergence rates of Acc-Prox-SVRG described in\nFigure 1 under Assumptions 1, 2, and 3, and provide some notation and definitions. Note that we\nmay omit the outer index s for notational simplicity. By the definition of a proximity operator, there\nexists a subgradient \u03be_k \u2208 \u2202h(x_{k+1}) such that\n\nx_{k+1} = y_k \u2212 \u03b7(v_k + \u03be_k).\n\nWe define the estimate sequence \u03a6_k(x) (k = 1, 2, . . . , m + 1) by\n\n\u03a6_1(x) = f(x_1) + (\u00b5/2)\u2016x \u2212 x_1\u2016^2 and\n\n\u03a6_{k+1}(x) = (1 \u2212 \u221a(\u00b5\u03b7))\u03a6_k(x) + \u221a(\u00b5\u03b7)(g_{I_k}(y_k) + (v_k, x \u2212 y_k) + (\u00b5/2)\u2016x \u2212 y_k\u2016^2\n  + h(x_{k+1}) + (\u03be_k, x \u2212 x_{k+1})), for k \u2265 1.\n\nWe set\n\n\u03a6*_k = min_{x \u2208 R^d} \u03a6_k(x) and z_k = arg min_{x \u2208 R^d} \u03a6_k(x).\n\nSince \u2207^2\u03a6_k(x) = \u00b5I_d, it follows that for all x \u2208 R^d,\n\n\u03a6_k(x) = (\u00b5/2)\u2016x \u2212 z_k\u2016^2 + \u03a6*_k.   (8)\n\nThe following lemma is the key to the analysis of our method.\n\nLemma 1. Consider Acc-Prox-SVRG in Figure 1 under Assumptions 1, 2, and 3. 
If \u03b7 \u2264 1/(2L), then\nfor k \u2265 1 we have\n\nE[\u03a6_k(x)] \u2264 f(x) + (1 \u2212 \u221a(\u00b5\u03b7))^{k\u22121}(\u03a6_1 \u2212 f)(x)   (9)\n\nand\n\nE[f(x_k)] \u2264 E[ \u03a6*_k + \u2211_{l=1}^{k\u22121} (1 \u2212 \u221a(\u00b5\u03b7))^{k\u22121\u2212l} { \u2212(\u00b5/2)((1 \u2212 \u00b5\u03b7)/\u221a(\u00b5\u03b7))\u2016x_l \u2212 y_l\u2016^2 + \u03b7\u2016\u2207g(y_l) \u2212 v_l\u2016^2 } ],   (10)\n\nwhere the expectation is taken with respect to the history of random variables I_1, . . . , I_{k\u22121}.\nNote that if the conditional variance of v_l is equal to zero, we immediately obtain a linear convergence\nrate from (9) and (10). Before we can prove Lemma 1, additional lemmas are required, whose\nproofs may be found in the Supplementary Material.\nLemma 2. If \u03b7 < 1/\u00b5, then for k \u2265 1 we have\n\nz_{k+1} = (1 \u2212 \u221a(\u00b5\u03b7))z_k + \u221a(\u00b5\u03b7)y_k \u2212 \u221a(\u03b7/\u00b5)(v_k + \u03be_k)   (11)\n\nand\n\nz_k \u2212 y_k = (1/\u221a(\u00b5\u03b7))(y_k \u2212 x_k).   (12)\n\nLemma 3. For k \u2265 1, we have\n\n(\u2207g(y_k) + \u03be_k, v_k + \u03be_k) = (1/2)(\u2016\u2207g(y_k) + \u03be_k\u2016^2 + \u2016v_k + \u03be_k\u2016^2 \u2212 \u2016\u2207g(y_k) \u2212 v_k\u2016^2),   (13)\n\n\u2016v_k + \u03be_k\u2016^2 \u2264 2(\u2016\u2207g(y_k) + \u03be_k\u2016^2 + \u2016\u2207g(y_k) \u2212 v_k\u2016^2), and   (14)\n\n\u2016\u2207g(y_k) + \u03be_k\u2016^2 \u2264 2(\u2016v_k + \u03be_k\u2016^2 + \u2016\u2207g(y_k) \u2212 v_k\u2016^2).   (15)\n\nProof of Lemma 1. Using induction, it is easy to show (9); the proof is in the Supplementary Material.\nNow we prove (10) by induction. From the definition of \u03a6_1, \u03a6*_1 = f(x_1). We assume (10) is true\nfor k. Using Eq. 
(11), we have\n\n\u2016y_k \u2212 z_{k+1}\u2016^2 = \u2016(1 \u2212 \u221a(\u00b5\u03b7))(y_k \u2212 z_k) + \u221a(\u03b7/\u00b5)(v_k + \u03be_k)\u2016^2\n  = (1 \u2212 \u221a(\u00b5\u03b7))^2\u2016y_k \u2212 z_k\u2016^2 + 2\u221a(\u03b7/\u00b5)(1 \u2212 \u221a(\u00b5\u03b7))(y_k \u2212 z_k, v_k + \u03be_k) + (\u03b7/\u00b5)\u2016v_k + \u03be_k\u2016^2.\n\nFrom the above equation and (8) with x = y_k, we get\n\n\u03a6_{k+1}(y_k) = \u03a6*_{k+1} + (\u00b5/2){ (1 \u2212 \u221a(\u00b5\u03b7))^2\u2016y_k \u2212 z_k\u2016^2 + 2\u221a(\u03b7/\u00b5)(1 \u2212 \u221a(\u00b5\u03b7))(y_k \u2212 z_k, v_k + \u03be_k)\n  + (\u03b7/\u00b5)\u2016v_k + \u03be_k\u2016^2 }.\n\nOn the other hand, from the definition of the estimate sequence and (8),\n\n\u03a6_{k+1}(y_k) = (1 \u2212 \u221a(\u00b5\u03b7))(\u03a6*_k + (\u00b5/2)\u2016y_k \u2212 z_k\u2016^2) + \u221a(\u00b5\u03b7)(g_{I_k}(y_k) + h(x_{k+1}) + (\u03be_k, y_k \u2212 x_{k+1})).\n\nTherefore, from these two equations, we have\n\n\u03a6*_{k+1} = (1 \u2212 \u221a(\u00b5\u03b7))\u03a6*_k + (\u00b5/2)(1 \u2212 \u221a(\u00b5\u03b7))\u221a(\u00b5\u03b7)\u2016y_k \u2212 z_k\u2016^2 + \u221a(\u00b5\u03b7)(g_{I_k}(y_k) + h(x_{k+1})\n  + (\u03be_k, y_k \u2212 x_{k+1})) \u2212 (1 \u2212 \u221a(\u00b5\u03b7))\u221a(\u00b5\u03b7)(y_k \u2212 z_k, v_k + \u03be_k) \u2212 (\u03b7/2)\u2016v_k + \u03be_k\u2016^2.   (16)\n\nSince g is Lipschitz smooth, we bound f(x_{k+1}) as follows:\n\nf(x_{k+1}) \u2264 g(y_k) + (\u2207g(y_k), x_{k+1} \u2212 y_k) + (L/2)\u2016x_{k+1} \u2212 y_k\u2016^2 + h(x_{k+1}).   (17)\n\nUsing (16), (17), (12), and x_{k+1} \u2212 y_k = \u2212\u03b7(v_k + \u03be_k), we have\n\nE_{I_k}[f(x_{k+1}) \u2212 \u03a6*_{k+1}]\n  \u2264 [by (16), (17)] E_{I_k}[ (1 \u2212 \u221a(\u00b5\u03b7))(\u2212\u03a6*_k + g(y_k) + h(x_{k+1})) + (\u2207g(y_k), x_{k+1} \u2212 y_k)\n  + \u221a(\u00b5\u03b7)(\u03be_k, x_{k+1} \u2212 y_k) + (L/2)\u2016x_{k+1} \u2212 y_k\u2016^2 \u2212 (\u00b5/2)(1 \u2212 \u221a(\u00b5\u03b7))\u221a(\u00b5\u03b7)\u2016y_k \u2212 z_k\u2016^2\n  + (1 \u2212 \u221a(\u00b5\u03b7))\u221a(\u00b5\u03b7)(y_k \u2212 z_k, v_k + \u03be_k) + (\u03b7/2)\u2016v_k + \u03be_k\u2016^2 ]   (18)\n  = [by (12)] E_{I_k}[ (1 \u2212 \u221a(\u00b5\u03b7))(\u2212\u03a6*_k + 
g(y_k) + h(x_{k+1}) + (x_k \u2212 y_k, v_k + \u03be_k)) \u2212 \u03b7(\u2207g(y_k), v_k + \u03be_k)\n  \u2212 \u03b7\u221a(\u00b5\u03b7)(\u03be_k, v_k + \u03be_k) \u2212 (\u00b5/2)((1 \u2212 \u221a(\u00b5\u03b7))/\u221a(\u00b5\u03b7))\u2016y_k \u2212 x_k\u2016^2 + (\u03b7/2)(L\u03b7 + 1)\u2016v_k + \u03be_k\u2016^2 ],   (19)\n\nwhere for the first inequality we used E_{I_k}[g_{I_k}(y_k)] = g(y_k). Here, we give the following bound:\n\nE_{I_k}[g(y_k) + h(x_{k+1}) + (x_k \u2212 y_k, v_k + \u03be_k)]\n  = E_{I_k}[g(y_k) + (v_k, x_k \u2212 y_k) + h(x_{k+1}) + (\u03be_k, x_k \u2212 x_{k+1}) + (\u03be_k, x_{k+1} \u2212 y_k)]\n  \u2264 E_{I_k}[g(x_k) \u2212 (\u00b5/2)\u2016x_k \u2212 y_k\u2016^2 + h(x_k) \u2212 \u03b7(\u03be_k, v_k + \u03be_k)],   (20)\n\nwhere for the first inequality we used E_{I_k}[v_k] = \u2207g(y_k) and convexity of g and h. Thus we have\n\nE_{I_k}[f(x_{k+1}) \u2212 \u03a6*_{k+1}]\n  \u2264 [by (19), (20)] E_{I_k}[ (1 \u2212 \u221a(\u00b5\u03b7))(f(x_k) \u2212 \u03a6*_k) \u2212 (\u00b5/2)((1 \u2212 \u00b5\u03b7)/\u221a(\u00b5\u03b7))\u2016x_k \u2212 y_k\u2016^2\n  \u2212 \u03b7(\u2207g(y_k) + \u03be_k, v_k + \u03be_k) + (\u03b7/2)(1 + L\u03b7)\u2016v_k + \u03be_k\u2016^2 ]\n  \u2264 [by (13)] E_{I_k}[ (1 \u2212 \u221a(\u00b5\u03b7))(f(x_k) \u2212 \u03a6*_k) \u2212 (\u00b5/2)((1 \u2212 \u00b5\u03b7)/\u221a(\u00b5\u03b7))\u2016x_k \u2212 y_k\u2016^2\n  \u2212 (\u03b7/2)\u2016\u2207g(y_k) + \u03be_k\u2016^2 + (L\u03b7^2/2)\u2016v_k + \u03be_k\u2016^2 + (\u03b7/2)\u2016v_k \u2212 \u2207g(y_k)\u2016^2 ]\n  \u2264 [by (14), \u03b7 \u2264 1/(2L)] E_{I_k}[ (1 \u2212 \u221a(\u00b5\u03b7))(f(x_k) \u2212 \u03a6*_k) \u2212 (\u00b5/2)((1 \u2212 \u00b5\u03b7)/\u221a(\u00b5\u03b7))\u2016x_k \u2212 y_k\u2016^2 + \u03b7\u2016v_k \u2212 \u2207g(y_k)\u2016^2 ].\n\nBy taking expectation with respect to the history of random variables I_1, . . . , I_{k\u22121}, the induction\nhypothesis finishes the proof of (10).\n\nOur bound on the variance of v_k is given in the following lemma, whose proof is in the Supplementary\nMaterial.\nLemma 4. 
Suppose Assumption 1 holds, and let x* = arg min_{x \u2208 R^d} f(x). Conditioned on y_k, we have\nthat\n\nE_{I_k}\u2016v_k \u2212 \u2207g(y_k)\u2016^2 \u2264 (1/b)((n \u2212 b)/(n \u2212 1))(2L^2\u2016y_k \u2212 x_k\u2016^2 + 8L(f(x_k) \u2212 f(x*) + f(x\u0303) \u2212 f(x*))).   (21)\n\nFrom (10), (21), and (9) with x = x*, it follows that\n\nE[f(x_k) \u2212 f(x*)] \u2264 (1 \u2212 \u221a(\u00b5\u03b7))^{k\u22121}(\u03a6_1 \u2212 f)(x*) + E[ \u2211_{l=1}^{k\u22121} (1 \u2212 \u221a(\u00b5\u03b7))^{k\u22121\u2212l}\n  \u00b7 { (\u2212(\u00b5/2)((1 \u2212 \u00b5\u03b7)/\u221a(\u00b5\u03b7)) + ((n \u2212 b)/(n \u2212 1))(2L^2\u03b7/b))\u2016x_l \u2212 y_l\u2016^2 + ((n \u2212 b)/(n \u2212 1))(8L\u03b7/b)(f(x_l) \u2212 f(x*) + f(x\u0303) \u2212 f(x*)) } ].\n\nIf \u03b7 \u2264 min{ ((pb)^2/64)((n \u2212 1)/(n \u2212 b))^2(\u00b5/L^2), 1/(2L) }, then the coefficients of \u2016x_l \u2212 y_l\u2016^2 are non-positive for p \u2264 2.\nIndeed, using\n\n\u03b7 \u2264 ((pb)^2/64)((n \u2212 1)/(n \u2212 b))^2(\u00b5/L^2) \u21d2 ((n \u2212 b)/(n \u2212 1))(L\u03b7/b) \u2264 (p/8)\u221a(\u00b5\u03b7), for p > 0,   (22)\n\nwe get\n\n\u2212(\u00b5/2)((1 \u2212 \u00b5\u03b7)/\u221a(\u00b5\u03b7)) + ((n \u2212 b)/(n \u2212 1))(2L^2\u03b7/b) \u2264 \u2212(\u00b5/2)((1 \u2212 \u00b5\u03b7)/\u221a(\u00b5\u03b7)) + (L/2)\u221a(\u00b5\u03b7)\n  = (1/(2\u221a(\u00b5\u03b7)))(\u2212\u00b5 + \u00b5^2\u03b7 + \u00b5L\u03b7) \u2264 [by \u00b5 \u2264 L] (1/(2\u221a(\u00b5\u03b7)))(\u2212\u00b5 + 2\u00b5L\u03b7) \u2264 [by \u03b7 \u2264 1/(2L)] 0.\n\nThus, using (22) again with p \u2264 1, we have\n\nE[f(x_k) \u2212 f(x*)] \u2264 (1 \u2212 \u221a(\u00b5\u03b7))^{k\u22121}(\u03a6_1 \u2212 f)(x*)\n  + E[ \u2211_{l=1}^{k\u22121} (1 \u2212 \u221a(\u00b5\u03b7))^{k\u22121\u2212l} p\u221a(\u00b5\u03b7)(f(x_l) \u2212 f(x*) + f(x\u0303) \u2212 f(x*)) ]\n  \u2264 (1 \u2212 \u221a(\u00b5\u03b7))^{k\u22121}(\u03a6_1 \u2212 f)(x*) + p(f(x\u0303) \u2212 f 
(x*))\n  + E[ \u2211_{l=1}^{k\u22121} (1 \u2212 \u221a(\u00b5\u03b7))^{k\u22121\u2212l} p\u221a(\u00b5\u03b7)(f(x_l) \u2212 f(x*)) ],   (23)\n\nwhere for the last inequality we used \u2211_{l=1}^{k\u22121} (1 \u2212 \u221a(\u00b5\u03b7))^{k\u22121\u2212l} \u2264 \u2211_{t=0}^{\u221e} (1 \u2212 \u221a(\u00b5\u03b7))^t = 1/\u221a(\u00b5\u03b7).\nTheorem 1. Suppose Assumptions 1, 2, and 3 hold. Let \u03b7 \u2264 min{ ((pb)^2/64)((n \u2212 1)/(n \u2212 b))^2(\u00b5/L^2), 1/(2L) } and 0 < p < 1.\nThen we have\n\nE[f(x\u0303_{s+1}) \u2212 f(x*)] \u2264 ((1 \u2212 (1 \u2212 p)\u221a(\u00b5\u03b7))^m + p/(1 \u2212 p))(2 + p)(f(x\u0303_s) \u2212 f(x*)).   (24)\n\nMoreover, if m \u2265 (1/((1 \u2212 p)\u221a(\u00b5\u03b7))) log((1 \u2212 p)/p), then it follows that\n\nE[f(x\u0303_{s+1}) \u2212 f(x*)] \u2264 (2p(2 + p)/(1 \u2212 p))(f(x\u0303_s) \u2212 f(x*)).   (25)\n\nFrom Theorem 1, we can see that for small p > 0 (e.g., p = 0.1), the overall complexity of Acc-Prox-SVRG\n(total number of component gradient evaluations to find an \u03b5-accurate solution) is\n\nO((n + b/\u221a(\u00b5\u03b7)) log(1/\u03b5)).\n\nThus, we have the following corollary:\nCorollary 1. Suppose Assumptions 1, 2, and 3 hold. Let p be sufficiently small, as stated above, and\n\u03b7 = min{ ((pb)^2/64)((n \u2212 1)/(n \u2212 b))^2(\u00b5/L^2), 1/(2L) }. If the mini-batch size b is smaller than\n\u23088\u221a\u03ba n/(\u221a2 p(n \u2212 1) + 8\u221a\u03ba)\u2309, then the learning rate \u03b7 is equal to ((pb)^2/64)((n \u2212 1)/(n \u2212 b))^2(\u00b5/L^2)\nand the overall complexity is\n\nO((n + ((n \u2212 b)/(n \u2212 1))\u03ba) log(1/\u03b5)).   (26)\n\nOtherwise, \u03b7 = 1/(2L) and the complexity becomes\n\nO((n + b\u221a\u03ba) log(1/\u03b5)).   (27)\n\nTable 1: Comparison of overall complexity. 
b_0 = 8\u221a\u03ba n/(\u221a2 p(n \u2212 1) + 8\u221a\u03ba).\n\nMethod | Overall complexity\nProx-SVRG | O((n + \u03ba) log(1/\u03b5))\nAcc-Prox-SVRG, b < \u2308b_0\u2309 | O((n + ((n \u2212 b)/(n \u2212 1))\u03ba) log(1/\u03b5))\nAPG [14] | O(n\u221a\u03ba log(1/\u03b5))\nAcc-Prox-SVRG, b \u2265 \u2308b_0\u2309 | O((n + b\u221a\u03ba) log(1/\u03b5))\n\nTable 1 lists the overall complexities of the algorithms that achieve linear convergence. As seen\nfrom Table 1, the complexity of Acc-Prox-SVRG monotonically decreases with respect to b < \u2308b_0\u2309,\nand monotonically increases when b \u2265 \u2308b_0\u2309, where b_0 = 8\u221a\u03ba n/(\u221a2 p(n \u2212 1) + 8\u221a\u03ba). Moreover, if b = 1, then\nAcc-Prox-SVRG has the same complexity as that of Prox-SVRG, while if b = n then the complexity\nof this method is equal to that of APG. Therefore, with an appropriate mini-batch size, Acc-Prox-SVRG\nmay outperform both Prox-SVRG and APG; even if the mini-batch size is not appropriate,\nAcc-Prox-SVRG is still comparable to Prox-SVRG or APG. The following overall complexity is\nthe best possible rate of Acc-Prox-SVRG,\n\nO((n + n\u03ba/(n + \u221a\u03ba)) log(1/\u03b5)).\n\nNow we give the proof of Theorem 1.\n\nProof of Theorem 1. We denote E[f(x_k) \u2212 f(x*)] by V_k, and we use W_k to denote the last expression\nin (23). Thus, for k \u2265 1, V_k \u2264 W_k. 
For k \u2265 2, we have\n\nW_k = (1 \u2212 \u221a(\u00b5\u03b7))((1 \u2212 \u221a(\u00b5\u03b7))^{k\u22122}(\u03a6_1 \u2212 f)(x*) + pV_1 + \u2211_{l=1}^{k\u22122} (1 \u2212 \u221a(\u00b5\u03b7))^{k\u22122\u2212l} p\u221a(\u00b5\u03b7)V_l)\n  + p\u221a(\u00b5\u03b7)V_{k\u22121} + p\u221a(\u00b5\u03b7)V_1 \u2264 (1 \u2212 \u221a(\u00b5\u03b7)(1 \u2212 p))W_{k\u22121} + p\u221a(\u00b5\u03b7)W_1.\n\nSince 0 < \u221a(\u00b5\u03b7)(1 \u2212 p) < 1, the above inequality leads to\n\nW_k \u2264 ((1 \u2212 (1 \u2212 p)\u221a(\u00b5\u03b7))^{k\u22121} + p/(1 \u2212 p))W_1.   (28)\n\nFrom the strong convexity of g (and f), we can see\n\nW_1 = (1 + p)(f(x\u0303) \u2212 f(x*)) + (\u00b5/2)\u2016x\u0303 \u2212 x*\u2016^2 \u2264 (2 + p)(f(x\u0303) \u2212 f(x*)).\n\nThus, for k \u2265 2, we have\n\nV_k \u2264 W_k \u2264 ((1 \u2212 (1 \u2212 p)\u221a(\u00b5\u03b7))^{k\u22121} + p/(1 \u2212 p))(2 + p)(f(x\u0303) \u2212 f(x*)),\n\nand that is exactly (24). Using log(1 \u2212 \u03b1) \u2264 \u2212\u03b1 and m \u2265 (1/((1 \u2212 p)\u221a(\u00b5\u03b7))) log((1 \u2212 p)/p), we have\n\nlog((1 \u2212 (1 \u2212 p)\u221a(\u00b5\u03b7))^m) \u2264 \u2212m(1 \u2212 p)\u221a(\u00b5\u03b7) \u2264 \u2212log((1 \u2212 p)/p),\n\nso that\n\n(1 \u2212 (1 \u2212 p)\u221a(\u00b5\u03b7))^m \u2264 p/(1 \u2212 p).\n\nThis finishes the proof of Theorem 1.\n\n4 Numerical Experiments\n\n[Figure 2 panels: mnist, covtype.scale, rcv1.binary]\n\nFigure 2: Comparison of Acc-Prox-SVRG with Prox-SVRG and APG. Top: Objective gap of L1\nregularized multi-class logistic regression. Bottom: Test error rates.\n\nIn this section, we compare Acc-Prox-SVRG with Prox-SVRG and APG on L1-regularized multi-class\nlogistic regression with the regularization parameter \u03bb. Table 2 provides details of the datasets\nand regularization parameters utilized in our experiments. These datasets can be found at the LIBSVM\nwebsite^1. 
The best choice of mini-batch size is b = \u2308b_0\u2309, which allows us to take a large\nlearning rate, \u03b7 = 1/(2L). Therefore, we have m \u2265 O(\u221a\u03ba) and \u03b2_k = (\u221a(2\u03ba) \u2212 1)/(\u221a(2\u03ba) + 1). When the number\nof components n is very large compared with \u221a\u03ba, we see that b_0 = O(\u221a\u03ba); for this, we set\nm = \u03b4b (\u03b4 \u2208 {0.1, 1.0, 10}) and \u03b2_k = (b \u2212 2)/(b + 2), varying b in the set {100, 500, 1000}. We ran\nAcc-Prox-SVRG using values of \u03b7 from the set {0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0}, and we chose\nthe best \u03b7 in each mini-batch setting.\n\nFigure 2 compares Acc-Prox-SVRG with Prox-SVRG and APG. The horizontal axis is the number\nof single-component gradient evaluations. For Acc-Prox-SVRG, each iteration computes 2b\ngradients, and at the beginning of each stage, the n component gradients are evaluated. For Prox-SVRG,\neach iteration computes two gradients, and at the beginning of each stage, the n gradients\nare evaluated. For APG, each iteration evaluates n gradients.\n\nTable 2: Details of data sets and regularization parameters.\n\nDataset | Classes | Training size | Testing size | Features | \u03bb\nmnist | 10 | 60,000 | 10,000 | 780 | 10^{\u22125}\ncovtype.scale | 7 | 522,910 | 58,102 | 54 | 10^{\u22126}\nrcv1.binary | 2 | 20,242 | 677,399 | 47,236 | 10^{\u22125}\n\nAs can be seen from Figure 2, Acc-Prox-SVRG with good values of b performs better than or is\ncomparable to Prox-SVRG and is much better than APG. On the other hand, for relatively\nlarge b, Acc-Prox-SVRG may perform worse because of an overestimation of b_0, and hence\nworse estimates of m and \u03b2_k.\n\n5 Conclusion\n\nWe have introduced a method incorporating Nesterov\u2019s acceleration method and a variance reduction\ntechnique of SVRG in the mini-batch setting. 
We prove that the overall complexity of our\nmethod, with an appropriate mini-batch size, is lower than that of both Prox-SVRG and APG; even\nwhen the mini-batch size is not appropriate, our method is still comparable to APG or Prox-SVRG. In\naddition, the gradient evaluations for each mini-batch can be parallelized [3, 16, 17] when using our\nmethod; hence, it performs much faster in a distributed framework.\n\n1 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/\n\nReferences\n\n[1] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717,\n2012.\n\n[2] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss\nminimization. Journal of Machine Learning Research 14, pages 567-599, 2013.\n\n[3] S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent.\nAdvances in Neural Information Processing Systems 26, pages 378-385, 2013.\n\n[4] S. Shalev-Shwartz and T. Zhang. Accelerated Proximal Stochastic Dual Coordinate Ascent for\nRegularized Loss Minimization. Proceedings of the 31st International Conference on Machine\nLearning, pages 64-72, 2014.\n\n[5] N. Le Roux, M. Schmidt, and F. Bach. A Stochastic Gradient Method with an Exponential\nConvergence Rate for Finite Training Sets. Advances in Neural Information Processing Systems\n25, pages 2672-2680, 2012.\n\n[6] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance\nreduction. Advances in Neural Information Processing Systems 26, pages 315-323, 2013.\n\n[7] L. Xiao and T. Zhang. A Proximal Stochastic Gradient Method with Progressive Variance\nReduction. arXiv:1403.4699, 2014.\n\n[8] J. Kone\u010dn\u00fd and P. Richt\u00e1rik. S2GD: Semi-Stochastic Gradient Descent Methods.\narXiv:1312.1666, 2013.\n\n[9] J. Kone\u010dn\u00fd, J. Lu, and P. Richt\u00e1rik. mS2GD: Mini-Batch Semi-Stochastic Gradient Descent in\nthe Proximal Setting. 
NIPS Workshop on OPT2014: Optimization for Machine Learning, 2014.\n\n[10] J. Kone\u010dn\u00fd, Z. Qu, and P. Richt\u00e1rik. S2CD: Semi-Stochastic Coordinate Descent. NIPS Workshop\non OPT2014: Optimization for Machine Learning, 2014.\n\n[11] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With\nSupport for Non-Strongly Convex Composite Objectives. Advances in Neural Information Processing\nSystems 27, pages 1646-1654, 2014.\n\n[12] T. Suzuki. Stochastic Dual Coordinate Ascent with Alternating Direction Method of Multipliers.\nProceedings of the 31st International Conference on Machine Learning, pages 736-744,\n2014.\n\n[13] T. Zhao, M. Yu, Y. Wang, R. Arora, and H. Liu. Accelerated Mini-batch Randomized Block\nCoordinate Descent Method. Advances in Neural Information Processing Systems 27, pages\n3329-3337, 2014.\n\n[14] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston,\n2004.\n\n[15] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion\nPapers, 2007.\n\n[16] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction\nusing mini-batches. Journal of Machine Learning Research 13, pages 165-202, 2012.\n\n[17] A. Agarwal and J. Duchi. Distributed delayed stochastic optimization. Advances in Neural\nInformation Processing Systems 24, pages 873-881, 2011.", "award": [], "sourceid": 825, "authors": [{"given_name": "Atsushi", "family_name": "Nitanda", "institution": "NTT DATA Mathematical Systems Inc."}]}