{"title": "Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 608, "page_last": 617, "abstract": "We develop a new accelerated stochastic gradient method for efficiently solving the convex regularized empirical risk minimization problem in mini-batch settings. The use of mini-batches has become a golden standard in the machine learning community, because the mini-batch techniques stabilize the gradient estimate and can easily make good use of parallel computing. The core of our proposed method is the incorporation of our new ``double acceleration'' technique and variance reduction technique. We theoretically analyze our proposed method and show that our method much improves the mini-batch efficiencies of previous accelerated stochastic methods, and essentially only needs size $\\sqrt{n}$ mini-batches for achieving the optimal iteration complexities for both non-strongly and strongly convex objectives, where $n$ is the training set size. Further, we show that even in non-mini-batch settings, our method achieves the best known convergence rate for non-strongly convex and strongly convex objectives.", "full_text": "Doubly Accelerated\n\nStochastic Variance Reduced Dual Averaging Method\n\nfor Regularized Empirical Risk Minimization\n\nTomoya Murata\n\nNTT DATA Mathematical Systems Inc. 
, Tokyo, Japan
murata@msi.co.jp

Taiji Suzuki
Department of Mathematical Informatics
Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
PRESTO, Japan Science and Technology Agency, Japan
Center for Advanced Integrated Intelligence Research, RIKEN, Tokyo, Japan
taiji@mist.i.u-tokyo.ac.jp

Abstract

We develop a new accelerated stochastic gradient method for efficiently solving the convex regularized empirical risk minimization problem in mini-batch settings. The use of mini-batches has become a golden standard in the machine learning community, because the mini-batch techniques stabilize the gradient estimate and can easily make good use of parallel computing. The core of our proposed method is the incorporation of our new "double acceleration" technique and variance reduction technique. We theoretically analyze our proposed method and show that our method much improves the mini-batch efficiencies of previous accelerated stochastic methods, and essentially only needs size √n mini-batches for achieving the optimal iteration complexities for both non-strongly and strongly convex objectives, where n is the training set size. Further, we show that even in non-mini-batch settings, our method achieves the best known convergence rate for non-strongly convex and strongly convex objectives.

1 Introduction

We consider a composite convex optimization problem associated with regularized empirical risk minimization, which often arises in machine learning. In particular, our goal is to minimize the sum of finite smooth convex functions and a relatively simple (possibly) non-differentiable convex function by using first order methods in mini-batch settings.
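As a concrete instance of this setting, the snippet below builds such a composite objective from a logistic loss plus an elastic-net penalty (one common instantiation in this paper's experiments). It is a minimal sketch with synthetic toy data; all names, sizes, and parameter values are our own illustrations, not anything prescribed by the paper:

```python
import numpy as np

# Minimal sketch of the composite objective P(x) = F(x) + R(x):
# F is an average of n smooth logistic losses; R is an elastic-net penalty.
rng = np.random.default_rng(0)
n, d = 100, 10
A = rng.normal(size=(n, d))          # rows play the role of feature vectors a_i
y = rng.choice([-1.0, 1.0], size=n)  # binary labels b_i
lam1, lam2 = 1e-4, 1e-6              # illustrative regularization weights

def F(x):
    # Smooth part: (1/n) * sum_i log(1 + exp(-b_i * <a_i, x>))
    return np.mean(np.log1p(np.exp(-y * (A @ x))))

def R(x):
    # Simple non-differentiable part: lam1 * ||x||_1 + (lam2 / 2) * ||x||_2^2
    return lam1 * np.abs(x).sum() + 0.5 * lam2 * (x @ x)

def P(x):
    return F(x) + R(x)
```

At x = 0 the smooth part evaluates to log 2, which gives a convenient sanity check on the implementation.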
The use of mini-batches is now a golden standard in the machine learning community, because it is generally more efficient to execute matrix-vector multiplications over a mini-batch than an equivalent amount of vector-vector ones each over a single instance; and more importantly, mini-batch techniques can easily make good use of parallel computing.

Traditional and effective methods for solving the abovementioned problem are the "proximal gradient" (PG) method and "accelerated proximal gradient" (APG) method [10, 3, 20]. These methods are well known to achieve linear convergence for strongly convex objectives. Particularly, APG achieves optimal iteration complexities for both non-strongly and strongly convex objectives. However, these methods need a per iteration cost of O(nd), where n denotes the number of components of the finite sum, and d is the dimension of the solution space. In typical machine learning tasks, n and d correspond to the number of instances and features respectively, which can be very large. Then, the per iteration cost of these methods can be considerably high.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

A popular alternative is the "stochastic gradient descent" (SGD) method [19, 5, 17]. As the per iteration cost of SGD is only O(d) in non-mini-batch settings, SGD is suitable for many machine learning tasks.
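One PG iteration combines a full-gradient step with a proximal mapping of R. The sketch below is a hedged illustration for R = λ1‖·‖1, whose proximal mapping is componentwise soft-thresholding in closed form; the helper names are ours, not the paper's:

```python
import numpy as np

def soft_threshold(v, t):
    # Closed-form prox of t * ||.||_1: componentwise shrinkage toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def pg_step(x, grad_F, eta, lam1):
    # One proximal gradient iteration: x <- prox_{eta*R}(x - eta * grad_F(x)),
    # where grad_F is the full gradient, an O(n*d) computation per step.
    return soft_threshold(x - eta * grad_F(x), eta * lam1)

# Toy usage: for F(x) = 0.5 * ||x - c||^2 and eta = 1, a single step lands on
# the exact minimizer of F + lam1 * ||.||_1, namely soft_threshold(c, lam1).
c = np.array([2.0, -0.5, 0.05])
x_new = pg_step(np.zeros(3), lambda x: x - c, 1.0, 0.1)
```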
However, SGD only achieves sublinear rates and is ultimately slower than PG and APG.

Recently, a number of stochastic gradient methods have been proposed; they use a variance reduction technique that utilizes the finite sum structure of the problem ("stochastic averaged gradient" (SAG) method [15, 16], "stochastic variance reduced gradient" (SVRG) method [6, 22] and SAGA [4]). Even though the per iteration costs of these methods are the same as that of SGD, they achieve linear convergence for strongly convex objectives. Consequently, these methods dramatically improve the total computational cost of PG. However, in size b mini-batch settings, the rate is essentially b times worse than in non-mini-batch settings (the extreme situation is b = n, which corresponds to PG). This means that there is little benefit in applying the mini-batch scheme to these methods.

More recently, several authors have proposed accelerated stochastic methods for the composite finite sum problem ("accelerated stochastic dual coordinate ascent" (ASDCA) method [18], Universal Catalyst (UC) [8], "accelerated proximal coordinate gradient" (APCG) method [9], "stochastic primal-dual coordinate" (SPDC) method [23], and Katyusha [1]). ASDCA (UC), APCG, SPDC and Katyusha essentially achieve the optimal total computational cost1 for strongly convex objectives2 in non-mini-batch settings. However, in size b mini-batch settings, the rate is essentially √b times worse than that in non-mini-batch settings, and these methods need size O(n) mini-batches for achieving the optimal iteration complexity3, which is essentially the same as APG. In addition, [12, 13] has proposed the "accelerated mini-batch proximal stochastic variance reduced gradient" (AccProxSVRG) method and its variant, the "accelerated efficient mini-batch stochastic variance reduced gradient" (AMSVRG) method. In non-mini-batch settings, AccProxSVRG only achieves the same rate as SVRG. However, in mini-batch settings, AccProxSVRG significantly improves the mini-batch efficiency of non-accelerated variance reduction methods, and surprisingly, AccProxSVRG essentially only needs size O(√κ) mini-batches for achieving the optimal iteration complexity for strongly convex objectives, where κ is the condition number of the problem. However, the necessary size of mini-batches depends on the condition number and gradually increases as the condition number grows, ultimately matching O(n) for a large condition number.

Main contribution

We propose a new accelerated stochastic variance reduction method that achieves better convergence than existing methods do, and it particularly takes advantage of mini-batch settings well; it is called the "doubly accelerated stochastic variance reduced dual averaging" (DASVRDA) method. We describe the main feature of our proposed method below and list the comparisons of our method with several preceding methods in Table 1.

Our method significantly improves the mini-batch efficiencies of state-of-the-art methods. As a result, our method essentially only needs size O(√n) mini-batches4 for achieving the optimal iteration complexities for both non-strongly and strongly convex objectives.

1More precisely, the rate of ASDCA (UC) is with extra log-factors, and near but worse than the one of APCG, SPDC and Katyusha.
This means that ASDCA (UC) cannot be optimal.

2Katyusha also achieves a near optimal total computational cost for non-strongly convex objectives.

3We refer to "optimal iteration complexity" as the iteration complexity of deterministic Nesterov's acceleration method [11].

4Actually, when L/ε ≤ n and L/µ ≤ n, our method needs size O(n√(ε/L)) and O(n√(µ/L)) mini-batches, respectively, which are larger than O(√n), but smaller than O(n). Achieving the optimal iteration complexity for solving high accuracy and badly conditioned problems is much more important than doing so for low accuracy and well-conditioned ones, because the former needs more overall computational cost than the latter.

Table 1: Comparisons of our method with SVRG (SVRG++ [2]), ASDCA (UC), APCG, SPDC, Katyusha and AccProxSVRG. n is the number of components of the finite sum, d is the dimension of the solution space, b is the mini-batch size, L is the smoothness parameter of the finite sum, µ is the strong convexity parameter of objectives, and ε is accuracy. "Necessary mini-batch size" indicates the order of the necessary size of mini-batches for achieving the optimal iteration complexities O(√(L/µ)log(1/ε)) and O(√(L/ε)) for strongly and non-strongly convex objectives, respectively. We regard one computation of a full gradient as n/b iterations in size b mini-batch settings, for a fair comparison. "Unattainable" implies that the algorithm cannot achieve the optimal iteration complexity even if it uses size n mini-batches. Õ hides extra log-factors.

µ-strongly convex objectives:

Method      | Total computational cost (size b mini-batch)    | Necessary mini-batch size: L/µ ≥ n | otherwise
SVRG (++)   | O(d(n + bL/µ)log(1/ε))                          | Unattainable | Unattainable
ASDCA (UC)  | Õ(d(n + √(nbL/µ))log(1/ε))                      | Unattainable | Unattainable
APCG        | O(d(n + √(nbL/µ))log(1/ε))                      | O(n) | O(n)
SPDC        | O(d(n + √(nbL/µ))log(1/ε))                      | O(n) | O(n)
Katyusha    | O(d(n + √(nbL/µ))log(1/ε))                      | O(n) | O(n)
AccProxSVRG | O(d(n + ((n−b)/(n−1))(L/µ) + b√(L/µ))log(1/ε))  | O(√(L/µ)) | O(√(L/µ))
DASVRDA     | O(d(n + (b + √n)√(L/µ))log(1/ε))                | O(√n) | O(n√(µ/L))

Non-strongly convex objectives:

Method      | Total computational cost (size b mini-batch)    | Necessary mini-batch size: L/ε ≥ n log²(1/ε) | otherwise
SVRG++      | O(d(n log(1/ε) + bL/ε))                         | Unattainable | Unattainable
ASDCA (UC)  | No direct analysis                              | — | —
APCG        | Õ(d(n log(1/ε) + √(nbL/ε)))                     | O(n) | O(n)
SPDC        | No direct analysis                              | — | —
Katyusha    | Õ(d(n + √(nbL/ε)))                              | O(n) | O(n)
AccProxSVRG | No direct analysis                              | — | —
DASVRDA     | O(d(n log(1/ε) + (b + √n)√(L/ε)))               | O(√n) | Õ(n√(ε/L))

2 Preliminary

In this section, we formally describe the problem to be considered in this paper and the assumptions for our theory.

We use ‖·‖ to denote the Euclidean L2 norm ‖·‖2: ‖x‖ = ‖x‖2 = √(Σi xi²). For a natural number n, [n] denotes the set {1, ..., n}.

In this paper, we consider the following composite convex minimization problem:

    min_{x∈Rd} {P(x) := F(x) + R(x)},    (1)

where F(x) = (1/n) Σ_{i=1}^n fi(x). Here each fi : Rd → R is an Li-smooth convex function and R : Rd → R is a relatively simple and (possibly) non-differentiable convex function. Problems of this form often arise in machine learning and fall under regularized empirical risk minimization (ERM). In ERM problems, we are given n training examples {(ai, bi)}_{i=1}^n, where each ai ∈ Rd is the feature vector of example i, and each bi ∈ R is the label of example i. Important examples of ERM in our setting include linear regression and logistic regression with the Elastic Net regularizer R(x) = λ1‖·‖1 + (λ2/2)‖·‖2² (λ1, λ2 ≥ 0).

We make the following assumptions for our analysis:

Assumption 1. There exists a minimizer x* of (1).

Assumption 2. Each fi is convex, and is Li-smooth, i.e.,

    ‖∇fi(x) − ∇fi(y)‖ ≤ Li‖x − y‖  (∀x, y ∈ Rd).

Assumption 3. Regularization function R is convex, and is relatively simple, which means that computing the proximal mapping of R at y, prox_R(y) = argmin_{x∈Rd} {(1/2)‖x − y‖² + R(x)}, takes O(d) computational cost, for any y ∈ Rd.

We always consider Assumptions 1, 2 and 3 in this paper.

Assumption 4. There exists µ > 0 such that the objective function P is µ-optimally strongly convex, i.e., P has a minimizer and satisfies

    (µ/2)‖x − x*‖² ≤ P(x) − P(x*)  (∀x ∈ Rd, ∀x* ∈ argmin_{x∈Rd} P(x)).

Note that the requirement of optimal strong convexity is weaker than the one of ordinary strong convexity (for the definition of ordinary strong convexity, see [11]).

We further consider Assumption 4 when we deal with strongly convex objectives.

3 Our Approach: Double Acceleration

In this section, we provide high-level ideas of our main contribution, called "double acceleration." First, we consider deterministic PG (Algorithm 1) and (non-mini-batch) SVRG (Algorithm 2). PG is an extension of the steepest descent to proximal settings. SVRG is a stochastic gradient method using the variance reduction technique, which utilizes the finite sum structure of the problem, and it achieves a faster convergence rate than PG does. As SVRG (Algorithm 2) matches PG (Algorithm 1) when the number of inner iterations m equals 1, SVRG can be seen as a generalization of PG. The key element of SVRG is employing a simple but powerful technique called the variance reduction technique for the gradient estimate. The variance reduction of the gradient is realized by setting gk = ∇f_{ik}(x_{k−1}) − ∇f_{ik}(x̃) + ∇F(x̃) rather than the vanilla stochastic gradient ∇f_{ik}(x_{k−1}). Generally, the stochastic gradient ∇f_{ik}(x_{k−1}) is an unbiased estimator of ∇F(x_{k−1}), but it may have high variance. In contrast, gk is also unbiased, and one can show that its variance is "reduced"; that is, the variance converges to zero as x_{k−1} and x̃ converge to x*.

Algorithm 1: PG (x̃0, η, S)
  for s = 1 to S do
    x̃s = One Stage PG(x̃_{s−1}, η).
  end for
  return (1/S) Σ_{s=1}^S x̃s.

Algorithm 2: SVRG (x̃0, η, m, S)
  for s = 1 to S do
    x̃s = One Stage SVRG(x̃_{s−1}, η, m).
  end for
  return (1/S) Σ_{s=1}^S x̃s.

Algorithm 3: One Stage PG (x̃, η)
  x̃+ = prox_{ηR}(x̃ − η∇F(x̃)).
  return x̃+.

Algorithm 4: One Stage SVRG (x̃, η, m)
  x0 = x̃.
  for k = 1 to m do
    Pick ik ∈ [1, n] randomly.
    gk = ∇f_{ik}(x_{k−1}) − ∇f_{ik}(x̃) + ∇F(x̃).
    xk = prox_{ηR}(x_{k−1} − ηgk).
  end for
  return (1/m) Σ_{k=1}^m xk.

Algorithm 5: APG (x̃0, η, S)
  x̃_{−1} = x̃0, θ̃0 = 0.
  for s = 1 to S do
    θ̃s = (s+1)/2, ỹs = x̃_{s−1} + ((θ̃_{s−1} − 1)/θ̃s)(x̃_{s−1} − x̃_{s−2}).
    x̃s = One Stage PG(ỹs, η).
  end for
  return x̃S.

Next, we explain the method of accelerating SVRG and obtaining an even faster convergence rate based on our new but quite natural idea, "outer acceleration." First, we would like to remind you that the procedure of deterministic APG is given as described in Algorithm 5. APG uses the famous "momentum" scheme and achieves the optimal iteration complexity. Our natural idea is replacing One Stage PG in Algorithm 5 with One Stage SVRG.
With slight modifications, we can show that this algorithm improves the rates of PG, SVRG and APG, and is optimal. We call this new algorithm outerly accelerated SVRG. However, this algorithm has poor mini-batch efficiency, because in size b mini-batch settings, the rate of this algorithm is essentially √b times worse than that of non-mini-batch settings. The state-of-the-art methods APCG, SPDC, and Katyusha also suffer from the same problem in the mini-batch setting.

Now, we illustrate that for improving the mini-batch efficiency, using the "inner acceleration" technique is beneficial. The author of [12] has proposed AccProxSVRG in mini-batch settings. AccProxSVRG applies the momentum scheme to One Stage SVRG, and we call this technique "inner" acceleration. He showed that the inner acceleration could significantly improve the mini-batch efficiency of vanilla SVRG. This fact indicates that inner acceleration is essential to fully utilize the mini-batch settings. However, AccProxSVRG is not a truly accelerated method, because in non-mini-batch settings, the rate of AccProxSVRG is the same as that of vanilla SVRG.

In this way, we arrive at our main high-level idea, called "double" acceleration, which involves applying the momentum scheme to both the outer and inner algorithms. This enables us not only to achieve the optimal total computational cost in non-mini-batch settings, but also to improve the mini-batch efficiency of vanilla acceleration methods.

We have considered SVRG and its accelerations so far; however, we actually adopt stochastic variance reduced dual averaging (SVRDA) rather than SVRG itself, because we can construct lazy update rules of (innerly) accelerated SVRDA for sparse data.
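The variance-reduced gradient estimate g_k that all of these methods build on can be sketched in a few lines. The one-dimensional quadratic components below are our own toy example, chosen so that both unbiasedness and the vanishing variance near the snapshot are easy to see:

```python
import numpy as np

def svrg_grad(grad_fi, i, x, x_snap, full_grad_snap):
    # g = grad f_i(x) - grad f_i(x_snap) + grad F(x_snap).
    # Unbiased over i, and its variance shrinks as x and x_snap approach
    # the minimizer, unlike the plain stochastic gradient grad f_i(x).
    return grad_fi(i, x) - grad_fi(i, x_snap) + full_grad_snap

# Toy components f_i(x) = 0.5 * a_i * x^2, so grad f_i(x) = a_i * x.
a = np.array([1.0, 2.0, 3.0, 4.0])
grad_fi = lambda i, x: a[i] * x
grad_F = lambda x: a.mean() * x

x_snap = 0.8                 # snapshot point (x tilde)
mu = grad_F(x_snap)          # full gradient stored at the snapshot
ests = [svrg_grad(grad_fi, i, 0.5, x_snap, mu) for i in range(4)]
# The estimates average exactly to grad_F(0.5) (unbiasedness), and at
# x == x_snap every component index yields the same value mu.
```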
In Section G of the supplementary material, we briefly discuss an SVRG version of our proposed method and provide its convergence analysis.

4 Algorithm Description

In this section, we describe the concrete procedure of the proposed algorithm in detail.

4.1 DASVRDA for non-strongly convex objectives

Algorithm 6: DASVRDA^ns (x̃0, z̃0, γ, {Li}_{i=1}^n, m, b, S)
  x̃_{−1} = z̃0, θ̃0 = 0, L̄ = (1/n) Σ_{i=1}^n Li, Q = {qi} = {Li/(nL̄)}, η = 1/((1 + γ(m+1)/b)L̄).
  for s = 1 to S do
    θ̃s = (1 − 1/γ)(s+2)/2, ỹs = x̃_{s−1} + ((θ̃_{s−1} − 1)/θ̃s)(x̃_{s−1} − x̃_{s−2}) + (θ̃_{s−1}/θ̃s)(z̃_{s−1} − x̃_{s−1}).
    (x̃s, z̃s) = One Stage AccSVRDA(ỹs, x̃_{s−1}, η, m, b, Q).
  end for
  return x̃S.

Algorithm 7: One Stage AccSVRDA (ỹ, x̃, η, m, b, Q)
  x0 = z0 = ỹ, ḡ0 = 0, θ0 = 1/2.
  for k = 1 to m do
    Pick i¹_k, ..., i^b_k ∈ [1, n] independently according to Q, and set Ik = {iℓ_k}_{ℓ=1}^b.
    θk = (k+1)/2, yk = (1 − 1/θk)x_{k−1} + (1/θk)z_{k−1}.
    gk = (1/b) Σ_{i∈Ik} (1/(nqi))(∇fi(yk) − ∇fi(x̃)) + ∇F(x̃), ḡk = (1 − 1/θk)ḡ_{k−1} + (1/θk)gk.
    zk = argmin_{z∈Rd} {⟨ḡk, z⟩ + R(z) + (1/(2ηθkθ_{k−1}))‖z − z0‖²} = prox_{ηθkθ_{k−1}R}(z0 − ηθkθ_{k−1}ḡk).
    xk = (1 − 1/θk)x_{k−1} + (1/θk)zk.
  end for
  return (xm, zm).

We provide details of the doubly accelerated SVRDA (DASVRDA) method for non-strongly convex objectives in Algorithm 6. Our momentum step is slightly different from that of vanilla deterministic accelerated methods: we not only add the momentum term ((θ̃_{s−1} − 1)/θ̃s)(x̃_{s−1} − x̃_{s−2}) to the current solution x̃_{s−1}, but also add the term (θ̃_{s−1}/θ̃s)(z̃_{s−1} − x̃_{s−1}), where z̃_{s−1} is the current more "aggressively" updated solution rather than x̃_{s−1}; thus, this term also can be interpreted as momentum5. Then, we feed ỹs to One Stage Accelerated SVRDA (Algorithm 7) as an initial point. Algorithm 6 can be seen as a direct generalization of APG, because if we set m = 1, One Stage Accelerated SVRDA is essentially the same as one iteration of PG with initial point ỹs; then we can see that z̃s = x̃s, and Algorithm 6 essentially matches deterministic APG. Next, we move to One Stage Accelerated SVRDA (Algorithm 7). Algorithm 7 is essentially a combination of the "accelerated regularized dual averaging" (AccSDA) method [21] with the variance reduction technique of SVRG. It updates zk by using the weighted average of all past variance reduced gradients ḡk instead of only using a single variance reduced gradient gk. Note that for constructing the variance reduced gradient gk, we use the full gradient of F at x̃_{s−1} rather than at the initial point ỹs. Adoption of (innerly) accelerated SVRDA rather than (innerly) accelerated SVRG enables us to construct lazy updates for sparse data (for more details, see Section E of the supplementary material).

5This form also arises in Monotone APG [7]. In Algorithm 7, x̃ = xm can be rewritten as (2/(m(m+1))) Σ_{k=1}^m k·zk, which is a weighted average of the zk; thus, we can say that z̃ is updated more "aggressively" than x̃. For the outerly accelerated SVRG (that is, a combination of Algorithm 6 with vanilla SVRG; see Section 3), z̃ and x̃ correspond to xm and (1/m) Σ_{k=1}^m xk in [22], respectively. Thus, we can also see that z̃ is updated more "aggressively" than x̃.

4.2 DASVRDA for strongly convex objectives

Algorithm 8: DASVRDA^sc (x̌0, γ, {Li}_{i=1}^n, m, b, S, T)
  for t = 1 to T do
    x̌t = DASVRDA^ns(x̌_{t−1}, x̌_{t−1}, γ, {Li}_{i=1}^n, m, b, S).
  end for
  return x̌T.

Algorithm 8 is our proposed method for strongly convex objectives. Instead of directly accelerating the algorithm using a constant momentum rate, we restart Algorithm 6. The restarting scheme has several advantages, both theoretical and practical. First, the restarting scheme only requires the optimal strong convexity of the objective instead of ordinary strong convexity, whereas non-restarting accelerated algorithms essentially require the ordinary strong convexity of the objective. Second, for restarting algorithms, we can utilize adaptive restart schemes [14].
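In code, the fixed-restart outer loop of Algorithm 8 is just a few lines; `inner_solver` below is a hypothetical stand-in for a DASVRDA^ns call with all parameters other than the two warm-start points held fixed:

```python
def dasvrda_sc(x0, inner_solver, T):
    # Restart wrapper (Algorithm 8): repeatedly run the non-strongly-convex
    # routine, warm-starting both of its input sequences at the last output.
    x = x0
    for _ in range(T):
        x = inner_solver(x, x)  # stands in for DASVRDA_ns(x, x, ...)
    return x

# Toy usage: an inner solver that halves the distance to the minimizer 0
# yields the geometric (linear) convergence the wrapper is designed for.
halve = lambda x_tilde, z_tilde: 0.5 * x_tilde
x_out = dasvrda_sc(8.0, halve, 3)  # 8 -> 4 -> 2 -> 1
```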
The adaptive restart schemes have been originally proposed for deterministic cases. The schemes are heuristic but quite effective empirically. The most fascinating property of these schemes is that we need not prespecify the strong convexity parameter µ, and the algorithms adaptively determine the restart timings. [14] has proposed two heuristic adaptive restart schemes: the function scheme and the gradient scheme. We can easily apply these ideas to our method, because our method is a direct generalization of the deterministic APG. For the function scheme, we restart Algorithm 6 if P(x̃s) > P(x̃_{s−1}). For the gradient scheme, we restart the algorithm if (ỹs − x̃s)ᵀ(ỹ_{s+1} − x̃s) > 0. Here ỹs − x̃s can be interpreted as a "one stage" gradient mapping of P at ỹs. As ỹ_{s+1} − x̃s is the momentum, this scheme can be interpreted such that we restart whenever the momentum and the negative one stage gradient mapping form an obtuse angle (this means that the momentum direction seems to be "bad"). We numerically demonstrate the effectiveness of these schemes in Section 6.

Parameter tunings

For DASVRDA^ns, only the learning rate η needs to be tuned, because we can theoretically obtain the optimal choice of γ, and we can naturally use m = n/b as a default epoch length (see Section 5). For DASVRDA^sc, both the learning rate η and the fixed restart interval S need to be tuned.

5 Convergence Analysis of DASVRDA Method

In this section, we provide the convergence analysis of our algorithms. Unless otherwise specified, serial computation is assumed. First, we consider the DASVRDA^ns algorithm.

Theorem 5.1. Suppose that Assumptions 1, 2 and 3 hold. Let x̃0, z̃0 ∈ Rd, γ ≥ 3, m ∈ N, b ∈ [n] and S ∈ N. Then DASVRDA^ns(x̃0, z̃0, γ, {Li}_{i=1}^n, m, b, S) satisfies

    E[P(x̃S) − P(x*)] ≤ (4/((1 − 1/γ)²(S + 2)²)) ((1 − 1/γ)²(P(x̃0) − P(x*)) + (2/(η(m + 1)m))‖z̃0 − x*‖²).

The proof of Theorem 5.1 is found in the supplementary material (Section A). We can easily see that the optimal choice of γ is (3 + √(9 + 8b/(m + 1)))/2 = O(1 + b/m) (see Section B of the supplementary material). We denote this value as γ*. From Theorem 5.1, we obtain the following corollary:

Corollary 5.2. Suppose that Assumptions 1, 2, and 3 hold. Let x̃0 ∈ Rd, γ = γ*, m ∝ n/b and b ∈ [n]. If we appropriately choose S = O(√((P(x̃0) − P(x*))/ε) + (1/m + 1/√(mb))√(L̄‖x̃0 − x*‖²/ε)), then the total computational cost of DASVRDA^ns(x̃0, γ*, {Li}_{i=1}^n, m, b, S) for E[P(x̃S) − P(x*)] ≤ ε is

    O(d(n√((P(x̃0) − P(x*))/ε) + (b + √n)√(L̄‖x̃0 − x*‖²/ε))).

Remark. If we adopt a warm start scheme for DASVRDA^ns, we can further improve the rate to

    O(d(n log((P(x̃0) − P(x*))/ε) + (b + √n)√(L‖x̃0 − x*‖²/ε)))

(see Sections C and D of the supplementary material).

Next, we analyze the DASVRDA^sc algorithm for optimally strongly convex objectives.
Combining Theorem 5.1 with the optimal strong convexity of the objective function immediately yields the following theorem, which implies that the DASVRDA^sc algorithm achieves linear convergence.

Theorem 5.3. Suppose that Assumptions 1, 2, 3 and 4 hold. Let x̌0 ∈ Rd, γ = γ*, m ∈ N, b ∈ [n] and T ∈ N. Define ρ := 4{(1 − 1/γ*)² + 4/(η(m + 1)mµ)}/{(1 − 1/γ*)²(S + 2)²}. If S is sufficiently large such that ρ ∈ (0, 1), then DASVRDA^sc(x̌0, γ*, {Li}_{i=1}^n, m, b, S, T) satisfies

    E[P(x̌T) − P(x*)] ≤ ρ^T [P(x̌0) − P(x*)].

From Theorem 5.3, we have the following corollary.

Corollary 5.4. Suppose that Assumptions 1, 2, 3 and 4 hold. Let x̌0 ∈ Rd, γ = γ*, m ∝ n/b, b ∈ [n]. There exists S = O(1 + (b/n + 1/√n)√(L̄/µ)) such that 1/log(1/ρ) = O(1). Moreover, if we appropriately choose T = O(log((P(x̌0) − P(x*))/ε)), then the total computational cost of DASVRDA^sc(x̌0, γ*, {Li}_{i=1}^n, m, b, S, T) for E[P(x̌T) − P(x*)] ≤ ε is

    O(d(n + (b + √n)√(L̄/µ)) log((P(x̌0) − P(x*))/ε)).

Remark. Corollary 5.4 implies that if the mini-batch size b is O(√n), DASVRDA^sc(x̌0, γ*, {Li}_{i=1}^n, n/b, b, S, T) still achieves the total computational cost of O(d(n + √(nL̄/µ))log(1/ε)), which is much better than the O(d(n + √(nbL̄/µ))log(1/ε)) of APCG, SPDC, and Katyusha.

Remark. Corollary 5.4 also implies that DASVRDA^sc only needs size O(√n) mini-batches for achieving the optimal iteration complexity O(√(L/µ)log(1/ε)) when L/µ ≥ n. In contrast, APCG, SPDC and Katyusha need size O(n) mini-batches, and AccProxSVRG needs size O(√(L/µ)) mini-batches, for achieving the optimal iteration complexity. Note that even when L/µ ≤ n, our method only needs size O(n√(µ/L)) mini-batches6. This size is smaller than the O(n) of APCG, SPDC, and Katyusha, and the same as that of AccProxSVRG.

6Note that the required size is O(n√(µ/L)) (≤ O(n)), which is not O(n√(L/µ)) ≥ O(n).

6 Numerical Experiments

In this section, we provide numerical experiments to demonstrate the performance of DASVRDA. We numerically compare our method with several well-known stochastic gradient methods in mini-batch settings: SVRG [22] (and SVRG++ [2]), AccProxSVRG [12], Universal Catalyst [8], APCG [9], and Katyusha [1]. The details of the implemented algorithms and their parameter tunings are found in the supplementary material. In the experiments, we focus on the regularized logistic regression problem for binary classification, with regularizer λ1‖·‖1 + (λ2/2)‖·‖2².

We used three publicly available data sets in the experiments. Their sizes n and dimensions d, and common mini-batch sizes b for all implemented algorithms, are listed in Table 2.

Table 2: Summary of the data sets and mini-batch size used in our numerical experiments

Data sets |    n   |    d   |  b
a9a       | 32,561 |    123 | 180
rcv1      | 20,242 | 47,236 | 140
sido0     | 12,678 |  4,932 | 100

For regularization parameters, we used three settings (λ1, λ2) = (10^{-4}, 0), (10^{-4}, 10^{-6}), and (0, 10^{-6}).
For the former case, the objective is non-strongly convex, and for the latter two cases,\n\n6Note that the required size is O(n(cid:112)\u00b5/L)(\u2264 O(n)), which is not O(n(cid:112)L/\u00b5) \u2265 O(n).\n\n7\n\n\f(a) a9a, (\u03bb1, \u03bb2) = (10\u22124, 0)\n\n(b) a9a, (\u03bb1, \u03bb2) = (10\u22124, 10\u22126)\n\n(c) a9a, (\u03bb1, \u03bb2) = (0, 10\u22126)\n\n(d) rcv1, (\u03bb1, \u03bb2) = (10\u22124, 0)\n\n(e) rcv1, (\u03bb1, \u03bb2) = (10\u22124, 10\u22126)\n\n(f) rcv1, (\u03bb1, \u03bb2) = (0, 10\u22126)\n\n(g) sido0, (\u03bb1, \u03bb2) = (10\u22124, 0)\n\n(h) sido0, (\u03bb1, \u03bb2) = (10\u22124, 10\u22126)\n\n(i) sido0, (\u03bb1, \u03bb2) = (0, 10\u22126)\n\nFigure 1: Comparisons on a9a (top), rcv1 (middle) and sido0 (bottom) data sets, for regularization\nparameters (\u03bb1, \u03bb2) = (10\u22124, 0) (left), (\u03bb1, \u03bb2) = (10\u22124, 10\u22126) (middle) and (\u03bb1, \u03bb2) = (0, 10\u22126)\n(right).\n\nthe objectives are strongly convex. Note that for the latter two cases, the strong convexity of the\nobjectives is \u00b5 = 10\u22126 and is relatively small; thus, it makes acceleration methods bene\ufb01cial.\nFigure 1 shows the comparisons of our method with the different methods described above on several\nsettings. \u201cObjective Gap\u201d means P (x) \u2212 P (x\u2217) for the output solution x. \u201cElapsed Time [sec]\u201d\nmeans the elapsed CPU time (sec). \u201cRestart_DASVRDA\u201d means DASVRDA with heuristic adaptive\nrestarting (Section 4). We can observe the following from these results:\n\n\u2022 Our proposed DASVRDA and Restart DASVRDA signi\ufb01cantly outperformed all the other\n\nmethods overall.\n\n\u2022 DASVRDA with the heuristic adaptive restart scheme ef\ufb01ciently made use of the local\nstrong convexities of non-strongly convex objectives and signi\ufb01cantly outperformed vanilla\nDASVRDA. 
For the other settings, the restarted algorithm was still comparable to vanilla DASVRDA.

• UC + AccProxSVRG^7 outperformed vanilla AccProxSVRG but was outperformed by our methods overall.

• APCG sometimes performed unstably and was outperformed by vanilla SVRG. On the sido0 data set, for the ridge setting, APCG significantly outperformed all the other methods.

• Katyusha always outperformed vanilla SVRG, but was significantly outperformed by our methods.

^7 Although there is no theoretical guarantee for UC + AccProxSVRG, we thought it fair to include experimental results for it, because UC + AccProxSVRG performs better than vanilla AccProxSVRG. Through some theoretical analysis, we can prove that UC + AccProxSVRG also has a rate and mini-batch efficiency similar to those of our proposed method, although these results have not appeared in the literature. However, our proposed method is superior to this algorithm both theoretically and practically, because the algorithm has several drawbacks due to the use of UC. First, the algorithm has an additional logarithmic factor in its convergence rate; this factor is generally not negligible and slows down its practical performance. Second, the algorithm has more tuning parameters than our method. Third, the stopping criterion of each sub-problem of UC is hard to tune.

7 Conclusion

In this paper, we developed a new accelerated stochastic variance reduced gradient method for regularized empirical risk minimization problems in mini-batch settings: DASVRDA. We have shown that DASVRDA achieves total computational costs of O(d(n log(1/ε) + (b + √n)√(L/ε))) and O(d(n + (b + √n)√(L/µ)) log(1/ε)) in size-b mini-batch settings for non-strongly convex and optimally strongly convex objectives, respectively.
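These size-b costs are tied to the O(√n) mini-batch size that suffices for the optimal iteration complexities, and the b values in Table 2 were in fact chosen close to √n for each data set. A quick arithmetic check (our own illustration, not from the paper):

```python
import math

# (n, b) pairs taken from Table 2: a9a, rcv1, sido0
datasets = {"a9a": (32561, 180), "rcv1": (20242, 140), "sido0": (12678, 100)}

for name, (n, b) in datasets.items():
    root = math.sqrt(n)
    # b is on the order of sqrt(n), matching the O(sqrt(n)) mini-batch
    # size that suffices for the optimal iteration complexity
    print(f"{name}: sqrt(n) = {root:.1f}, b = {b}, b/sqrt(n) = {b / root:.2f}")
```

For all three data sets the ratio b/√n stays within roughly 0.85 to 1.0.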
In addition, DASVRDA essentially achieves the optimal iteration complexities with only size O(√n) mini-batches for both settings. In the numerical experiments, our method significantly outperformed state-of-the-art methods, including Katyusha and AccProxSVRG.

Acknowledgment

This work was partially supported by MEXT kakenhi (25730013, 25120012, 26280009 and 15H05707), JST-PRESTO and JST-CREST.

References

[1] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In 48th Annual ACM Symposium on the Theory of Computing, pages 19–23, 2017.

[2] Z. Allen-Zhu and Y. Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of the 33rd International Conference on Machine Learning, pages 1080–1089, 2016.

[3] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[4] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014.

[5] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

[6] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.

[7] H. Li and Z. Lin. Accelerated proximal gradient methods for nonconvex programming. In Advances in Neural Information Processing Systems 28, pages 379–387, 2015.

[8] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28, pages 3384–3392, 2015.

[9] Q. Lin, Z. Lu, and L. Xiao.
An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems 27, pages 3059–3067, 2014.

[10] Y. Nesterov. Gradient methods for minimizing composite objective function. Mathematical Programming, 140(1):125–161, 2013.

[11] Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

[12] A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems 27, pages 1574–1582, 2014.

[13] A. Nitanda. Accelerated stochastic gradient descent for minimizing finite sums. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 195–203, 2016.

[14] B. O'Donoghue and E. Candes. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.

[15] N. L. Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2663–2671, 2012.

[16] M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83–112, 2017.

[17] S. Shalev-Shwartz and Y. Singer. Logarithmic regret algorithms for strongly convex repeated games. Technical report, The Hebrew University, 2007.

[18] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.

[19] Y. Singer and J. C. Duchi. Efficient learning using forward-backward splitting. In Advances in Neural Information Processing Systems 22, pages 495–503, 2009.

[20] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization.
Technical report, University of Washington, Seattle, 2008.

[21] L. Xiao. Dual averaging method for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 22, pages 2116–2124, 2009.

[22] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[23] Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 353–361, 2015.