{"title": "General Proximal Incremental Aggregated Gradient Algorithms: Better and Novel Results under General Scheme", "book": "Advances in Neural Information Processing Systems", "page_first": 996, "page_last": 1006, "abstract": "The incremental aggregated gradient algorithm is popular in network optimization\nand machine learning research. However, the current convergence results require\nthe objective function to be strongly convex. And the existing convergence rates\nare also limited to linear convergence. Due to the mathematical techniques, the\nstepsize in the algorithm is restricted by the strongly convex constant, which may\nmake the stepsize be very small (the strongly convex constant may be small).\n\nIn this paper, we propose a general proximal incremental aggregated gradient algorithm, which contains various existing algorithms including the basic incremental aggregated gradient method. Better and new convergence results are proved even with the general scheme. The novel results presented in this paper, which have not appeared in previous literature, include: a general scheme, nonconvex analysis, the sublinear convergence rates of the function values, much larger stepsizes that guarantee the convergence, the convergence when noise exists, the line search strategy of the proximal incremental aggregated gradient algorithm and its convergence.", "full_text": "General Proximal Incremental Aggregated Gradient\nAlgorithms: Better and Novel Results under General\n\nScheme\u2217\n\nTao Sun\n\nCollege of Computer\n\nNational University of Defense Technology\n\nChangsha, Hunan 410073, China\n\nYuejiao Sun\n\nDepartment of Mathematics\n\nUniversity of California, Los Angeles\n\nLos Angeles, CA 90095, USA\n\nnudtsuntao@163.com\n\nsunyj@math.ucla.edu\n\nDongsheng Li\u2020\n\nCollege of Computer\n\nNational University of Defense Technology\n\nChangsha, Hunan 410073, China\n\nQing Liao\n\nDepartment of Computer Science & Technology\n\nHarbin Institute of Technology 
(Shenzhen)

Shenzhen, Guangdong 518055, China

dsli@nudt.edu.cn

liaoqing@hit.edu.cn

Abstract

The incremental aggregated gradient algorithm is popular in network optimization
and machine learning research. However, existing convergence results require the
objective function to be strongly convex, and the known convergence rates are
limited to linear convergence. Moreover, because of the proof techniques used, the
stepsize is restricted by the strong convexity constant, which can force the stepsize
to be very small (the strong convexity constant may be small).
In this paper, we propose a general proximal incremental aggregated gradient algorithm, which contains various existing algorithms including the basic incremental aggregated gradient method. Better and new convergence results are proved even under this general scheme. The novel results in this paper, which have not appeared in previous literature, include: a general scheme, nonconvex analysis, sublinear convergence rates of the function values, much larger stepsizes that guarantee convergence, convergence in the presence of noise, and a line search strategy for the proximal incremental aggregated gradient algorithm together with its convergence.

1 Introduction

Many problems in machine learning and network optimization can be formulated as

min_x {F(x) = f(x) + g(x)},    (1)

where f(x) = ∑_{i=1}^m f_i(x), x ∈ R^n, each f_i is differentiable, ∇f_i is Lipschitz continuous with constant L_i for i = 1, 2, . . . , m, and g is proximable. A state-of-the-art method for this problem is the proximal gradient method [10], which requires computing the full gradient of f in each iteration. However, when the number of component functions f_i is very large, i.e. 
m ≫ 1, it is costly to obtain the full gradient ∇f; on the other hand, in some network settings, calculating the full gradient is not possible either. Thus, incremental gradient algorithms have been developed to avoid computing the full gradient.

∗This work is sponsored in part by the National Key R&D Program of China under Grant No. 2018YFB0204300 and the National Natural Science Foundation of China under Grants 61932001 and 61906200.

†Dongsheng Li is the corresponding author.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The main idea of incremental gradient descent is to compute the gradients of only some components of f to refresh the full gradient. Precisely, in each iteration, it selects an index set S from {1, 2, . . . , m} and then computes ∑_{i∈S} ∇f_i to update the full gradient. This requires much less computation than gradient descent without losing too much accuracy of the true gradient. It is natural to consider two index selection strategies: deterministic and stochastic. In fact, all incremental gradient algorithms for solving problem (1) can be labeled as one of these two routines.

1.1 The general PIAG algorithm

Let x^i denote the i-th iterate. We first define a σ-algebra χ_k := σ(x^1, x^2, . . . , x^k). Consider a general proximal incremental aggregated gradient algorithm which performs as

    E(v^k | χ_k) = ∑_{i=1}^m ∇f_i(x^{k−τ_{i,k}}) + e^k,
    x^{k+1} = prox_{γ_k g}[x^k − γ_k v^k],    (2)

where τ_{i,k} is the delay associated with f_i and e^k is the noise in the k-th iteration. The first equation in (2) indicates that v^k is an approximation, in expectation, of the full gradient ∇f with delays and noise. 
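To make the scheme concrete, the noiseless deterministic instance of (2) with cyclic index selection can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the ℓ1 regularizer with its soft-thresholding prox and the structure of the `grads` callables are assumptions made for the example.

```python
import numpy as np

def soft_threshold(z, t):
    # componentwise prox of t*||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def piag(grads, x0, gamma, lam, iters):
    """Deterministic PIAG for min_x sum_i f_i(x) + lam*||x||_1.

    grads : list of callables, grads[i](x) = gradient of f_i at x.
    The stored table entries play the role of the delayed terms
    grad f_i(x^{k - tau_{i,k}}) in scheme (2)."""
    m = len(grads)
    x = x0.copy()
    table = [g(x) for g in grads]      # delayed gradient table
    v = np.sum(table, axis=0)          # aggregated (approximate full) gradient
    for k in range(iters):
        i = k % m                      # cyclic index selection
        new_gi = grads[i](x)
        v += new_gi - table[i]         # refresh only one component
        table[i] = new_gi
        x = soft_threshold(x - gamma * v, gamma * lam)  # prox step
    return x
```

With the cyclic rule the delay bound is τ = m, so a stepsize of the form γ = 2c/((2τ+1)L) with 0 < c < 1, as analyzed below for convex g, applies.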
For simplicity, we call this algorithm the general PIAG algorithm.

1.2 Literature review

As mentioned before, the literature can be divided into two classes by the index selection strategy.

On the deterministic road: Bertsekas proposed the Incremental Gradient (IG) method for problem (1) when g ≡ 0 [4]. To converge, the IG method requires diminishing stepsizes, even for smooth and strongly convex functions [5]. A special condition was proposed in [26] to relax this requirement. A second-order IG method was developed in [13]. An improved version of IG is the Incremental Aggregated Gradient (IAG) method [6, 29]. When each f_i is quadratic, convergence based on a perturbation analysis of the eigenvalues of a periodic dynamic linear system was given in [6]. Global convergence is proved in [29]; and if a local Lipschitzian error condition and local strong convexity are satisfied, local linear convergence can also be proved. [1] established lower complexity bounds for IAG on problem (1) with g ≡ 0. Linear convergence rates are proved under a strong convexity assumption [12, 30].

On the stochastic road: The pioneer of this class is the Stochastic Gradient Descent (SGD) method [20], which picks i from {1, 2, . . . , m} in each iteration with uniform probability and uses m∇f_i in place of ∇f. However, SGD requires diminishing stepsizes, which makes its performance poor in both theory and practice. Due to the large deviation of m∇f_i from ∇f, "variance reduction" schemes were proposed later, such as the SVRG [14], SAG [25], and SAGA [11] methods. 
With selected constant stepsizes, linear convergence has been proved in the strongly convex case, and ergodic sublinear convergence has been proved in the non-strongly convex case.

1.3 Relations with existing algorithms

In this part, we present several popular existing algorithms that are covered by the general PIAG.

E.1. (Inexact) Proximal Gradient Descent Algorithm: When τ_{i,k} ≡ 0 and the expectation vanishes, the general PIAG is equivalent to x^{k+1} = prox_{γ_k g}[x^k − γ_k(∑_{i=1}^m ∇f_i(x^k) + e^k)].

E.2. (Inexact) Proximal Incremental Aggregated Gradient Algorithm: In the k-th iteration, pick i_k essentially cyclically and then update x^{k+1} as x^{k+1} = prox_{γ_k g}[x^k − γ_k(∇f_{i_k}(x^k) + ∑_{i≠i_k} ∇f_i(x^{k−τ_{i,k}}) + e^k)], where

    τ_{i,k+1} = τ_{i,k} + 1 if i ≠ i_k,
    τ_{i,k+1} = 1 if i = i_k.    (3)

In each iteration, one just needs to compute ∇f_{i_k}(x^k); the term ∑_{i≠i_k} ∇f_i(x^{k−τ_{i,k}}) is kept in memory.

E.3. Deterministic SAG (SAGA): Let (τ_{i,k})_{i∈[N],k≥0} be defined as in (3), pick i_k essentially cyclically, then update x^{k+1} as

    v^{k+1} = v^k − ∇f_{i_k}(x^{k−τ_{i_k,k}}) + ∇f_{i_k}(x^k),
    x^{k+1} = prox_{γ_k g}(x^k − γ_k v^{k+1}).

E.4. Deterministic SVRG: Pick i_k cyclically, i.e., i_k ≡ k (mod m), let t = ⌊k/m⌋, and take x̃ to be any element of {x^{mt}, x^{mt+1}, . . . , x^{mt+t−1}}. Then update x^{k+1} as x^{k+1} = prox_{γ_k g}[x^k − γ(∇f_{i_k}(x^k) − ∇f_{i_k}(x̃) + ∇f(x̃))].

E.5. Decentralized Parallel Stochastic Gradient Descent (DPSGD): The algorithm is proposed in [17] to solve min_X {D(X) := ∑_{j=1}^n ∑_{i∈S_j} f_i(x_j) + (1/2)‖(I − W)^{1/2} X‖_F²}, where X := [x_1, x_2, . . . , x_n]^⊤, W is the mixing matrix, and S_j is the neighbour set of node j. In each iteration, DPSGD computes a stochastic gradient of ∑_{i∈S_j} f_i(x_j) at node j, and then computes the neighborhood weighted average to update the local variable.

E.6. Forward-backward splitting by parameter server computing: The forward-backward splitting for problem (1) can be implemented on a parameter server computing model, in which the worker nodes communicate synchronously with the parameter server but not directly with each other. Node i computes ∇f_i(x̂^k) and sends the result to the parameter server, where x̂^k is read from shared memory and lags x^k by a bounded delay. The parameter server collects the gradients and updates the iterate (performing the proximal map computation). This algorithm is a special case of the general PIAG. More details about parameter server computing can be found in [24].

E.7. LAG: Lazily aggregated gradient: This algorithm is also designed for parameter server computing. Different from E.6, the main motivation of this algorithm is to reduce communication. In this setting, the parameter server broadcasts the current iterate x^k to the workers, which is cheap, while transmitting data from a worker to the parameter server is costly. In this case, the authors of [9] propose LAG, whose core idea is to disable the data feedback from a worker to the parameter server if its gradient changes only slightly. It is not difficult to verify that the general PIAG contains LAG.

1.4 Contribution

From the perspective of algorithms, this paper proposes the general PIAG, which not only covers various classical PIAG-like algorithms, including inexact schemes, but also yields novel algorithms. From the perspective of theory, we establish better and novel results compared with previous literature. 
Specifically, the contributions of this paper can be summarized as follows:

I. General scheme: We propose a general PIAG algorithm, which covers various classical algorithms in network optimization, distributed optimization, and machine learning. We unify all these algorithms into one mathematical scheme.

II. Novel algorithm: We apply a line search strategy to PIAG and prove its convergence. The numerical results demonstrate its efficiency.

III. Novel proof technique: Compared with previous convergence analyses of PIAG, we use a new proof technique: Lyapunov function analysis. Thanks to this, we can establish much stronger theoretical results under a more general scheme. The Lyapunov function analysis is used throughout the paper for both the convex and nonconvex cases.

IV. Better theoretical results: Previous convergence results for PIAG are restricted to the strongly convex case, with a stepsize that depends on the strong convexity constant. We get rid of this constant and still guarantee linear convergence with much larger stepsizes, even under a weaker assumption. For the cyclic PIAG, the stepsize can be half that of the gradient descent algorithm.

V. Novel theoretical results: Many new results are proved in this paper. We list them as follows:

• V.1. The convergence of nonconvex PIAG is studied, and in the expectation-free case the sequence convergence is proved under the semi-algebraic property.

Table 1: Contributions of this paper

Novel results, Algorithms: 1. A general scheme which covers various classical algorithms. 2. Line search of PIAG is proposed and analyzed.
Novel results, Theoretical results: 1. The (in)exact convergence of nonconvex PIAG is studied; the sequence convergence is proved by means of the semi-algebraic property. 2. Sublinear convergence of (in)exact PIAG under general convex assumptions.
Better results: 1. Much larger stepsizes are proved to guarantee convergence. 2. The numerical results show that PIAG with line search performs much better than plain PIAG.
Proof technique: Lyapunov function analysis.

• V.2. The convergence of the inexact PIAG is proved for both convex and nonconvex cases. In the convex case, convergence rates are established when the noise satisfies certain assumptions. In the nonconvex case, the sequence convergence is also proved under the semi-algebraic property and assumptions on the noise.
• V.3. We prove the sublinear convergence of PIAG under general convex assumptions. To the best of our knowledge, this is the first proof of the non-ergodic O(1/k) convergence rate of PIAG. We also prove the non-ergodic O(1/k) convergence rate for inexact PIAG.
• V.4. The convergence of PIAG with line search is proved for both convex and nonconvex cases. Convergence rates are also presented in the convex case.

2 Preliminaries

Throughout the paper, we use the notation ∆^k := x^{k+1} − x^k and σ_k := [E(‖v^k − ∑_{i=1}^m ∇f_i(x^{k−τ_{i,k}})‖² | χ_k)]^{1/2}. Assume that each f_i is differentiable and ∇f_i is L_i-Lipschitz continuous. Then ∇f is Lipschitz continuous with L := ∑_{i=1}^m L_i. The maximal delay is τ := max_{i,k}{τ_{i,k}}. The convergence analysis in this paper depends on the square summability of (σ_k)_{k≥0}, i.e., ∑_i σ_i² < +∞. That is why the general PIAG only contains the deterministic SAGA and SVRG, for which σ_k ≡ 0; the stochastic SAGA and SVRG may not satisfy this summability assumption. In the deterministic case, σ_k = ‖e^k‖ according to the general PIAG defined in (2), so we only need ∑_i ‖e^i‖² < +∞. Further, if the noise vanishes, the assumption certainly holds. 
Besides the deterministic case discussed above, the stochastic coordinate descent algorithm (with asynchronous parallelism) can also satisfy this assumption. Taking the stochastic coordinate descent algorithm as an example, in this algorithm σ_k² = ((N − 1)/N)·E‖∇f(x^k)‖² = (N − 1)·E‖∆^k‖², and it is easy to prove ∑_k σ_k² < +∞ if the stepsize is well chosen. For the asynchronous parallel algorithm, by assuming independence between x̂^k and i_k, we can prove the same result as in [Lemma 1, [27]]. We now introduce the definition of subdifferentials; details can be found in [19, 22, 23].

Definition 1 Let J : R^N → (−∞, +∞] be a proper and lower semicontinuous function. The subdifferential of J at x ∈ R^N, written ∂J(x), is defined as

∂J(x) := {u ∈ R^N : ∃ x^k → x, u^k → u, such that liminf_{y→x^k, y≠x^k} [J(y) − J(x^k) − ⟨u^k, y − x^k⟩]/‖y − x^k‖₂ ≥ 0}.

3 Convergence analysis

The analysis in this section is heavily based on the following Lyapunov function:

ξ_k(ε, δ) := F(x^k) + (L/(2ε)) ∑_{d=k−τ}^{k−1} (d − (k − τ) + 1)‖∆^d‖² + (1/(2δ)) ∑_{i=k}^{+∞} σ_i² − min F,    (4)

where ε, δ > 0 will be determined later, based on the stepsize γ and τ (the bound on τ_{i,k}). We discuss the convergence when g (the regularizer in (1)) is convex or nonconvex separately. The main difference between the two cases is the upper bound of the stepsize. 
Due to the convexity of g, the upper bound of the stepsize in the first case is twice that of the second.

3.1 g is convex

When g is convex, we consider three different types of convergence: the first is in expectation, the second is almost sure convergence, while the last relies on the semi-algebraic property [18, 15, 7].

Convergence in expectation:

Lemma 1 Let f be a (possibly nonconvex) function with L-Lipschitz gradient, let g be convex, and assume min F is finite. Let (x^k)_{k≥0} be generated by the general PIAG with max_{i,k}{τ_{i,k}} ≤ τ and ∑_i σ_i² < +∞. Choose the stepsize γ_k ≡ γ = 2c/((2τ+1)L) for arbitrary fixed 0 < c < 1. Then we can choose ε, δ > 0 to obtain

E ξ_k(ε, δ) − E ξ_{k+1}(ε, δ) ≥ (1/4)(1/γ − L/2 − τL) · E‖∆^k‖²,    lim_k E‖∆^k‖ = 0.    (5)

With the Lipschitz continuity of ∇f, we are prepared to present the convergence result.

Theorem 1 Assume the conditions of Lemma 1 hold, ∑_i σ_i² < +∞, and (x^k)_{k≥0} is generated by the general PIAG. Then we have lim_k E[dist(0, ∂F(x^k))] = 0.

Remark 1 For the cyclic PIAG, τ = M. If we apply gradient descent to (1), the stepsize should satisfy 0 < γ < 2c/(ML) for some 0 < c < 1. In this case, the stepsize of the cyclic PIAG is half that of the gradient descent algorithm for this problem.

Convergence in the almost sure sense: The almost sure convergence is proved in this part. 
We consider a Lyapunov function which is a modification of (4):

ξ̂_k(ε, δ) := F(x^k) + κ · ∑_{d=k−τ}^{k−1} (d − (k − τ) + 1)‖∆^d‖² + (1/(2δ)) ∑_{i=k}^{+∞} σ_i² − min F,    (6)

where we assume τ ≥ 1 and

κ := L/(2ε) + (1/(4τ))(1/γ − L/2 − τL).    (7)

A lemma on nonnegative almost supermartingales [21], whose details are included in the appendix, is needed to prove the almost sure convergence.

Theorem 2 Assume the conditions of Lemma 1 hold, ∑_i σ_i² < +∞, and (x^k)_{k≥0} is generated by the general PIAG. Then we have dist(0, ∂F(x^k)) → 0, a.s.

Convergence under the semi-algebraic property: If the function F satisfies the semi-algebraic property³, we can obtain more results for the inexact proximal incremental aggregated gradient algorithm. In this case, the expectations in (24) and (28) can both be removed. Similar to [Theorem 1, [28]], we can derive the following result.

Theorem 3 Assume the conditions of Lemma 1 hold, F satisfies the semi-algebraic property, (x^k)_{k≥0} is generated by the (in)exact PIAG, and ‖e^k‖ ∼ O(1/k^η) (η > 1). Then (x^k)_{k≥0} converges to a critical point of F.

3.2 g is nonconvex

In this subsection, we consider the case when g is nonconvex. Under this weaker assumption, the stepsizes are reduced to ensure convergence. As in the previous subsection, we consider three kinds of convergence, listed in sequence.

³The semi-algebraic property is used in nonconvex optimization; more details can be found in [8, 2].

Proposition 1 Assume the conditions of Theorem 1 hold except that g is nonconvex and γ_k ≡ γ = c/((2τ+1)L) for arbitrary fixed 0 < c < 1. 
Then we have lim_k E[dist(0, ∂F(x^k))] = 0.

Proposition 2 Assume the conditions of Proposition 1 hold. Then we have dist(0, ∂F(x^k)) → 0, a.s.

Proposition 3 Assume the conditions of Theorem 3 hold except that g is nonconvex and γ_k ≡ γ = c/((2τ+1)L) for arbitrary fixed 0 < c < 1. Then (x^k)_{k≥0} converges to a critical point of F.

4 Convergence rates in the convex case

In this part, we prove sublinear convergence rates of the general proximal incremental aggregated gradient algorithm in the general convex case, i.e., both f and g are convex. The analysis in this part uses a slightly modified Lyapunov function

F_k(ε, δ) := F(x^k) + κ · ∑_{d=k−τ}^{k−1} (d − (k − τ) + 1)‖∆^d‖² + λ_k − min F,    (8)

where κ is given in (7), λ_k := (1/(2δ)) ∑_{i=k}^{+∞} σ_i² + ∑_{i=k}^{+∞} φ_i², and (φ_k)_{k≥0} is a nonnegative sequence. Here, we assume τ ≥ 1.

4.1 Technical lemma

This part presents a technical lemma. The sublinear and linear convergence results are both derived from it.

Lemma 2 Assume the gradient of f is Lipschitz with L and g is convex. Choose the stepsize γ_k ≡ γ = 2c/((2τ+1)L) for arbitrary fixed 0 < c < 1, and take any positive sequence (φ_k)_{k≥0} satisfying σ_k/√2 ≤ φ_k and ∑_{i=k}^{+∞} φ_i² ≤ D φ_k² for some D > 0. 
Let x̄^k denote the projection of x^k onto arg min F (assumed to exist), and let

    α := max{1/γ + L + κτ, 2D} / min{(1/(8τ))(1/γ − L/2 − τL), 1},
    β := (τ + 1)(1/γ + L) + 1.

Then, there exist ε, δ > 0 such that

(E F_{k+1}(ε, δ))² ≤ α (E F_k(ε, δ) − E F_{k+1}(ε, δ)) × (κτ ∑_{d=k−τ}^{k−1} E‖∆^d‖² + β E‖x^{k+1} − x̄^{k+1}‖² + λ_k).    (9)

4.2 Sublinear convergence rate under general convexity

In this subsection, we present the sublinear convergence of the general proximal incremental aggregated gradient algorithm.

Theorem 4 Assume the gradient of f is Lipschitz continuous with L, g is convex, and prox_g(·) is bounded. Choose the stepsize γ_k ≡ γ = 2c/((2τ+1)L) for arbitrary fixed 0 < c < 1. Let (x^k)_{k≥0} be generated by the general proximal incremental aggregated gradient algorithm, and let σ_k ∼ O(ζ^k) with 0 < ζ < 1. Then we have

E F(x^k) − min F ∼ O(1/k).    (10)

In many cases, prox_g(·) may be unbounded. However, we can slightly modify the algorithm. For example, in the LASSO problem

min_x {‖b − Ax‖₂² + ‖x‖₁},    (11)

we can easily see that ‖x*‖₁ ≤ ‖b − A·0‖₂² + ‖0‖₁ = ‖b‖₂². This means the solution set of (11) is contained in the box X := [−‖b‖₂², ‖b‖₂²]^N. Then we can turn to solving min_x {‖b − Ax‖₂² + ‖x‖₁ + δ_X(x)}, i.e., we can set g(·) = ‖·‖₁ + δ_X(·) rather than ‖·‖₁. 
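The composed proximal map used in this LASSO modification can be sketched as follows: a minimal illustration in which the ℓ1 prox (soft-thresholding) is applied componentwise and then followed by projection onto the box X; the scalar `R` stands in for the bound ‖b‖₂².

```python
import numpy as np

def prox_l1_box(x, t, R):
    """Prox of t*||.||_1 + indicator of X = [-R, R]^N, componentwise:
    soft-threshold first, then project onto the box (valid because the
    box is a symmetric interval containing 0)."""
    shrunk = np.sign(x) * np.maximum(np.abs(x) - t, 0.0)  # prox of t*|.|
    return np.clip(shrunk, -R, R)                         # projection onto X
```

For example, `prox_l1_box(np.array([3.0, -0.5, 10.0]), 1.0, 2.0)` shrinks each entry by 1 and then clips the result to [-2, 2].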
Luckily, ‖·‖₁ + δ_X(·) is proximable. With [Theorem 2, [31]], we have [prox_{‖·‖₁+δ_X(·)}(x)]_i = prox_{δ_{[−‖b‖₂², ‖b‖₂²]}}[prox_{|·|}(x_i)] for i ∈ {1, 2, . . . , N}. In the deterministic case, the sublinear convergence still holds even when prox_g(·) is unbounded: the boundedness of prox_g(·) is only used to derive the boundedness of the sequence (x^k)_{k≥0}, and in the deterministic case this boundedness can instead be obtained from the coercivity of F.

Proposition 4 Assume the conditions of Theorem 4 hold. Let (x^k)_{k≥0} be generated by the (in)exact PIAG; then F(x^k) − min F ∼ O(1/k).

To the best of our knowledge, this is the first proof of a sublinear convergence rate for the proximal incremental aggregated gradient algorithm.

4.3 Linear convergence with larger stepsize

Assume that the function F satisfies the condition

F(x) − min F ≥ ν‖x − x̄‖²,    (12)

where x̄ is the projection of x onto arg min F and ν > 0. This property is weaker than strong convexity. If F is further differentiable, condition (12) is equivalent to restricted strong convexity [16].

Theorem 5 Assume the gradient of f is Lipschitz with L, g is convex, and F satisfies condition (12). Choose the stepsize γ_k ≡ γ = 2c/((2τ+1)L) for arbitrary fixed 0 < c < 1, and let σ_k ∼ O(ζ^k) with 0 < ζ < 1. Then we have

E F(x^k) − min F ∼ O(ω^k),    (13)

for some 0 < ω < 1.

Compared with the existing linear convergence results in [30, 12], our theoretical findings enjoy two advantages: 1. we generalize strong convexity to the much weaker condition (12); 2. 
the stepsize does not depend on the parameter ν, which permits larger steps when ν is small.

5 Line search for the proximal incremental gradient algorithm

In this part, we consider a line search version of the deterministic proximal incremental gradient algorithm. First, we set γ_k ≡ c/((2τ+1)L) if g is nonconvex, and γ_k ≡ 2c/((2τ+1)L) if g is convex. The scheme of the algorithm can be presented as follows.

Step 1: Compute v^k = ∑_{i=1}^m ∇f_i(x^{k−τ_{i,k}}).
Step 2: Find j_k as the smallest integer j such that y^k = prox_{η^j c_1 g}[x^k − η^j c_1 v^k] obeys ⟨v^k, y^k − x^k⟩ + g(y^k) − g(x^k) ≤ −(c_2/2)‖y^k − x^k‖², where 0 < η < 1 and c_1, c_2 > 0 are parameters. Set η_k = η^{j_k} c_1 if η^{j_k} c_1 ≥ γ, and η_k = γ otherwise. The point x^{k+1} is generated by x^{k+1} = prox_{η_k g}[x^k − η_k v^k].

Without the noise, the Lyapunov function loses one parameter in the analysis (we can get rid of δ). Thus, the Lyapunov function used in this part can be described as

ξ_k(ε) := F(x^k) + (L/(2ε)) ∑_{d=k−τ}^{k−1} (d − (k − τ) + 1)‖∆^d‖² − min F.    (14)

Lemma 3 Let f be a (possibly nonconvex) function with L-Lipschitz gradient, let g be nonconvex, and assume min F is finite. Let (x^k)_{k≥0} be generated by the proximal incremental aggregated gradient algorithm with line search, and max_{i,k}{τ_{i,k}} ≤ τ. Choose the parameter c_2 ≥ (2τ+1)L/c with 0 < c < 1. It then holds that lim_k dist(0, ∂F(x^k)) = 0.

In the previous result, if g is convex, the lower bound of c_2 can be halved. This is because (68) in the Appendix can be improved to F(x^{k+1}) − F(x^k) ≤ (L/(2ε)) ∑_{d=k−τ}^{k−1} ‖∆^d‖² + [(τε+1)L/2 − (2τ+1)L/(2c)]‖∆^k‖². 
This result is proved by (21). Thus, we can obtain the following result.

Lemma 4 Assume the conditions of Lemma 3 hold except that both f and g are convex and c_2 ≥ (2τ+1)L/(2c). It then holds that lim_k dist(0, ∂F(x^k)) = 0.

In fact, we can also derive convergence rates for the line search version in the convex case. The proof is very similar to that in Section 4, so we present only a sketch. As in the previous analysis, a modified Lyapunov function is needed: F_k(ε) := F(x^k) + κ̃ · ∑_{d=k−τ}^{k−1} (d − (k − τ) + 1)‖∆^d‖² − min F, where κ̃ := L/(2ε) + ((1−c)/(8cτ))(L + 2τL). With this Lyapunov function and suitable ε, we prove the following two inequalities:

F_k(ε) − F_{k+1}(ε) ≥ min{((1−c)/(8cτ))(L + 2τL), 1} · (∑_{d=k−τ}^{k} ‖∆^d‖²),    (15)

and

(F_{k+1}(ε))² ≤ (1/γ + L + κ̃τ) × (∑_{d=k−τ}^{k} ‖∆^d‖²) × ([(τ+1)(1/γ + L) + 1]‖x^{k+1} − x̄^{k+1}‖² + κ̃τ ∑_{d=k−τ}^{k−1} ‖∆^d‖²).    (16)

With (15) and (16), we then derive the following result.

Theorem 6 Let f be a convex function with L-Lipschitz gradient, let g be convex, and assume min F is finite. Let (x^k)_{k≥0} be generated by the proximal incremental aggregated gradient algorithm with line search, and max_{i,k}{τ_{i,k}} ≤ τ. Choose the parameter c_2 ≥ (2τ+1)L/(2c) with 0 < c < 1. Then there exists ε > 0 such that

(F_{k+1}(ε))² ≤ α̃ (F_k(ε) − F_{k+1}(ε)) × (κ̃τ ∑_{d=k−τ}^{k−1} ‖∆^d‖² + β‖x^{k+1} − x̄^{k+1}‖²),    (17)

where α̃ := (1/γ + L + κ̃τ)/[min{((1−c)/(8cτ))(L + 2τL), 1}]. 
Furthermore, if F is coercive, F(x^k) − min F ∼ O(1/k). If F satisfies condition (12), F(x^k) − min F ∼ O(ω̃^k) for some 0 < ω̃ < 1.

6 Numerical results

Now we use numerical experiments to show how the line search strategy can accelerate PIAG algorithms. Here we consider the following two updating rules:

1. scheme I: x^{k+1} = prox_{γg}[x^k − γ(w_j^{k+1} − w_j^k + ∑_{i=1}^m w_i^k)],
2. scheme II: x^{k+1} = prox_{γg}[x^k − γ(w_j^{k+1} − w_j^{mt} + ∑_{i=1}^m w_i^{mt})],

where j ≡ k (mod m), w_j^{k+1} = ∇f_j(x^k), and t = ⌊k/m⌋. We tested binary classifiers on MNIST and ijcnn1. To include all convex and nonconvex cases, we choose logistic regression (convex) and the squared logistic loss (nonconvex) for f, and ℓ1 regularization (convex) and MCP (nonconvex) for g. The results when using schemes I and II with and without line search are shown in Figure 6. In our experiments, we choose γ = 2c/((2τ+1)L) when g is convex and γ = c/((2τ+1)L) when g is nonconvex, with c = 0.99 and c_2 = 1/γ. Our numerical results show that the line search strategy speeds up the PIAG algorithm considerably.

7 Conclusion

In this paper, we consider a general proximal incremental aggregated gradient algorithm and prove several novel results; much better results are proved under more general conditions. The core of the analysis is the Lyapunov function technique. We also consider the line search variant of the proximal incremental aggregated gradient algorithm and prove its convergence rate.

References

[1] Alekh Agarwal and Leon Bottou. A lower bound for the optimization of finite sums. 
arXiv preprint arXiv:1410.0723, 2014.

[2] Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.

[Figure: function value versus CPU time on MNIST and ijcnn1 for the four combinations of convex/nonconvex f and convex/nonconvex g, comparing schemes I and II with and without line search.]

[3] Amir Beck. On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes. SIAM Journal on Optimization, 25(1):185–209, 2015.

[4] Dimitri P Bertsekas. Nonlinear programming. Athena Scientific, Belmont, 1999.

[5] Dimitri P Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010(1-38):3, 2011.

[6] Doron Blatt, Alfred O Hero, and Hillel Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.

[7] Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. 
SIAM Journal on Optimization, 17(4):1205–1223, 2007.

[8] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.

[9] Tianyi Chen, Georgios B Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. In Advances in Neural Information Processing Systems, 2018.

[10] Patrick L Combettes and Valérie R Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.

[11] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[12] Mert Gurbuzbalaban, Asuman Ozdaglar, and Pablo A Parrilo. On the convergence rate of incremental aggregated gradient algorithms. SIAM Journal on Optimization, 27(2):1035–1048, 2017.

[13] Mert Gürbüzbalaban, Asuman Ozdaglar, and Pablo Parrilo. A globally convergent incremental Newton method. Mathematical Programming, 151(1):283–313, 2015.

[14] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[15] Krzysztof Kurdyka. On gradients of functions definable in o-minimal structures. In Annales de l'institut Fourier, volume 48, pages 769–784, 1998.

[16] Ming-Jun Lai and Wotao Yin. Augmented ℓ1 and nuclear-norm models with a globally linearly convergent algorithm. SIAM Journal on Imaging Sciences, 6(2):1059–1091, 2013.

[17] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms?
A case study for decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1705.09056, 2017.

[18] Stanislas Łojasiewicz. Sur la géométrie semi- et sous-analytique. Annales de l'institut Fourier, 43(5):1575–1595, 1993.

[19] Boris S Mordukhovich. Variational Analysis and Generalized Differentiation I: Basic Theory, volume 330. Springer Science & Business Media, 2006.

[20] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[21] Herbert Robbins and David Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer, 1985.

[22] R Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

[23] Ralph Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 2015.

[24] Ernest K Ryu and Wotao Yin. Proximal-proximal-gradient method. arXiv preprint arXiv:1708.06908, 2017.

[25] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

[26] Mikhail V Solodov. Incremental gradient algorithms with stepsizes bounded away from zero. Computational Optimization and Applications, 11(1):23–35, 1998.

[27] Tao Sun, Robert Hannah, and Wotao Yin. Asynchronous coordinate descent under more realistic assumptions. In Advances in Neural Information Processing Systems 30, pages 6183–6191, 2017.

[28] Tao Sun, Hao Jiang, Lizhi Cheng, and Wei Zhu. A convergence frame for inexact nonconvex and nonsmooth algorithms and its applications to several iterations. arXiv preprint arXiv:1709.04072, 2017.

[29] Paul Tseng and Sangwoon Yun.
Incrementally updated gradient methods for constrained and regularized optimization. Journal of Optimization Theory and Applications, 160(3):832–853, 2014.

[30] Nuri Denizcan Vanli, Mert Gurbuzbalaban, and Asuman Ozdaglar. Global convergence rate of proximal incremental aggregated gradient methods. arXiv preprint arXiv:1608.01713, 2016.

[31] Yao-Liang Yu. On decomposing the proximal map. In Advances in Neural Information Processing Systems, pages 91–99, 2013.
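To make the updating rules of Section 6 concrete, the following is a minimal sketch of scheme I in Python on a toy ℓ1-regularized least-squares problem. It is not the paper's experimental setup (which uses logistic losses on MNIST and ijcnn1 with line search); the quadratic components f_i(x) = ½(aᵢ·x − bᵢ)², the function names, and the problem data are hypothetical stand-ins chosen so the aggregated-gradient bookkeeping is easy to follow.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (here g is taken to be the l1 norm)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def piag_scheme1(A, b, lam, gamma, iters):
    """Scheme I: at step k only the gradient of component j = k mod m is
    refreshed; the proximal step uses the aggregated (partly stale) gradient
    w_j^{k+1} - w_j^k + sum_i w_i^k, maintained in O(n) per iteration."""
    m, n = A.shape
    x = np.zeros(n)
    # stored component gradients w_i = grad f_i(x), f_i(x) = 0.5*(a_i.x - b_i)^2
    W = A * (A @ x - b)[:, None]
    agg = W.sum(axis=0)
    for k in range(iters):
        j = k % m
        new_wj = A[j] * (A[j] @ x - b[j])   # w_j^{k+1} = grad f_j(x^k)
        agg += new_wj - W[j]                # refresh the aggregated gradient
        W[j] = new_wj
        x = soft_threshold(x - gamma * agg, gamma * lam)  # prox_{gamma g} step
    return x

def objective(A, b, lam, x):
    r = A @ x - b
    return 0.5 * r @ r + lam * np.abs(x).sum()
```

Scheme II differs only in that the stored gradients are compared against the snapshot taken at iteration mt with t = ⌊k/m⌋ (i.e., replace W[j] and the sum by their values at the last full pass).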