{"title": "A Multi-step Inertial Forward-Backward Splitting Method for Non-convex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 4035, "page_last": 4043, "abstract": "In this paper, we propose a multi-step inertial Forward--Backward splitting algorithm for minimizing the sum of two non-necessarily convex functions, one of which is proper lower semi-continuous while the other is differentiable with a Lipschitz continuous gradient. We first prove global convergence of the scheme with the help of the Kurdyka\u2013\u0141ojasiewicz property. Then, when the non-smooth part is also partly smooth relative to a smooth submanifold, we establish finite identification of the latter and provide sharp local linear convergence analysis. The proposed method is illustrated on a few problems arising from statistics and machine learning.", "full_text": "A Multi-step Inertial Forward\u2013Backward Splitting\n\nMethod for Non-convex Optimization\n\nJingwei Liang and Jalal M. Fadili\n\nNormandie Univ, ENSICAEN, CNRS, GREYC\n\n{Jingwei.Liang,Jalal.Fadili}@greyc.ensicaen.fr\n\nGabriel Peyr\u00e9\n\nCNRS, DMA, ENS Paris\nGabriel.Peyre@ens.fr\n\nAbstract\n\nWe propose a multi-step inertial Forward\u2013Backward splitting algorithm for mini-\nmizing the sum of two non-necessarily convex functions, one of which is proper\nlower semi-continuous while the other is differentiable with a Lipschitz continuous\ngradient. We \ufb01rst prove global convergence of the algorithm with the help of the\nKurdyka-\u0141ojasiewicz property. Then, when the non-smooth part is also partly\nsmooth relative to a smooth submanifold, we establish \ufb01nite identi\ufb01cation of the\nlatter and provide sharp local linear convergence analysis. 
The proposed method is illustrated on several problems arising from statistics and machine learning.\n\n1 Introduction\n\n1.1 Non-convex non-smooth optimization\n\nNon-smooth optimization has proved extremely useful to all quantitative disciplines of science including statistics and machine learning. A common trend in modern science is the increase in size of datasets, which drives the need for more efficient optimization schemes. For large-scale problems with non-smooth and possibly non-convex terms, it is possible to generalize gradient descent with the Forward–Backward (FB) splitting scheme [3] (a.k.a. proximal gradient descent), which includes projected gradient descent as a sub-case.\nFormally, we equip the n-dimensional Euclidean space Rn with the standard inner product ⟨·, ·⟩ and the associated norm ||·||. Our goal is the generic minimization of composite objectives of the form\n\n  min_{x ∈ Rn} { Φ(x) def= R(x) + F(x) },   (P)\n\nwhere we have\n\n(A.1) R : Rn → R ∪ {+∞} is the penalty function, which is proper lower semi-continuous (lsc) and bounded from below;\n(A.2) F : Rn → R is the loss function, which is finite-valued, differentiable, and its gradient ∇F is L-Lipschitz continuous.\n\nThroughout, no convexity is imposed on either R or F.\nThe class of problems we consider is that of non-smooth and non-convex optimization problems. Here are some examples that are of particular relevance to problems in regression, machine learning and classification.\nExample 1.1 (Sparse regression). Let A ∈ Rm×n, y ∈ Rm, µ > 0, and let ||x||_0 be the ℓ0 pseudo-norm (see Example 4.1). Consider (see e.g. [11])\n\n  min_{x ∈ Rn} (1/2)||y − Ax||^2 + µ||x||_0.   (1.1)\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nExample 1.2 (Principal component pursuit (PCP)). 
The PCP problem [9] aims at decomposing a given matrix into sparse and low-rank components:\n\n  min_{(xs,xl) ∈ (Rn1×n2)^2} (1/2)||y − xs − xl||_F^2 + µ1||xs||_0 + µ2 rank(xl),   (1.2)\n\nwhere ||·||_F is the Frobenius norm and µ1, µ2 > 0.\nExample 1.3 (Sparse Support Vector Machines). One would like to find a linear decision function which minimizes the objective\n\n  min_{(b,x) ∈ R×Rn} (1/m) ∑_{i=1}^m G(⟨x, zi⟩ + b, yi) + µ||x||_0,   (1.3)\n\nwhere for i = 1, . . . , m, (zi, yi) ∈ Rn × {±1} is the training set, and G is a smooth loss function with Lipschitz-continuous gradient, such as the squared hinge loss G(ŷi, yi) = max(0, 1 − ŷi yi)^2 or the logistic loss G(ŷi, yi) = log(1 + e^{−ŷi yi}).\n(Inertial) Forward–Backward The Forward–Backward splitting method for solving (P) reads\n\n  x_{k+1} ∈ prox_{γkR}(xk − γk∇F(xk)),   (1.4)\n\nwhere γk > 0 is a descent step-size, and\n\n  prox_{γR}(·) def= Argmin_{x ∈ Rn} (1/2)||x − ·||^2 + γR(x)   (1.5)\n\ndenotes the proximity operator of R. prox_{γR}(x) is non-empty under (A.1) and is set-valued in general. Lower-boundedness of R can be relaxed by requiring e.g. coercivity of the objective in (1.5).\nSince the pioneering work of Polyak [24] on the heavy-ball approach to gradient descent, several works have adapted this methodology to various optimization schemes, for instance the inertial proximal point algorithm [1, 2] and the inertial FB methods [22, 21, 4, 20]. The FISTA scheme [5, 10] also belongs to this class. 
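As a concrete illustration, the FB iteration (1.4) applied to the sparse regression problem (1.1) takes only a few lines: here F(x) = (1/2)||y − Ax||^2, and the prox of γµ||·||_0 is entrywise hard-thresholding (see Example 4.1). This is a minimal sketch; the problem sizes, step-size factor and iteration count below are illustrative assumptions, not the settings used in the paper's experiments.

```python
import numpy as np

def hard_threshold(z, gamma):
    # prox of gamma*||.||_0: zero every entry with |z_i| < sqrt(2*gamma)
    # (ties |z_i| = sqrt(2*gamma) are set-valued; this sketch keeps z_i).
    out = z.copy()
    out[np.abs(z) < np.sqrt(2 * gamma)] = 0.0
    return out

def fb_l0(A, y, mu, n_iter=500):
    # Forward-Backward splitting (1.4) for 0.5*||y - A x||^2 + mu*||x||_0
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad F
    gamma = 0.9 / L                      # step-size in ]0, 1/L[
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                          # forward step
        x = hard_threshold(x - gamma * grad, gamma * mu)  # backward step
    return x
```

With γ < 1/L this iteration monotonically decreases the objective, which is the basic descent property exploited in the global convergence analysis below.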
See [20] for a detailed account.\nThe non-convex case In the context of non-convex optimization, [3] was the first to establish convergence of the FB iterates when the objective Φ satisfies the Kurdyka-Łojasiewicz property1. Following in their footsteps, [8, 23] established convergence of the special inertial schemes of [22] in the non-convex setting.\n\n1.2 Contributions\n\nIn this paper, we introduce a novel inertial scheme (Algorithm 1) and study its global and local properties to solve the non-smooth and non-convex optimization problem (P). More precisely, our main contributions can be summarized as follows.\nA globally convergent general inertial scheme We propose a general multi-step inertial FB (MiFB) algorithm to solve (P). This algorithm is very flexible as it allows higher memory and even negative inertial parameters (unlike previous work [20]). Global convergence of any bounded sequence of iterates to a critical point is proved when the objective Φ is lower-bounded and satisfies the Kurdyka-Łojasiewicz property.\nLocal convergence properties under partial smoothness Under the additional assumptions that the smooth part is locally C^2 around a critical point x⋆ (where xk → x⋆), and that the non-smooth component R is partly smooth (see Definition 3.1) relative to an active submanifold M_{x⋆}, we show that M_{x⋆} can be identified in finite time, i.e. xk ∈ M_{x⋆} for all k large enough. Building on this finite identification result, we provide a sharp local linear convergence analysis and we characterize precisely the corresponding convergence rate, which, in particular, reveals the role of M_{x⋆}. 
Moreover, this local convergence analysis naturally opens the door to higher-order acceleration, since under some circumstances, the original problem (P) is eventually equivalent to locally minimizing Φ on M_{x⋆}, and partial smoothness implies that Φ is actually C^2 on M_{x⋆}.\n\n1We are aware of the works existing on convergence of the objective sequence Φ(xk) of FB, including rates, in the non-smooth and non-convex setting. But given that, in general, this does not say anything about convergence of the sequence of iterates xk, they are irrelevant to our discussion.\n\nAlgorithm 1: A Multi-step Inertial Forward–Backward (MiFB)\nInitial: s ≥ 1 is an integer, I = {0, 1, . . . , s − 1}, x0 ∈ Rn and x_{−s} = . . . = x_{−1} = x0.\nrepeat\n  Let 0 < γ ≤ γk ≤ γ̄ < 1/L, {a_{0,k}, . . . , a_{s−1,k}} ∈ ]−1, 2]^s, {b_{0,k}, . . . , b_{s−1,k}} ∈ ]−1, 2]^s:\n    y_{a,k} = xk + ∑_{i∈I} a_{i,k}(x_{k−i} − x_{k−i−1}),   (1.6)\n    y_{b,k} = xk + ∑_{i∈I} b_{i,k}(x_{k−i} − x_{k−i−1}),   (1.7)\n    x_{k+1} ∈ prox_{γkR}(y_{a,k} − γk∇F(y_{b,k})).\n  k = k + 1;\nuntil convergence;\n\n1.3 Notations\nThroughout the paper, N is the set of non-negative integers. For a nonempty closed convex set Ω ⊂ Rn, ri(Ω) is its relative interior, and par(Ω) = R(Ω − Ω) is the subspace parallel to it.\nLet R : Rn → R ∪ {+∞} be a lsc function; its domain is defined as dom(R) def= {x ∈ Rn : R(x) < +∞}, and it is said to be proper if dom(R) ≠ ∅. We need the following notions from variational analysis, see e.g. [25] for details. Given x ∈ dom(R), the Fréchet subdifferential ∂F R(x) of R at x is the set of vectors v ∈ Rn that satisfy lim inf_{z→x, z≠x} (1/||x − z||)(R(z) − R(x) − ⟨v, z − x⟩) ≥ 0. 
If x ∉ dom(R), then ∂F R(x) = ∅. The limiting-subdifferential (or simply subdifferential) of R at x, written ∂R(x), is defined as ∂R(x) def= {v ∈ Rn : ∃ xk → x, R(xk) → R(x), vk ∈ ∂F R(xk) → v}. Denote dom(∂R) def= {x ∈ Rn : ∂R(x) ≠ ∅}. Both ∂F R(x) and ∂R(x) are closed, with ∂F R(x) convex and ∂F R(x) ⊂ ∂R(x) [25, Proposition 8.5]. Since R is lsc, it is (subdifferentially) regular at x if and only if ∂F R(x) = ∂R(x) [25, Corollary 8.11].\nAn lsc function R is r-prox-regular at x̄ ∈ dom(R) for v̄ ∈ ∂R(x̄) if there exists r > 0 such that R(x′) > R(x) + ⟨v, x′ − x⟩ − (1/(2r))||x − x′||^2 for all x, x′ near x̄, R(x) near R(x̄) and v ∈ ∂R(x) near v̄.\nA necessary condition for x to be a minimizer of R is 0 ∈ ∂R(x). The set of critical points of R is crit(R) = {x ∈ Rn : 0 ∈ ∂R(x)}.\n\n2 Global convergence of MiFB\n\n2.1 Kurdyka-Łojasiewicz property\nLet R : Rn → R ∪ {+∞} be a proper lsc function. For η1, η2 such that −∞ < η1 < η2 < +∞, define the set [η1 < R < η2] def= {x ∈ Rn : η1 < R(x) < η2}.\nDefinition 2.1. 
R is said to have the Kurdyka-Łojasiewicz property at x̄ ∈ dom(R) if there exist η ∈ ]0, +∞], a neighbourhood U of x̄ and a continuous concave function ϕ : [0, η[ → R+ such that\n\n(i) ϕ(0) = 0, ϕ is C^1 on ]0, η[, and for all s ∈ ]0, η[, ϕ′(s) > 0;\n(ii) for all x ∈ U ∩ [R(x̄) < R < R(x̄) + η], the Kurdyka-Łojasiewicz inequality holds:\n\n  ϕ′(R(x) − R(x̄)) dist(0, ∂R(x)) ≥ 1.   (2.1)\n\nProper lsc functions which satisfy the Kurdyka-Łojasiewicz property at each point of dom(∂R) are called KL functions.\nRoughly speaking, KL functions become sharp up to reparameterization via ϕ, called a desingularizing function for R. Typical KL functions are the class of semi-algebraic functions, see [6, 7]. For instance, the ℓ0 pseudo-norm and the rank function (see Examples 1.1-1.3 and Section 4.1) are indeed KL.\n\n2.2 Global convergence\nLet µ, ν > 0 be two constants. For i ∈ I and k ∈ N, define the following quantities:\n\n  βk def= (1 − γkL − µ − νγk)/(2γk),  β def= lim inf_{k∈N} βk,  α_{i,k} def= s a_{i,k}^2/(2γkµ) + s b_{i,k}^2 L^2/(2ν),  αi def= lim sup_{k∈N} α_{i,k}.   (2.2)\n\nTheorem 2.2 (Convergence of MiFB (Algorithm 1)). For problem (P), suppose that (A.1)-(A.2) hold, and moreover that Φ is a proper lsc KL function. 
For Algorithm 1, choose µ, ν, γk, a_{i,k}, b_{i,k} such that\n\n  δ def= β − ∑_{i∈I} αi > 0.   (2.3)\n\nThen each bounded sequence {xk}_{k∈N} generated by MiFB satisfies\n\n(i) {xk}_{k∈N} has finite length, i.e. ∑_{k∈N} ||xk − x_{k−1}|| < +∞;\n(ii) There exists a critical point x⋆ ∈ crit(Φ) such that lim_{k→∞} xk = x⋆.\n(iii) If Φ has the KL property at a global minimizer x⋆, then starting sufficiently close from x⋆, any sequence {xk}_{k∈N} converges to a global minimum of Φ and satisfies (i).\n\nThe proof is detailed in the supplementary material.\nRemark 2.3.\n\n(i) The convergence result holds true for any real Hilbert space. The boundedness of {xk}_{k∈N} is automatically ensured under standard assumptions such as coercivity of Φ.\n(ii) It is known from [13] that if the desingularizing function is ϕ(t) = (C/θ)t^θ with C > 0 and θ ∈ [1/2, 1[, then global linear convergence of the objective and the iterates can be derived. However, we will not pursue this further since our main interest is local linear convergence.\n(iii) Unlike existing work, negative inertial parameters are allowed by Theorem 2.2.\n(iv) When a_{i,k} ≡ 0 and b_{i,k} ≡ 0, i.e. the case of FB splitting, condition (2.3) holds naturally as long as γ̄ < 1/L, which recovers the case of [3].\n(v) From (2.2) and (2.3), we conclude the following:\n  (a) s = 1: if b_{0,k} ≡ b and a_{0,k} ≡ a (i.e. constant inertial parameters), then (2.3) implies that a, b belong to an ellipsoid: a^2/(2γµ) + b^2/(2ν/L^2) < β = (1 − γL − µ − νγ)/(2γ);\n  (b) When s ≥ 2, for each i ∈ I, let b_{i,k} = a_{i,k} ≡ ai (i.e. constant symmetric inertial parameters); then (2.3) means that the ai's live in a ball: s(1/(2γµ) + 1/(2ν/L^2)) ∑_{i∈I} ai^2 < β.\n\nAn empirical approach for inertial parameters Besides Theorem 2.2, we also provide an empirical bound for the choice of the inertial parameters. Consider the setting γk ≡ γ ∈ ]0, 1/L[ and b_{i,k} = a_{i,k} ≡ ai ∈ ]−1, 2[, i ∈ I. We have the following empirical bound for the summand ∑_{i∈I} ai:\n\n  ∑_{i∈I} ai ∈ ]0, min{1, (1/L − γ)/|2γ − 1/L|}].   (2.4)\n\nTo ensure the convergence of {xk}_{k∈N}, an online updating rule should be applied together with the empirical bound. More precisely, choose the ai according to (2.4). Then, for each k ∈ N, let b_{i,k} = a_{i,k} and choose a_{i,k} such that ∑_i a_{i,k} = min{∑_i ai, ck}, where ck > 0 is such that {ck ∑_{i∈I} ||x_{k−i} − x_{k−i−1}||}_{k∈N} is summable; for instance, ck = c/(k^{1+q} ∑_{i∈I} ||x_{k−i} − x_{k−i−1}||), c > 0, q > 0.\nNote that the allowed choices of the summand ∑_i ai by (2.4) are larger than those of Theorem 2.2. For instance, (2.4) allows ∑_i ai = 1 for γ ∈ ]0, 2/(3L)], while for Theorem 2.2, ∑_i ai = 1 can be reached only when γ → 0.\n\n3 Local convergence properties of MiFB\n\n3.1 Partial smoothness\nLet M ⊂ Rn be a C^2-smooth submanifold and let T_M(x) denote the tangent space of M at any point x ∈ M.\nDefinition 3.1. 
The function R : Rn → R ∪ {+∞} is C^2-partly smooth at x̄ ∈ M relative to M for v̄ ∈ ∂R(x̄) ≠ ∅ if M is a C^2-submanifold around x̄, and\n(i) (Smoothness): R restricted to M is C^2 around x̄;\n(ii) (Regularity): R is regular at all x ∈ M near x̄ and R is r-prox-regular at x̄ for v̄;\n(iii) (Sharpness): T_M(x̄) = par(∂R(x̄))^⊥;\n(iv) (Continuity): The set-valued mapping ∂R is continuous at x̄ relative to M.\n\nWe denote the class of partly smooth functions at x relative to M for v as PSF_{x,v}(M). Partial smoothness was first introduced in [15] and its directional version stated here is due to [18, 12]. Prox-regularity is sufficient to ensure that the partly smooth submanifolds are locally unique [18, Corollary 4.12], [12, Lemma 2.3 and Proposition 10.12].\n\n3.2 Finite activity identification\n\nOne of the key consequences of partial smoothness is finite identification of the partial smoothness submanifold associated to R for problem (P). This is formalized in the following statement.\nTheorem 3.2 (Finite activity identification). Suppose that Algorithm 1 is run under the conditions of Theorem 2.2, such that the generated sequence {xk}_{k∈N} converges to a critical point x⋆ ∈ crit(Φ). Assume that R ∈ PSF_{x⋆,−∇F(x⋆)}(M_{x⋆}) and that the non-degeneracy condition\n\n  −∇F(x⋆) ∈ ri(∂R(x⋆))   (ND)\n\nholds. Then, xk ∈ M_{x⋆} for all k large enough.\nSee the supplementary material for the proof. This result generalizes that of [20] to the non-convex case and multiple inertial steps.\n\n3.3 Local linear convergence\nGiven γ ∈ ]0, 1/L[ and a critical point x⋆ ∈ crit(Φ), let M_{x⋆} be a C^2-smooth submanifold with R ∈ PSF_{x⋆,−∇F(x⋆)}(M_{x⋆}). Denote T_{x⋆} def= T_{M_{x⋆}}(x⋆) and define the following matrices, which are all symmetric:\n\n  H def= γ P_{T_{x⋆}} ∇^2F(x⋆) P_{T_{x⋆}},  G def= Id − H,  Q def= γ ∇^2_{M_{x⋆}} Φ(x⋆) P_{T_{x⋆}} − H,   (3.1)\n\nwhere ∇^2_{M_{x⋆}} Φ is the Riemannian Hessian of Φ along the submanifold M_{x⋆} (readers may refer to the supplementary material for more details on differential calculus on Riemannian manifolds).\nTo state our local linear convergence result, the following assumptions will play a key role.\nRestricted injectivity Besides the local C^2-smoothness assumption on F, following the idea of [19, 20], we assume the restricted injectivity condition\n\n  ker(∇^2F(x⋆)) ∩ T_{x⋆} = {0}.   (RI)\n\nPositive semi-definiteness of Q Assume that Q is positive semi-definite, i.e. for all h ∈ T_{x⋆},\n\n  ⟨h, Qh⟩ ≥ 0.   (3.2)\n\nUnder (3.2), Id + Q is symmetric positive definite, hence invertible; we denote P def= (Id + Q)^{−1}.\nConvergent parameters The parameters of MiFB (Algorithm 1) are convergent, i.e.\n\n  a_{i,k} → ai, b_{i,k} → bi, for all i ∈ I, and γk → γ ∈ [γ, min{γ̄, r̄}],   (3.3)\n\nwhere r̄ < r, and r is the prox-regularity modulus of R (see Definition 3.1).\nRemark 3.3.\n\n(i) Condition (3.2) can be met by various non-convex functions, such as polyhedral functions, including the ℓ0 pseudo-norm. 
For the rank function, this condition is also observed to hold in our numerical experiments of Section 4.\n(ii) Condition (3.3) asserts that both the inertial parameters (a_{i,k}, b_{i,k}) and the step-size γk should converge to some limit points, and this condition cannot be relaxed in general.\n(iii) It can be shown that conditions (3.2) and (RI) together imply that x⋆ is a local minimizer of Φ in (P), and that Φ grows at least quadratically near x⋆. The arguments to prove this are essentially adapted from those used to show [20, Proposition 4.1(ii)].\n\nWe need the following notations:\n\n  M0 def= (a0 − b0)P + (1 + b0)PG,\n  Mi def= −((a_{i−1} − ai) − (b_{i−1} − bi))P − (b_{i−1} − bi)PG, i = 1, . . . , s − 1,\n  Ms def= −(a_{s−1} − b_{s−1})P − b_{s−1}PG,   (3.4)\n\nand the block matrix and stacked error vector\n\n  M def= [ M0 ··· M_{s−1} Ms ; Id ··· 0 0 ; . . . ; 0 ··· Id 0 ],  dk def= ( xk − x⋆ ; . . . ; x_{k−s} − x⋆ ).\n\nTheorem 3.4 (Local linear convergence). Suppose that Algorithm 1 is run under the setting of Theorem 3.2. Moreover, assume that (RI), (3.2) and (3.3) hold. Then, for all k large enough,\n\n  d_{k+1} = M dk + o(||dk||).   (3.5)\n\nIf ρ(M) < 1, then given any ρ ∈ ]ρ(M), 1[, there exists K ∈ N such that for all k ≥ K,\n\n  ||xk − x⋆|| = O(ρ^{k−K}).   (3.6)\n\nIn particular, if s = 1, then ρ(M) < 1 if R is locally polyhedral around x⋆ or if a0 = b0.\nSee the supplementary material for the proof.\nRemark 3.5.\n\n(i) When s = 1, ρ(M) can be given explicitly in terms of the parameters of the algorithm (i.e. a0, b0 and γ); see [20, Section 4.2] for details. 
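To make the role of the matrix M in (3.4)-(3.5) concrete, the sketch below builds M for s = 1 in the simplified situation where R is locally polyhedral around x⋆ (so the Riemannian Hessian term drops, Q = 0 and P = Id) and a0 = b0 = a, and computes its spectral radius numerically. The restricted Hessian used here is a toy positive-definite example of our own choosing (so (RI) and (3.2) hold), not data from the paper.

```python
import numpy as np

def rho_M(hessF_T, gamma, a):
    """Spectral radius of the linearization matrix M of (3.4) for s = 1,
    assuming R locally polyhedral around x* (P = Id, Q = 0) and a0 = b0 = a."""
    d = hessF_T.shape[0]
    G = np.eye(d) - gamma * hessF_T          # G = Id - gamma * P_T Hess F(x*) P_T
    M0 = (a - a) * np.eye(d) + (1 + a) * G   # (a0 - b0) P + (1 + b0) P G
    M1 = -(a - a) * np.eye(d) - a * G        # -(a0 - b0) P - b0 P G
    M = np.block([[M0, M1], [np.eye(d), np.zeros((d, d))]])
    return max(abs(np.linalg.eigvals(M)))

hessF_T = np.diag([1.0, 0.5, 0.1])  # toy restricted Hessian, positive definite
gamma = 0.9                         # step-size below 1/L, with L = 1 here
print(rho_M(hessF_T, gamma, a=0.0)) # plain FB linearization
print(rho_M(hessF_T, gamma, a=0.3)) # 1-step inertia: smaller radius on this toy
```

On this toy problem the inertial choice yields a strictly smaller ρ(M), consistent with the faster local rates reported in Section 4; for s ≥ 2 one appends further blocks Mi to the first block row of M.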
However, the spectral analysis of M becomes much more complicated for s ≥ 2, where the analysis of at least cubic equations is involved. Therefore, for the sake of brevity, we shall skip the detailed discussion here.\n(ii) When s = 1, it was shown in [20] that the optimal convergence rate that can be obtained by a 1-step inertial scheme with fixed γ is ρ⋆_{s=1} = 1 − √(1 − τγ), where, from condition (RI), continuity of ∇^2F at x⋆ implies that there exist τ > 0 and a neighbourhood of x⋆ such that ⟨h, ∇^2F(x⋆)h⟩ ≥ τ||h||^2 for all h ∈ T_{x⋆}. As we will see in the numerical experiments of Section 4, such a rate can be improved by our multi-step inertial scheme. Taking s = 2 for example, we will show that for a certain class of functions, the optimal local linear rate is close to or even equal to ρ⋆_{s=2} = 1 − (1 − τγ)^{1/3}, which is obviously faster than ρ⋆_{s=1}.\n(iii) Though it can be satisfied for many problems in practice, the restricted injectivity (RI) can be removed for some penalties R, for instance when R is locally polyhedral near x⋆.\n\n4 Numerical experiments\n\nIn this section, we illustrate our results with some numerical experiments carried out on the problems in Examples 1.1, 1.2 and 1.3.\n\n4.1 Examples of KL and partly smooth functions\n\nAll the objectives Φ in the above mentioned examples are continuous KL functions. Indeed, in Examples 1.1 and 1.2, Φ is the sum of semi-algebraic functions, which is also semi-algebraic. In Example 1.3, Φ is also semi-algebraic when G is the squared hinge loss, and definable in an o-minimal structure for the logistic loss (see e.g. [26] for material on o-minimal structures).\nMoreover, R is partly smooth in all these examples, as we show now.\nExample 4.1 (ℓ0 pseudo-norm). 
The ℓ0 pseudo-norm is locally constant. Moreover, it is regular on Rn ([14, Remark 2]) and its subdifferential is given by (see [14, Theorem 1])\n\n  ∂||x||_0 = span((ei)_{i ∈ supp(x)^c}),\n\nwhere (ei)_{i=1,...,n} is the standard basis and supp(x) = {i : xi ≠ 0}. The proximity operator of the ℓ0 pseudo-norm is given by hard-thresholding:\n\n  prox_{γ||·||_0}(z) = { z if |z| > √(2γ);  {0, z} if |z| = √(2γ);  0 if |z| < √(2γ) }.\n\nIt can then be easily verified that the ℓ0 pseudo-norm is partly smooth at any x relative to the subspace\n\n  Mx = Tx = {z ∈ Rn : supp(z) ⊂ supp(x)}.\n\nIt is also prox-regular at x for any bounded v ∈ ∂||x||_0. Note also that condition (ND) is automatically verified and that the Riemannian gradient and Hessian of ||·||_0 along Tx vanish.\nExample 4.2 (Rank). The rank function is the spectral extension of the ℓ0 pseudo-norm to matrix-valued data x ∈ Rn1×n2 [17]. Consider a singular value decomposition (SVD) of x, i.e. x = U diag(σ(x)) V*, where U = (u1, . . . , un), V = (v1, . . . , vn) are orthonormal matrices, and σ(x) = (σi(x))_{i=1,...,n} is the vector of singular values. By definition, rank(x) def= ||σ(x)||_0. Thus the rank function is partly smooth at x relative to the set of fixed-rank matrices\n\n  Mx = {z ∈ Rn1×n2 : rank(z) = rank(x)},\n\nwhich is a C^2-smooth submanifold [16]. The tangent space of Mx at x is\n\n  T_{Mx}(x) = Tx = {z ∈ Rn1×n2 : ui* z vj = 0 for all r < i ≤ n1, r < j ≤ n2},\n\nwhere r = rank(x). The rank function is also regular, and its subdifferential reads\n\n  ∂rank(x) = U ∂(||σ(x)||_0) V* = U span((ei)_{i ∈ supp(σ(x))^c}) V*,\n\nwhich is a vector space (see [14, Theorem 4 and Proposition 1]). 
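Both proximity operators reduce to hard-thresholding: entrywise for the ℓ0 pseudo-norm, and on the singular values for the rank function. A minimal sketch follows; note that on the tie |z_i| = √(2γ) the prox is set-valued, and this sketch simply picks the selection 0, a simplifying assumption.

```python
import numpy as np

def prox_l0(z, gamma):
    # Hard-thresholding: entries with |z_i| <= sqrt(2*gamma) are set to 0
    # (on the tie |z_i| = sqrt(2*gamma) the prox is set-valued; we pick 0).
    out = z.copy()
    out[np.abs(z) <= np.sqrt(2 * gamma)] = 0.0
    return out

def prox_rank(Z, gamma):
    # Prox of gamma*rank: hard-threshold the singular values of Z
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(prox_l0(s, gamma)) @ Vt
```

Since both prox maps act only through hard-thresholding, applying them inside the FB/MiFB iteration costs one SVD per step in the rank case and is elementwise otherwise.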
The proximity operator of the rank function amounts to applying hard-thresholding to the singular values. Observe that by definition of Mx, the Riemannian gradient and Hessian of the rank function along Mx also vanish.\nFor Example 1.2, it is worth noting from the above examples and the separability of the regularizer that the latter is also partly smooth relative to the Cartesian product of the partial smoothness submanifolds of the ℓ0 and rank functions.\n\n4.2 Experimental results\n\nFor the problem in Example 1.1, we generated y = A x_ob + ω with m = 48, n = 128, where the entries of A are i.i.d. zero-mean and unit variance Gaussian, x_ob is 8-sparse, and ω ∈ Rm is an additive noise with small variance.\nFor the problem in Example 1.2, we generated y = xs + xl + ω with n1 = n2 = 50, where xs is 250-sparse, the rank of xl is 5, and ω is an additive noise with small variance.\nFor Example 1.3, we generated m = 64 training samples with an n = 96-dimensional feature space.\nFor all presented numerical results, 3 different settings were tested:\n\n• the FB method, with γk ≡ 0.3/L, denoted “FB”;\n• MiFB with s = 1, bk = ak ≡ a and γk ≡ 0.3/L, denoted “1-iFB”;\n• MiFB with s = 2, b_{i,k} = a_{i,k} ≡ ai, i = 0, 1, and γk ≡ 0.3/L, denoted “2-iFB”.\n\nTightness of theoretical prediction The convergence profiles of ||xk − x⋆|| are shown in Figure 1. As can be seen from all the plots, finite identification and local linear convergence indeed occur. The positions of the green dots indicate the iteration from which xk numerically identifies the submanifold M_{x⋆}. 
The solid lines (“P”) represent practical observations, while the dashed lines (“T”) denote theoretical predictions.\nAs the Riemannian Hessians of ℓ0 and the rank both vanish in all examples, our predicted rates coincide exactly with the observed ones (same slopes for the dashed and solid lines).\n\n(a) Sparse regression (b) PCP (c) Sparse SVM\n\nFigure 1: Finite identification and local linear convergence of MiFB under different inertial settings in terms of ||xk − x⋆||. “P” stands for practical observation and “T” indicates the theoretical estimate. We fix γk ≡ 0.3/L for all tests. For the 2 inertial schemes, inertial parameters are first chosen such that (2.3) holds. The position of the green dot in each plot indicates the iteration beyond which identification of M_{x⋆} occurs.\nComparison of the methods Under the tested settings, we draw the following remarks on the comparison of the inertial schemes:\n\n• The MiFB scheme is much faster than FB, both globally and locally. Finite activity identification also occurs earlier for MiFB than for FB;\n• Comparing the two MiFB inertial schemes, “2-iFB” outperforms “1-iFB”, showing the advantages of a 2-step inertial scheme over the 1-step one.\n\nOptimal first-order method To highlight the potential of multiple steps in MiFB, for the “2-iFB” scheme, we also added an example where we locally optimized the rate over the inertial parameters. 
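For reference, the constant-parameter MiFB update (1.6)-(1.7) used in these tests can be sketched as follows. We take b_{i,k} = a_{i,k} ≡ a_i, so y_{a,k} = y_{b,k}; the quadratic loss, ℓ0 prox and parameter values below are illustrative assumptions, not the exact experimental data of this section.

```python
import numpy as np

def mifb(gradF, proxR, L, x0, a, n_iter=1000, gamma=None):
    # MiFB (Algorithm 1) with s = len(a) steps and constant symmetric
    # parameters b_{i,k} = a_{i,k} = a[i]; fixed step-size gamma.
    s = len(a)
    gamma = 0.3 / L if gamma is None else gamma
    hist = [x0.copy() for _ in range(s + 1)]        # [x_k, x_{k-1}, ..., x_{k-s}]
    for _ in range(n_iter):
        y = hist[0] + sum(a[i] * (hist[i] - hist[i + 1]) for i in range(s))
        x_new = proxR(y - gamma * gradF(y), gamma)  # forward then backward step
        hist = [x_new] + hist[:-1]
    return hist[0]

# Illustrative use on the sparse regression problem (1.1).
rng = np.random.default_rng(0)
A = rng.standard_normal((48, 128))
x_true = np.zeros(128); x_true[:8] = rng.standard_normal(8)
y = A @ x_true
mu, L = 0.1, np.linalg.norm(A, 2) ** 2
gradF = lambda x: A.T @ (A @ x - y)
proxR = lambda z, g: np.where(np.abs(z) > np.sqrt(2 * g * mu), z, 0.0)
x2 = mifb(gradF, proxR, L, np.zeros(128), a=[0.3, 0.1])  # a "2-iFB"-style run
```

With a = [a0] this reduces to a 1-step inertial scheme, and with a = [] to plain FB, so the three tested settings differ only in the memory vector a.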
See the magenta lines in all the examples, where the solid line corresponds to the observed profile for the optimal inertial parameters, the dashed line stands for the rate 1 − √(1 − τγ), and the dotted line is that of 1 − (1 − τγ)^{1/3}, which shows indeed that a faster linear rate can be obtained owing to multiple inertial parameters. We refer to [20, Section 4.5] for the optimal choice of inertial parameters in the case s = 1.\nThe empirical bound (2.4) and inertial steps s We now present a short comparison of the empirical bound (2.4) on the inertial parameters and of different choices of s, under the bigger choice γ = 0.8/L. MiFB with 3 inertial steps, i.e. s = 3, is added, denoted “3-iFB”; see the magenta line in Figure 2. Similar to the above experiments, we choose b_{i,k} = a_{i,k} ≡ ai, i ∈ I; “Thm 2.2” means that the ai's are chosen according to Theorem 2.2, while “Bnd (2.4)” means that the ai's are chosen based on the empirical bound (2.4). We can infer from Figure 2 the following. Compared to the results in Figure 1, a bigger choice of γ leads to faster convergence. Yet still, under the same choice of γ, MiFB is faster than FB both locally and globally. For either “Thm 2.2” or “Bnd (2.4)”, the performances of the three MiFB schemes are close; this is mainly due to the fact that the values of the sum ∑_{i∈I} ai for each scheme are close. Between “Thm 2.2” and “Bnd (2.4)”, “Bnd (2.4)” shows faster convergence, since the allowed value of ∑_{i∈I} ai in (2.4) is bigger than that of Theorem 2.2. It should be noted that, when γ ∈ ]0, 2/(3L)], the largest value of ∑_{i∈I} ai allowed by (2.4) is 1. If we choose ∑_{i∈I} ai equal or very close to 1, then it can be observed in practice that MiFB locally oscillates, which is a well-known property of the FISTA scheme [5, 10]. We refer to [20, Section 4.4] for discussions of the properties of such oscillation behaviour.\n\n(a) Sparse regression (b) PCP (c) Sparse SVM\n\nFigure 2: Comparison of MiFB under different inertial settings. We fix γk ≡ 0.8/L for all tests. For the three inertial schemes, the inertial parameters were chosen such that (2.3) holds.\n\nAcknowledgments\n\nThis work was partly supported by the European Research Council (ERC project SIGMA-Vision).\n\nReferences\n[1] F. Alvarez. On the minimizing property of a second order dissipative system in Hilbert spaces. SIAM Journal on Control and Optimization, 38(4):1102–1119, 2000.\n[2] F. Alvarez and H. Attouch. An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping. Set-Valued Analysis, 9(1-2):3–11, 2001.\n[3] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, Forward–Backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1-2):91–129, 2013.\n[4] H. Attouch, J. Peypouquet, and P. Redont. A dynamical approach to an inertial Forward–Backward algorithm for convex minimization. SIAM J. Optim., 24(1):232–256, 2014.\n[5] A. 
Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[6] J. Bolte, A. Daniilidis, and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.

[7] J. Bolte, A. Daniilidis, O. Ley, and L. Mazet. Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Transactions of the American Mathematical Society, 362(6):3319–3363, 2010.

[8] R. I. Boţ, E. R. Csetnek, and S. C. László. An inertial Forward–Backward algorithm for the minimization of the sum of two nonconvex functions. EURO Journal on Computational Optimization, pages 1–23, 2014.

[9] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.

[10] A. Chambolle and C. Dossal. On the convergence of the iterates of the "Fast Iterative Shrinkage/Thresholding Algorithm". Journal of Optimization Theory and Applications, pages 1–15, 2015.

[11] D. L. Donoho, M. Elad, and V. N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.

[12] D. Drusvyatskiy and A. S. Lewis. Optimality, identifiability, and sensitivity. Mathematical Programming, pages 1–32, 2013.

[13] P. Frankel, G. Garrigos, and J. Peypouquet. Splitting methods with variable metric for Kurdyka–Łojasiewicz functions and general convergence rates. Journal of Optimization Theory and Applications, 165(3):874–900, 2015.

[14] H. Y. Le. Generalized subdifferentials of the rank function. Optimization Letters, 7(4):731–743, 2013.

[15] A. S. Lewis. Active sets, nonsmoothness, and sensitivity. SIAM Journal on Optimization, 13(3):702–725, 2003.

[16] A. S. Lewis and J. Malick. Alternating projections on manifolds. Mathematics of Operations Research, 33(1):216–234, 2008.

[17] A. S. Lewis and H. S. Sendov. Twice differentiable spectral functions. SIAM Journal on Matrix Analysis and Applications, 23(2):368–386, 2001.

[18] A. S. Lewis and S. Zhang. Partial smoothness, tilt stability, and generalized Hessians. SIAM Journal on Optimization, 23(1):74–94, 2013.

[19] J. Liang, J. Fadili, and G. Peyré. Local linear convergence of Forward–Backward under partial smoothness. In Advances in Neural Information Processing Systems, pages 1970–1978, 2014.

[20] J. Liang, J. Fadili, and G. Peyré. Activity identification and local linear convergence of Forward–Backward-type methods. arXiv:1503.03703, 2015.

[21] D. A. Lorenz and T. Pock. An inertial Forward–Backward algorithm for monotone inclusions. Journal of Mathematical Imaging and Vision, 51(2):311–325, 2014.

[22] A. Moudafi and M. Oliny. Convergence of a splitting inertial proximal method for monotone operators. Journal of Computational and Applied Mathematics, 155(2):447–454, 2003.

[23] P. Ochs, Y. Chen, T. Brox, and T. Pock. iPiano: inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences, 7(2):1388–1419, 2014.

[24] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[25] R. T. Rockafellar and R. Wets. Variational analysis, volume 317. Springer Verlag, 1998.

[26] L. van den Dries. Tame topology and o-minimal structures, volume 248 of London Mathematical Society Lecture Note Series.
Cambridge University Press, New York, 1998.