{"title": "Semi-Proximal Mirror-Prox for Nonsmooth Composite Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 3411, "page_last": 3419, "abstract": "We propose a new first-order optimization algorithm to solve high-dimensional non-smooth composite minimization problems. Typical examples of such problems have an objective that decomposes into a non-smooth empirical risk part and a non-smooth regularization penalty. The proposed algorithm, called Semi-Proximal Mirror-Prox, leverages the saddle point representation of one part of the objective while handling the other part of the objective via linear minimization over the domain. The algorithm stands in contrast with more classical proximal gradient algorithms with smoothing, which require the computation of proximal operators at each iteration and can therefore be impractical for high-dimensional problems. We establish the theoretical convergence rate of Semi-Proximal Mirror-Prox, which exhibits the optimal complexity bounds for the number of calls to linear minimization oracle. We present promising experimental results showing the interest of the approach in comparison to competing methods.", "full_text": "Semi-Proximal Mirror-Prox\n\nfor Nonsmooth Composite Minimization\n\nNiao He\n\nGeorgia Institute of Technology\n\nZaid Harchaoui\n\nNYU, Inria\n\nnhe6@gatech.edu\n\nfirstname.lastname@nyu.edu\n\nAbstract\n\nWe propose a new \ufb01rst-order optimization algorithm to solve high-dimensional\nnon-smooth composite minimization problems. Typical examples of such prob-\nlems have an objective that decomposes into a non-smooth empirical risk part\nand a non-smooth regularization penalty. The proposed algorithm, called Semi-\nProximal Mirror-Prox, leverages the saddle point representation of one part of the\nobjective while handling the other part of the objective via linear minimization\nover the domain. 
The algorithm stands in contrast with more classical proximal gradient algorithms with smoothing, which require the computation of proximal operators at each iteration and can therefore be impractical for high-dimensional problems. We establish the theoretical convergence rate of Semi-Proximal Mirror-Prox, which exhibits the optimal complexity bound, i.e. O(1/ε²), for the number of calls to the linear minimization oracle. We present promising experimental results showing the interest of the approach in comparison to competing methods.

1 Introduction

A wide range of machine learning and signal processing problems can be formulated as the minimization of a composite objective:

min_{x∈X} F(x) := f(x) + ‖Bx‖    (1)

where X is closed and convex, and f is convex and either smooth, or nonsmooth yet enjoying a particular structure. The term ‖Bx‖ defines a regularization penalty through a norm ‖·‖ and a linear mapping x ↦ Bx on the closed convex set X.

In many situations, the objective function F of interest enjoys a favorable structure, namely a so-called saddle point representation [6, 11, 13]:

f(x) = max_{z∈Z} {⟨x, Az⟩ − ψ(z)}    (2)

where Z is a convex compact subset of a Euclidean space, and ψ(·) is a convex function. Sec. 4 will give several examples of such situations. Saddle point representations can then be leveraged to use first-order optimization algorithms.

A first, simple option to minimize F is to use the Nesterov smoothing technique [19] along with a proximal gradient algorithm [23], assuming that the proximal operator associated with X is computationally tractable and cheap to compute. 
However, this is certainly not the case when considering problems with norms acting in the spectral domain of high-dimensional matrices, such as the matrix nuclear-norm [12] and structured extensions thereof [5, 2]. In the latter situation, another option is to use a smoothing technique, now with a conditional gradient or Frank-Wolfe algorithm to minimize F, assuming that a linear minimization oracle associated with X is cheaper to compute than the proximal operator [6, 14, 24]. Neither option takes advantage of the composite structure of the objective (1) or handles the case when the linear mapping B is nontrivial.

Contributions Our goal is to propose a new first-order optimization algorithm, called Semi-Proximal Mirror-Prox, designed to solve the difficult non-smooth composite optimization problem (1) without requiring the exact computation of proximal operators. Instead, Semi-Proximal Mirror-Prox relies upon: i) saddle point representability of f (a less restrictive assumption than a Fenchel-type representation); ii) a linear minimization oracle associated with ‖·‖ on the domain X. While the saddle point representability of f allows to cure the non-smoothness of f, the linear minimization over the domain X allows to tackle the non-smooth regularization penalty ‖·‖. We establish the theoretical convergence rate of Semi-Proximal Mirror-Prox, which exhibits the optimal complexity bound, i.e. O(1/ε²), for the number of calls to the linear minimization oracle. Furthermore, Semi-Proximal Mirror-Prox generalizes previously proposed approaches and improves upon them in special cases:

1. Case B ≡ 0: Semi-Proximal Mirror-Prox does not require assumptions on favorable geometry of the dual domain Z or simplicity of ψ(·) in (2).

2. Case B = I: Semi-Proximal Mirror-Prox is competitive with previously proposed approaches [15, 24] based on smoothing techniques.

3. 
Case of non-trivial B: Semi-Proximal Mirror-Prox is the first proximal-free or conditional-gradient-type optimization algorithm for (1).

Related work The Semi-Proximal Mirror-Prox algorithm belongs to the family of conditional gradient algorithms, whose most basic instance is the Frank-Wolfe algorithm for constrained smooth optimization using a linear minimization oracle; see [12, 1, 4]. Recently, in [6, 13], the authors consider constrained non-smooth optimization when the domain Z has a "favorable geometry", i.e. the domain is amenable to proximal setups, and establish a complexity bound of O(1/ε²) calls to the linear minimization oracle. Recently, in [15], a method called conditional gradient sliding was proposed to solve similar problems using a smoothing technique, with a complexity bound of O(1/ε²) for the calls to the linear minimization oracle (LMO) and additionally an O(1/ε) bound for the linear operator evaluations. This O(1/ε²) bound on the LMO complexity can in fact be shown to be optimal for conditional-gradient-type or LMO-based algorithms when solving general1 non-smooth convex problems [14].

However, these previous approaches are appropriate for objectives with a non-composite structure. When applied to our problem (1), the smoothing would be applied to the objective taken as a whole, ignoring its composite structure. Conditional-gradient-type algorithms were recently proposed for composite objectives [7, 9, 26, 24, 16], but cannot be applied to our problem. In [9], f is smooth and B is the identity matrix, whereas in [24], f is non-smooth and B is also the identity matrix. The proposed Semi-Proximal Mirror-Prox can be seen as a blend of the successful components resp. 
of the Composite Conditional Gradient algorithm [9] and the Composite Mirror-Prox [11]; it enjoys the optimal complexity bound O(1/ε²) on the total number of LMO calls, yet solves a broader class of convex problems than previously considered.

2 Framework and assumptions

We present here our theoretical framework, which hinges upon a smooth convex-concave saddle point reformulation of the norm-regularized non-smooth minimization (3). We shall use the following notations throughout the paper. For a given norm ‖·‖, we define the dual norm as ‖s‖* = max_{‖x‖≤1} ⟨s, x⟩. For any x ∈ R^{m×n}, ‖x‖₂ = ‖x‖_F = (Σ_{i=1}^m Σ_{j=1}^n |x_ij|²)^{1/2}.

Problem We consider the composite minimization problem

Opt = min_{x∈X} f(x) + ‖Bx‖    (3)

where X is a closed convex set in the Euclidean space Ex; x ↦ Bx is a linear mapping from X to Y (⊃ BX), where Y is a closed convex set in the Euclidean space Ey. We make two important assumptions on the function f and the norm ‖·‖ defining the regularization penalty, explained below.

1Related research extended such approaches to stochastic or online settings [10, 8, 15]; such settings are beyond the scope of this work.

Saddle Point Representation The non-smoothness of f can be challenging to tackle. However, in many cases of interest, the function f enjoys a favorable structure that allows to tackle it with smoothing techniques. We assume that f(x) is a non-smooth convex function given by

f(x) = max_{z∈Z} Φ(x, z)    (4)

where Φ(x, z) is a smooth convex-concave function and Z is a convex and compact set in the Euclidean space Ez. Such representations were introduced and developed in [6, 11, 13] for the purpose of non-smooth optimization. Saddle point representability can be interpreted as a general form of the smoothing-favorable structure of non-smooth functions used in the Nesterov smoothing technique [19]. 
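For concreteness, consider the ℓ1 empirical risk f(x) = ‖Ax − b‖₁, as used in the robust collaborative filtering experiments of Sec. 4: it admits representation (4) with the bilinear Φ(x, z) = ⟨z, Ax − b⟩ and Z = {z : ‖z‖∞ ≤ 1}. The following small numerical check is our illustration, not code from the paper:

```python
import numpy as np

def f_direct(A, b, x):
    """Nonsmooth l1 risk f(x) = ||Ax - b||_1, evaluated directly."""
    return float(np.abs(A @ x - b).sum())

def f_saddle(A, b, x):
    """Same value through the saddle representation
    f(x) = max_{||z||_inf <= 1} <z, Ax - b>:
    the inner maximum over the box is attained at z = sign(Ax - b)."""
    r = A @ x - b
    z = np.sign(r)          # maximizer over Z = {z : ||z||_inf <= 1}
    return float(z @ r)
```

Here Φ is smooth (indeed bilinear) in (x, z), which is exactly the structure that representation (4) requires.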
Representations of this type are readily available for a wide family of "well-structured" nonsmooth functions f (see Sec. 4 for examples), and, to the best of our knowledge, for all empirical risk functions with convex loss in machine learning.

Composite Linear Minimization Oracle Proximal-gradient-type algorithms require the computation of a proximal operator at each iteration, i.e. min_{y∈Y} {(1/2)‖y‖₂² + ⟨η, y⟩ + α‖y‖}. For several cases of interest, described below, the computation of the proximal operator can be expensive or intractable. A classical example is the nuclear norm, whose proximal operator boils down to singular value thresholding, therefore requiring a full singular value decomposition. In contrast, the linear minimization oracle can be much cheaper. The linear minimization oracle (LMO) is a routine which, given an input α > 0 and η ∈ Ey, returns a point

LMO(η, α) := argmin_{y∈Y} {⟨η, y⟩ + α‖y‖}    (5)

In the case of the nuclear norm, the LMO only requires the computation of the leading pair of singular vectors, which is an order of magnitude faster.

Saddle Point Reformulation. The crux of our approach is a smooth convex-concave saddle point reformulation of (3). From this reformulation, we consider the associated variational inequality, which provides the necessary and sufficient condition for an optimal solution to the saddle point problem [3, 4]. For any optimization problem with convex structure (including convex minimization, convex-concave saddle point problems, and convex Nash equilibria), the corresponding variational inequality is directly related to the accuracy certificate used to guarantee the accuracy of a solution to the optimization problem; see Sec. 2.1 in [11] and [18]. 
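To make the cost comparison between the two oracles concrete, here is a minimal sketch (ours; the ball radius R, the power iteration, and its iteration count are illustrative choices, not part of the paper) contrasting the LMO over a nuclear-norm ball, which needs only the leading singular pair, with the corresponding proximal operator, which needs a full SVD:

```python
import numpy as np

def lmo_nuclear_ball(eta, R, n_iter=50):
    """LMO over the nuclear-norm ball {y : ||y||_nuc <= R}:
    argmin_y <eta, y> = -R * u1 v1^T, where (u1, v1) is the leading
    singular pair of eta, here approximated by power iteration."""
    v = np.random.default_rng(0).standard_normal(eta.shape[1])
    for _ in range(n_iter):                # power iteration on eta^T eta
        v = eta.T @ (eta @ v)
        v /= np.linalg.norm(v)
    u = eta @ v
    u /= np.linalg.norm(u)
    return -R * np.outer(u, v)

def prox_nuclear(eta, alpha):
    """Proximal operator of alpha * ||.||_nuc: singular value thresholding,
    which requires a *full* SVD -- the expensive step the LMO avoids."""
    U, s, Vt = np.linalg.svd(eta, full_matrices=False)
    return U @ np.diag(np.maximum(s - alpha, 0.0)) @ Vt
```

The LMO output is rank one by construction, which is also what makes conditional-gradient-type methods attractive for low-rank problems.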
We shall then present an algorithm to solve the variational inequality established below, which exploits its particular structure.

Assuming that f admits a saddle point representation (4), we write (3) in epigraph form

Opt = min_{x∈X, y∈Y, τ≥‖y‖} max_{z∈Z} {Φ(x, z) + τ : y = Bx},

where Y (⊃ BX) is a convex set. We can approximate Opt by

Ôpt = min_{x∈X, y∈Y, τ≥‖y‖} max_{z∈Z, ‖w‖₂≤1} {Φ(x, z) + τ + ρ⟨y − Bx, w⟩}.    (6)

For properly selected ρ > 0, one has Ôpt = Opt (see details in [11]). By introducing the variables u := [x, y; z, w] and v := τ, the variational inequality associated with the above saddle point problem is fully described by the domain

X⁺ = {x⁺ = [u; v] : x ∈ X, y ∈ Y, z ∈ Z, ‖w‖₂ ≤ 1, τ ≥ ‖y‖}

and the monotone vector field

F(x⁺ = [u; v]) = [Fu(u); Fv], where Fu(u = [x; y; z; w]) = [∇ₓΦ(x, z) − ρBᵀw; ρw; −∇_zΦ(x, z); ρ(Bx − y)], Fv(v = τ) = 1.

In the next section, we present an efficient algorithm to solve this type of variational inequality, which enjoys a particular structure; we call such an inequality semi-structured.

3 Semi-Proximal Mirror-Prox for Semi-structured Variational Inequalities

Semi-structured variational inequalities (Semi-VI) enjoy a particular mixed structure that allows to get the best of two worlds, namely the proximal setup (where the proximal operator can be computed) and the LMO setup (where the linear minimization oracle can be computed). Basically, the domain X decomposes as a Cartesian product of two sets, X = X1 × X2, such that X1 admits a prox-mapping while X2 admits a linear minimization oracle. 
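To fix ideas, one natural way to instantiate this split for the reformulation (6) with a nuclear-norm penalty (our informal reading, not notation fixed by the paper) is to collect the variables with cheap projections in the prox-friendly block and leave the epigraph of the norm to the LMO:

```latex
% Illustrative semi-proximal split of the domain X^+ of (6), nuclear-norm case.
% X_1: block with tractable prox-mappings; X_2: block served by the composite LMO.
X_1 = \bigl\{ (x, z, w) \;:\; x \in X,\ z \in Z,\ \|w\|_2 \le 1 \bigr\},
\qquad
X_2 = \bigl\{ (y, \tau) \;:\; y \in Y,\ \tau \ge \|y\|_{\mathrm{nuc}} \bigr\}.
```

Minimizing a linear form over X2 is then precisely a call of the kind LMO(η, α) in (5), the step that the conditional gradient component is designed to handle.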
We now describe the main theoretical and algorithmic components of the Semi-Proximal Mirror-Prox algorithm, resp. in Sec. 3.1 and Sec. 3.2, and finally describe the overall algorithm in Sec. 3.3.

3.1 Composite Mirror-Prox with Inexact Prox-mappings

We first present a new algorithm, which can be seen as an extension of the Composite Mirror Prox algorithm, denoted CMP for brevity, that allows inexact computation of prox-mappings and can solve a broad class of variational inequalities. The original Mirror Prox algorithm was introduced in [17] and was extended to composite settings in [11], assuming exact computation of prox-mappings.

Structured Variational Inequalities. We consider the variational inequality VI(X, F):

Find x* ∈ X : ⟨F(x), x − x*⟩ ≥ 0, ∀x ∈ X

with domain X and operator F that satisfy the assumptions (A.1)–(A.4) below.

(A.1) The set X ⊂ Eu × Ev is closed convex and its projection PX = {u : x = [u; v] ∈ X} ⊂ U, where U is convex and closed, and Eu, Ev are Euclidean spaces;

(A.2) The function ω(·) : U → R is continuously differentiable and 1-strongly convex w.r.t. some norm2 ‖·‖. This defines the Bregman distance V_u(u′) = ω(u′) − ω(u) − ⟨ω′(u), u′ − u⟩ ≥ (1/2)‖u′ − u‖²;

(A.3) The operator F(x = [u; v]) : X → Eu × Ev is monotone and of the form F(u, v) = [Fu(u); Fv] with Fv ∈ Ev a constant and Fu(u) ∈ Eu satisfying the condition

‖Fu(u) − Fu(u′)‖* ≤ L‖u − u′‖ + M, ∀u, u′ ∈ U

for some L < ∞, M < ∞;

(A.4) The linear form ⟨Fv, v⟩ of [u; v] ∈ Eu × Ev is bounded from below on X and is coercive on X w.r.t. v: whenever [ut; vt] ∈ X, t = 1, 2, . . . is a sequence such that {ut} is bounded and ‖vt‖₂ → ∞ as t → ∞, we have ⟨Fv, vt⟩ → ∞ as t → ∞.

The quality of an iterate, in the course of the algorithm, is measured through the so-called dual gap function

εVI(x | X, F) = sup_{y∈X} ⟨F(y), x − y⟩.

We give in Appendix A a refresher on dual gap functions, for the reader's convenience. We shall establish the complexity bounds of our algorithm in terms of this dual gap function, which directly provides an accuracy certificate along the iterations. However, we first need to define what we mean by an inexact prox-mapping.

ε-Prox-mapping Inexact proximal mappings were recently considered in the context of accelerated proximal gradient algorithms [25]. The definition we give below is more general, allowing for non-Euclidean prox-mappings. For ε ≥ 0, ξ = [η; ζ] ∈ Eu × Ev and x = [u; v] ∈ X, we define the subset P^ε_x(ξ) of X as

P^ε_x(ξ) = {x̂ = [û; v̂] ∈ X : ⟨η + ω′(û) − ω′(u), û − s⟩ + ⟨ζ, v̂ − w⟩ ≤ ε, ∀[s; w] ∈ X}.

When ε = 0, this reduces to the exact prox-mapping in the usual setting, that is,

P_x(ξ) = Argmin_{[s;w]∈X} {⟨η, s⟩ + ⟨ζ, w⟩ + V_u(s)}.

When ε > 0, this yields our definition of an inexact prox-mapping, with inexactness parameter ε. Note that for any ε ≥ 0, the set P^ε_x(ξ = [η; γFv]) is well defined whenever γ > 0.

2There is a slight abuse of notation here. This norm is not the same as the one in problem (3). 
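In the Euclidean case ω(u) = (1/2)‖u‖² with no v-component, membership in the ε-prox set can be tested by evaluating the supremum in its definition. A small sketch (ours), for X the unit Euclidean ball, where that supremum has a closed form:

```python
import numpy as np

def exact_prox(x, eta):
    """Exact Euclidean prox-mapping on the unit ball:
    argmin_{||s|| <= 1} <eta, s> + (1/2)||s - x||^2  =  proj(x - eta)."""
    p = x - eta
    return p / max(1.0, np.linalg.norm(p))

def prox_violation(x, eta, xhat):
    """sup_{||s|| <= 1} <eta + xhat - x, xhat - s>: the point xhat lies in
    the eps-prox set iff this quantity is <= eps.  Over the unit ball the
    supremum is attained in closed form at s = -g/||g||."""
    g = eta + xhat - x
    return float(g @ xhat + np.linalg.norm(g))
```

An exact prox-mapping gives a (numerically) zero violation, while any other feasible point gives a strictly positive one, so an inner conditional-gradient loop can stop as soon as this certificate drops below εt.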
The Composite Mirror Prox algorithm with inexact prox-mappings is outlined in Algorithm 1.

Algorithm 1 Composite Mirror Prox Algorithm (CMP) for VI(X, F)

Input: stepsizes γt > 0, inexactness εt ≥ 0, t = 1, 2, . . .
Initialize x1 = [u1; v1] ∈ X
for t = 1, 2, . . . , T do
    yt := [ût; v̂t] ∈ P^{εt}_{xt}(γt F(xt)) = P^{εt}_{xt}(γt [Fu(ut); Fv])
    xt+1 := [ut+1; vt+1] ∈ P^{εt}_{xt}(γt F(yt)) = P^{εt}_{xt}(γt [Fu(ût); Fv])    (7)
end for
Output: x̄T := [ūT; v̄T] = (Σ_{t=1}^T γt)⁻¹ Σ_{t=1}^T γt yt

The proposed algorithm is a non-trivial extension of the Composite Mirror Prox with exact prox-mappings, from both theoretical and algorithmic points of view. We establish below the theoretical convergence rate; see Appendix B for the proof.

Theorem 3.1. Assume that the sequence of step-sizes (γt) in the CMP algorithm satisfies

σt := γt⟨Fu(ût) − Fu(ut), ût − ut+1⟩ − V_{ût}(ut+1) − V_{ut}(ût) ≤ γt²M², t = 1, 2, . . . , T.    (8)

Then, denoting Θ[X] = sup_{[u;v]∈X} V_{u1}(u), for a sequence of inexact prox-mappings with inexactness εt ≥ 0, we have

εVI(x̄T | X, F) := sup_{x∈X} ⟨F(x), x̄T − x⟩ ≤ (Θ[X] + M² Σ_{t=1}^T γt² + 2 Σ_{t=1}^T εt) / (Σ_{t=1}^T γt).    (9)

Remarks. Note that the assumption on the sequence of step-sizes (γt) is clearly satisfied when γt ≤ (√2 L)⁻¹. When M = 0 (which is essentially the case for the problem described in Section 2), it suffices that γt ≤ L⁻¹. When (εt) is summable, we achieve the same O(1/T) convergence rate as when there is no error. If (εt) decays at a rate O(1/t), then the overall convergence is only affected by a log(T) factor. 
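To convey the mechanics of Algorithm 1 (an extrapolation step, an update step taken at the extrapolated point, and weighted averaging of the extrapolated points), here is a minimal Euclidean instance on a toy bilinear saddle-point problem with exact prox-mappings and no v-component. It is a sketch of the classical scheme underlying CMP, not the full semi-proximal method:

```python
import numpy as np

def mirror_prox_bilinear(A, T=2000):
    """Euclidean Mirror-Prox for min_{||x||<=1} max_{||z||<=1} <x, Az>,
    with operator F(x, z) = [Az; -A^T x] and stepsize gamma = 1/L, L = ||A||_2."""
    proj = lambda p: p / max(1.0, np.linalg.norm(p))  # projection on unit ball
    x, z = np.zeros(A.shape[0]), np.zeros(A.shape[1])
    gamma = 1.0 / np.linalg.svd(A, compute_uv=False)[0]
    xs, zs = [], []
    for _ in range(T):
        xh = proj(x - gamma * (A @ z))      # extrapolation (half) step
        zh = proj(z + gamma * (A.T @ x))
        x = proj(x - gamma * (A @ zh))      # update step at the half point
        z = proj(z + gamma * (A.T @ xh))
        xs.append(xh); zs.append(zh)
    return np.mean(xs, axis=0), np.mean(zs, axis=0)   # ergodic averages
```

For this toy problem the duality gap of the averages (x̄, z̄) is ‖Aᵀx̄‖₂ + ‖Az̄‖₂, and it decays at the O(1/T) rate that Theorem 3.1 recovers when εt = 0.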
Convergence results on the sequence of projections of x̄T onto X1, when F stems from a saddle point problem min_{x1∈X1} sup_{x2∈X2} Φ(x1, x2), are established in Appendix B.

The theoretical convergence rate established in Theorem 3.1 and Corollary B.1 generalizes the previous result established in Corollary 3.1 of [11] for CMP with exact prox-mappings. Indeed, when exact prox-mappings are used, we recover the result of [11]. When inexact prox-mappings are used, the errors due to the inexactness of the prox-mappings accumulate, as reflected in (9) and (37).

3.2 Composite Conditional Gradient

We now turn to a variant of the composite conditional gradient algorithm, denoted CCG, tailored for a particular class of problems which we call smooth semi-linear problems. The composite conditional gradient algorithm was first introduced in [9] and further developed in [21]. We present an extension here which turns out to be well-suited for the sub-problems that will be solved in Sec. 3.3.

Minimizing Smooth Semi-linear Functions. We consider the smooth semi-linear problem

min_{x=[u;v]∈X} {φ⁺(u, v) = φ(u) + ⟨θ, v⟩}    (10)

represented by the pair (X; φ⁺) such that the following assumptions are satisfied. 
We assume that:

i) X ⊂ Eu × Ev is closed convex and its projection PX on Eu belongs to U, where U is convex and compact;

ii) φ(u) : U → R is a convex continuously differentiable function, and there exist 1 < κ ≤ 2 and L0 < ∞ such that

φ(u′) ≤ φ(u) + ⟨∇φ(u), u′ − u⟩ + (L0/κ)‖u′ − u‖^κ, ∀u, u′ ∈ U;    (11)

iii) θ ∈ Ev is such that every linear function on Eu × Ev of the form

[u; v] ↦ ⟨η, u⟩ + ⟨θ, v⟩    (12)

with η ∈ Eu attains its minimum on X at some point x[η] = [u[η]; v[η]]; we have at our disposal a Composite Linear Minimization Oracle (LMO) which, given as input η ∈ Eu, returns x[η].

Algorithm 2 Composite Conditional Gradient Algorithm CCG(X, φ(·), θ; ε)

Input: accuracy ε > 0 and γt = 2/(t + 1), t = 1, 2, . . .
Initialize x1 = [u1; v1] ∈ X
for t = 1, 2, . . . do
    Compute δt = ⟨gt, ut − ut[gt]⟩ + ⟨θ, vt − vt[gt]⟩, where gt = ∇φ(ut);
    if δt ≤ ε then
        Return xt = [ut; vt]
    else
        Find xt+1 = [ut+1; vt+1] ∈ X such that φ⁺(xt+1) ≤ φ⁺(xt + γt(xt[gt] − xt))
    end if
end for

The algorithm is outlined in Algorithm 2. Note that CCG works essentially as if there were no v-component at all. The CCG algorithm enjoys a convergence rate of O(t^{−(κ−1)}) in the evaluations of the function φ⁺, and the accuracy certificates (δt) enjoy the same rate O(t^{−(κ−1)}) as well.

Proposition 3.1. Denote by D the ‖·‖-diameter of U. 
When solving problems of type (10), the sequence of iterates (xt) of CCG satisfies

φ⁺(xt) − min_{x∈X} φ⁺(x) ≤ (2L0D^κ)/(κ(3 − κ)) · (2/(t + 1))^{κ−1}, t ≥ 2.    (13)

In addition, the accuracy certificates (δt) satisfy

min_{1≤s≤t} δs ≤ O(1) L0 D^κ (2/(t + 1))^{κ−1}, t ≥ 2.    (14)

3.3 Semi-Proximal Mirror-Prox for Semi-structured Variational Inequalities

We now give the full description of a special class of variational inequalities, called semi-structured variational inequalities. This family of problems encompasses both cases that we discussed so far in Sections 3.1 and 3.2. Most importantly, it also covers many other problems that do not fall into these two regimes, and in particular our essential problem of interest (3).

Semi-structured Variational Inequalities. The class of semi-structured variational inequalities goes beyond Assumptions (A.1)−(A.4) by assuming more structure. This structure is consistent with what we call a semi-proximal setup, which encompasses both the regular proximal setup and the regular linear minimization setup as special cases. Indeed, we consider variational inequalities VI(X, F) that satisfy, in addition to Assumptions (A.1)−(A.4), the following assumptions:

(S.1) Proximal setup for X: we assume that Eu = Eu1 × Eu2, Ev = Ev1 × Ev2, and U ⊂ U1 × U2, X = X1 × X2 with Xi ⊂ Eui × Evi and PiX = {ui : [ui; vi] ∈ Xi} ⊂ Ui for i = 1, 2, where U1 is convex and closed and U2 is convex and compact. We also assume that ω(u) = ω1(u1) + ω2(u2) and ‖u‖ = ‖u1‖_{Eu1} + ‖u2‖_{Eu2}, with ω2(·) : U2 → R continuously differentiable and such that

ω2(u2′) ≤ ω2(u2) + ⟨∇ω2(u2), u2′ − u2⟩ + (L0/κ)‖u2′ − u2‖^κ_{Eu2}, ∀u2, u2′ ∈ U2,

for a particular 1 < κ ≤ 2 and L0 < ∞. Furthermore, we assume that the ‖·‖_{Eu2}-diameter of U2 is bounded by some D > 0.

(S.2) Partition of F: the operator F induced by the above partition of X1 and X2 can be written as F(x) = [Fu(u); Fv] with Fu(u) = [Fu1(u1, u2); Fu2(u1, u2)] and Fv = [Fv1; Fv2].

(S.3) Proximal mapping on X1: we assume that for any η1 ∈ Eu1 and α > 0, we have at our disposal easy-to-compute prox-mappings of the form

Prox_{ω1}(η1, α) := argmin_{x1=[u1;v1]∈X1} {ω1(u1) + ⟨η1, u1⟩ + α⟨Fv1, v1⟩}.

(S.4) Linear minimization oracle for X2: we assume that we have at our disposal a Composite Linear Minimization Oracle (LMO) which, given any input η2 ∈ Eu2 and α > 0, returns an optimal solution to the minimization problem with linear form, that is,

LMO(η2, α) := argmin_{x2=[u2;v2]∈X2} {⟨η2, u2⟩ + α⟨Fv2, v2⟩}.

Semi-proximal setup We refer to such problems as Semi-VI(X, F). On the one hand, when U2 is a singleton, we recover the full proximal setup. On the other hand, when U1 is a singleton, we recover the full linear-minimization-oracle setup (full LMO setup). The semi-proximal setup covers both, and all the settings in between.

The Semi-Proximal Mirror-Prox algorithm. We finally present our main contribution, the Semi-Proximal Mirror-Prox algorithm, which solves semi-structured variational inequalities under (A.1)−(A.4) and (S.1)−(S.4). The Semi-Proximal Mirror-Prox algorithm blends both CMP and CCG. Basically, for the sub-domain X2 given by an LMO, instead of computing the prox-mapping exactly, we mimic it inexactly via a conditional gradient routine inside the Composite Mirror Prox algorithm. 
For the sub-domain X1, we compute the prox-mapping exactly.

Algorithm 3 Semi-Proximal Mirror-Prox Algorithm for Semi-VI(X, F)

Input: stepsizes γt > 0, accuracies εt ≥ 0, t = 1, 2, . . .
[1] Initialize x1 = [x1_1; x1_2] ∈ X, where x1_1 = [u1_1; v1_1] and x1_2 = [u1_2; v1_2].
for t = 1, 2, . . . , T do
    [2] Compute yt = [yt_1; yt_2] with
        yt_1 := [ût_1; v̂t_1] = Prox_{ω1}(γt Fu1(ut_1, ut_2) − ω1′(ut_1), γt)
        yt_2 := [ût_2; v̂t_2] = CCG(X2, ω2(·) + ⟨γt Fu2(ut_1, ut_2) − ω2′(ut_2), ·⟩, γt Fv2; εt)
    [3] Compute xt+1 = [xt+1_1; xt+1_2] with
        xt+1_1 := [ut+1_1; vt+1_1] = Prox_{ω1}(γt Fu1(ût_1, ût_2) − ω1′(ut_1), γt)
        xt+1_2 := [ut+1_2; vt+1_2] = CCG(X2, ω2(·) + ⟨γt Fu2(ût_1, ût_2) − ω2′(ut_2), ·⟩, γt Fv2; εt)
end for
Output: x̄T := [ūT; v̄T] = (Σ_{t=1}^T γt)⁻¹ Σ_{t=1}^T γt yt

At step t, we first update yt_1 by computing the exact prox-mapping, and build yt_2 by running the composite conditional gradient algorithm on problem (10) with

X = X2, φ(·) = ω2(·) + ⟨γt Fu2(ut_1, ut_2) − ω2′(ut_2), ·⟩, and θ = γt Fv2,

until δ(yt_2) = max_{y2∈X2} ⟨∇φ⁺(yt_2), yt_2 − y2⟩ ≤ εt. We then build xt+1_1 = [ut+1_1; vt+1_1] and xt+1_2 = [ut+1_2; vt+1_2] similarly, except this time taking the value of the operator at the point yt. Combining the results in Theorem 3.1 and Proposition 3.1, we arrive at the following complexity bound.

Proposition 3.2. Under assumptions (A.1)−(A.4) and (S.1)−(S.4) with M = 0, and choice of stepsizes γt = L⁻¹, t = 1, . . . 
, T, for the outlined algorithm to return an ε-solution to the variational inequality VI(X, F), the total number of Mirror Prox steps required does not exceed

Total number of steps = O(1) L Θ[X] / ε,

and the total number of calls to the Linear Minimization Oracle does not exceed

N = O(1) ((L0 L^κ D^κ) / ε^κ)^{1/(κ−1)} Θ[X].

In particular, if we use the Euclidean proximal setup on U2 with ω2(·) = (1/2)‖x2‖², which leads to κ = 2 and L0 = 1, then the number of LMO calls does not exceed N = O(1) (L²D²(Θ[X1] + D²)) / ε².

[Figure 1 appears here: four panels plotting the objective value against elapsed time (panels a, b) and against the number of LMO calls (panels c, d), comparing Semi-MP, Smooth-CG, Semi-SPG, and Semi-LPADMM under various inner-accuracy settings.]

Figure 1: Robust collaborative filtering and link prediction: objective function vs elapsed time. From left to right: (a) MovieLens100K; (b) MovieLens1M; (c) Wikivote 
(1024); (d) Wikivote (full).

Discussion The proposed Semi-Proximal Mirror-Prox algorithm enjoys the optimal complexity bound, i.e. O(1/ε²), in the number of calls to the LMO; see [14] for the optimal complexity bounds for general non-smooth optimization with an LMO. Consequently, when applying the algorithm to the variational reformulation of the problem of interest (3), we obtain an ε-optimal solution within at most O(1/ε²) LMO calls. Thus, Semi-Proximal Mirror-Prox generalizes previously proposed approaches and improves upon them in special cases of problem (3); see Appendix D.2.

4 Experiments

We report the experimental results obtained with the proposed Semi-Proximal Mirror-Prox, denoted Semi-MP here, and competing algorithms. We consider two different applications: i) robust collaborative filtering for movie recommendation; ii) link prediction for social network analysis. For i), we compare to two competing approaches: a) smoothing conditional gradient proposed in [24] (denoted Smooth-CG); b) smoothing proximal gradient [20, 5] equipped with the semi-proximal setup (Semi-SPG). For ii), we compare to Semi-LPADMM, i.e. [22] equipped with the semi-proximal setup. Additional experiments and implementation details are given in Appendix E.

Robust collaborative filtering We consider the collaborative filtering problem, with a nuclear-norm regularization penalty and an ℓ1-loss function. We run the three algorithms above on the small and medium MovieLens datasets. The small-size dataset consists of 943 users and 1682 movies with about 100K ratings, while the medium-size dataset consists of 3952 users and 6040 movies with about 1M ratings. We follow [24] to set the regularization parameters. In Fig. 
1, we can see that Semi-MP clearly outperforms Smooth-CG, while it is competitive with Semi-SPG.

Link prediction We now consider the link prediction problem, where the objective consists of a hinge-loss for the empirical risk part and multiple regularization penalties, namely the ℓ1-norm and the nuclear-norm. For this example, applying Smooth-CG or Semi-SPG would require two smooth approximations, one for the hinge-loss term and one for the ℓ1-norm term. Therefore, we consider an alternative approach, Semi-LPADMM, where we apply the linearized preconditioned ADMM algorithm [22], solving the proximal mapping through conditional gradient routines. To the best of our knowledge, ADMM with early stopping has not been fully analyzed theoretically in the literature. However, intuitively, as long as the error is controlled sufficiently, such a variant of ADMM should converge.

We conduct experiments on a binary social graph dataset called Wikivote, which consists of 7118 nodes and 103747 edges. Since the computational cost of these two algorithms mainly comes from the LMO calls, we report below the performance in terms of the number of LMO calls. For the first set of experiments, we select the 1024 highest-degree users from Wikivote and run the two algorithms on this small dataset with different strategies for the inner LMO calls.

In Fig. 1, we observe that Semi-MP is less sensitive to the inner accuracies of the prox-mappings than the ADMM variant, which sometimes stops progressing if the prox-mappings of early iterations are not solved with sufficient accuracy. The results on the full dataset corroborate the fact that Semi-MP outperforms the semi-proximal variant of the ADMM algorithm.

Acknowledgments

The authors would like to thank A. Juditsky and A. Nemirovski for fruitful discussions. 
This work was supported by NSF Grant CMMI-1232623, LabEx Persyval-Lab (ANR-11-LABX-0025), project "Titan" (CNRS-Mastodons), project "Macaron" (ANR-14-CE23-0003-01), the MSR-Inria joint centre, and the Moore-Sloan Data Science Environment at NYU.

References

[1] Francis Bach. Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization, 2015.

[2] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

[3] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2011.

[4] D. P. Bertsekas. Convex Optimization Algorithms. Athena Scientific, 2015.

[5] Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, and Eric P. Xing. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719–752, 2012.

[6] Bruce Cox, Anatoli Juditsky, and Arkadi Nemirovski. Dual subgradient algorithms for large-scale nonsmooth learning problems. Mathematical Programming, pages 1–38, 2013.

[7] M. Dudik, Z. Harchaoui, and J. Malick. Lifted coordinate descent for learning with trace-norm regularization. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

[8] Dan Garber and Elad Hazan. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint arXiv:1301.4666, 2013.

[9] Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, pages 1–38, 2013.

[10] E. Hazan and S. Kale. Projection-free online learning. In ICML, 2012.

[11] Niao He, Anatoli Juditsky, and Arkadi Nemirovski.
Mirror prox algorithm for multi-term composite minimization and semi-separable problems. arXiv preprint arXiv:1311.1098, 2013.

[12] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, pages 427–435, 2013.

[13] Anatoli Juditsky and Arkadi Nemirovski. Solving variational inequalities with monotone operators on domains given by linear minimization oracles. arXiv preprint arXiv:1312.107, 2013.

[14] Guanghui Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv, 2013.

[15] Guanghui Lan and Yi Zhou. Conditional gradient sliding for convex optimization. arXiv, 2014.

[16] Cun Mu, Yuqian Zhang, John Wright, and Donald Goldfarb. Scalable robust matrix recovery: Frank-Wolfe meets proximal methods. arXiv preprint arXiv:1403.7588, 2014.

[17] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.

[18] Arkadi Nemirovski, Shmuel Onn, and Uriel G. Rothblum. Accuracy certificates for computational problems with convex structure. Mathematics of Operations Research, 35(1):52–78, 2010.

[19] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, 2005.

[20] Yurii Nesterov. Smoothing technique and its applications in semidefinite optimization. Mathematical Programming, 110(2):245–259, 2007.

[21] Yurii Nesterov. Complexity bounds for primal-dual methods minimizing the model of objective function. Technical report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2015.

[22] Yuyuan Ouyang, Yunmei Chen, Guanghui Lan, and Eduardo Pasiliao Jr. An accelerated linearized alternating direction method of multipliers, 2014.
http://arxiv.org/abs/1401.6607.

[23] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, pages 1–96, 2013.

[24] Federico Pierucci, Zaid Harchaoui, and Jérôme Malick. A smoothing approach for composite conditional gradient with nonsmooth loss. In Conférence d'Apprentissage Automatique – Actes CAP'14, 2014.

[25] Mark Schmidt, Nicolas Le Roux, and Francis R. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems (NIPS), 2011.

[26] X. Zhang, Y. Yu, and D. Schuurmans. Accelerated training for matrix-norm regularization: A boosting approach. In NIPS, 2012.