{"title": "Limited Memory Kelley's Method Converges for Composite Convex and Submodular Objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 4414, "page_last": 4424, "abstract": "The original simplicial method (OSM), a variant of the classic Kelley\u2019s cutting plane method, has been shown to converge to the minimizer of a composite convex and submodular objective, though no rate of convergence for this method was known. Moreover, OSM is required to solve subproblems in each iteration whose size grows linearly in the number of iterations. We propose a limited memory version of Kelley\u2019s method (L-KM) and of OSM that requires limited memory (at most n+ 1 constraints for an n-dimensional problem) independent of the iteration. We prove convergence for L-KM when the convex part of the objective g is strongly convex and show it converges linearly when g is also smooth. Our analysis relies on duality between minimization of the composite convex and submodular objective and minimization of a convex function over the submodular base polytope. We introduce a limited memory version, L-FCFW, of the Fully-Corrective Frank-Wolfe (FCFW) method with approximate correction, to solve the dual problem. We show that L-FCFW and L-KM are dual algorithms that produce the same sequence of iterates; hence both converge linearly (when g is smooth and strongly convex) and with limited memory. 
We propose L-KM to minimize composite convex and submodular objectives; however, our results on L-FCFW hold for general polytopes and may be of independent interest.", "full_text": "Limited memory Kelley\u2019s Method Converges for\nComposite Convex and Submodular Objectives\n\nSong Zhou\n\nCornell University\n\nsz557@cornell.edu\n\nSwati Gupta\n\nGeorgia Institute of Technology\n\nswatig@gatech.edu\n\nMadeleine Udell\nCornell University\n\nudell@cornell.edu\n\nAbstract\n\nThe original simplicial method (OSM), a variant of the classic Kelley\u2019s cutting\nplane method, has been shown to converge to the minimizer of a composite convex\nand submodular objective, though no rate of convergence for this method was\nknown. Moreover, OSM is required to solve subproblems in each iteration whose\nsize grows linearly in the number of iterations. We propose a limited memory\nversion of Kelley\u2019s method (L-KM) and of OSM that requires limited memory (at\nmost n + 1 constraints for an n-dimensional problem) independent of the iteration.\nWe prove convergence for L-KM when the convex part of the objective (g) is\nstrongly convex and show it converges linearly when g is also smooth. Our analysis\nrelies on duality between minimization of the composite objective and minimization\nof a convex function over the corresponding submodular base polytope. We\nintroduce a limited memory version, L-FCFW, of the Fully-Corrective Frank-\nWolfe (FCFW) method with approximate correction, to solve the dual problem.\nWe show that L-FCFW and L-KM are dual algorithms that produce the same\nsequence of iterates; hence both converge linearly (when g is smooth and strongly\nconvex) and with limited memory. 
We propose L-KM to minimize composite convex and submodular objectives; however, our results on L-FCFW hold for general polytopes and may be of independent interest.\n\n1 Introduction\n\nOne of the earliest and fundamental methods to minimize non-smooth convex objectives is Kelley's method, which minimizes the maximum of lower bounds on the convex function given by the supporting hyperplanes to the function at each previously queried point. An approximate solution to the minimization problem is found by minimizing this piecewise linear approximation, and the approximation is then strengthened by adding the supporting hyperplane at the current approximate solution [11, 6]. Many variants of Kelley's method have been analyzed in the literature [e.g., 16, 12, 8]. Kelley's method and its variants are a natural fit for problems involving a piecewise linear function, such as composite convex and submodular objectives. This paper defines a new limited memory version of Kelley's method adapted to composite convex and submodular objectives, and establishes the first convergence rate for such a method, solving the open problem proposed in [2, 3].\nSubmodularity is a discrete analogue of convexity and has been used to model combinatorial constraints in a wide variety of machine learning applications, such as MAP inference, document summarization, sensor placement, clustering, and image segmentation [3, and references therein]. Submodular set functions are defined with respect to a ground set of elements V, which we may identify with {1, . . . , n} where |V| = n. These functions capture the property of diminishing returns: F : {0, 1}^n → R is said to be submodular if F(A ∪ {e}) − F(A) ≥ F(B ∪ {e}) − F(B) for all A ⊆ B ⊆ V, e ∉ B. Lovász gave a convex extension f : [0, 1]^n → R of the submodular set functions which takes the value of the set function on the vertices of the [0, 1]^n hypercube, i.e. f(1_S) = F(S), where 1_S is the indicator vector of the set S ⊆ V [17]. (See Section 2 for details.)\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nIn this work, we propose a variant of Kelley's method, LIMITED MEMORY KELLEY'S METHOD (L-KM), to minimize the composite convex and submodular objective\n\nminimize g(x) + f(x),\n\n(P)\n\nwhere g : R^n → R is a closed strongly convex proper function and f : R^n → R is the Lovász extension (see Section 2 for details) of a given submodular function F : 2^V → R. Composite convex and submodular objectives have been used extensively in sparse learning, where the support of the model must satisfy certain combinatorial constraints. L-KM builds on the ORIGINAL SIMPLICIAL METHOD (OSM), proposed by Bach [3] to minimize such composite objectives. At the ith iteration, OSM approximates the Lovász extension by a piecewise linear function f_(i), the maximum of the supporting hyperplanes to the function at each previously queried point. It is natural to approximate the submodular part of the objective by a piecewise linear function, since the Lovász extension is piecewise linear (with possibly an exponential number of pieces). OSM terminates once the algorithm reaches the optimal solution, in contrast to a subgradient method, which might continue to oscillate. Contrast OSM with Kelley's Method: Kelley's Method approximates the full objective function using a piecewise linear function, while OSM only approximates the Lovász extension f and uses the exact form of g. In [3], the authors show that OSM converges to the optimum; however, no rate of convergence is given. 
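To make the cutting-plane idea concrete, here is a toy one-dimensional sketch (entirely our own, not the authors' implementation) of an OSM/L-KM-style loop: g(x) = x² is the strongly convex part, the "base polytope" degenerates to the vertex set {−1, +1}, so f(x) = |x| is accessed only through a linear-maximization oracle, and the convex subproblem is solved by a crude grid search.

```python
# Toy sketch of an OSM / L-KM style cutting-plane loop (illustrative only).
# Minimize g(x) + f(x) with g(x) = x^2 and f(x) = max_{w in {-1,+1}} w*x = |x|.

VERTICES = [-1.0, 1.0]           # stand-in for vert(B(F)) in one dimension

def g(x):
    return x * x

def lmo(x):                      # linear-maximization oracle: argmax_w w*x
    return max(VERTICES, key=lambda w: w * x)

def solve_subproblem(cuts):      # argmin_x g(x) + max_j w_j*x, by toy grid search
    grid = [i / 1000.0 for i in range(-2000, 2001)]
    return min(grid, key=lambda x: g(x) + max(w * x for w in cuts))

def kelley(x0, tol=1e-6, max_iter=50):
    cuts = [lmo(x0)]             # initial supporting hyperplane of f
    x = x0
    for _ in range(max_iter):
        x = solve_subproblem(cuts)
        v = lmo(x)               # new cut at the current iterate
        upper = g(x) + v * x     # p^(i): true objective value at x
        lower = g(x) + max(w * x for w in cuts)   # d^(i): model value at x
        if upper - lower <= tol:
            return x
        # limited-memory flavor: keep only cuts active at x, plus the new one
        t = max(w * x for w in cuts)
        cuts = [w for w in cuts if w * x == t] + [v]
    return x
```

On this toy instance the loop stops at the minimizer x = 0 of x² + |x| after two subproblem solves; the grid search stands in for the exact strongly convex subproblem solver the method assumes.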
Moreover, OSM maintains an approximation of the Lov\u00e1sz extension\nby maintaining a set of linear constraints whose size grows linearly with the number of iterations.\nHence the subproblems are harder to solve with each iteration.\nThis paper introduces L-KM, a variant of OSM that uses no more than n + 1 linear constraints\nin each approximation f(i) (and often, fewer) and provably converges when g is strongly convex.\nWhen in addition g is smooth, our new analysis of L-KM shows that it converges linearly, and, as a\ncorollary, that OSM also converges linearly, which was previously unknown.\nTo establish this result, we introduce the algorithm L-FCFW to solve a problem dual to (P ):\n\nmaximize h(w)\nsubject to w 2 B(F ),\n\n(D)\n\nwhere h : Rn ! R is a smooth concave function and B(F ) \u21e2 Rn is the submodular base polytope\ncorresponding to a given submodular function F (de\ufb01ned below). We show L-FCFW is a limited\nmemory version of the FULLY-CORRECTIVE FRANK-WOLFE (FCFW) method with approximate\ncorrection [15], and hence converges linearly to a solution of (D).\nWe show that L-KM and L-FCFW are dual algorithms in the sense that both algorithms produce\nthe same sequence of primal iterates and lower and upper bounds on the objective. This connection\nimmediately implies that L-KM converges linearly. Furthermore, when g is smooth as well as\nstrongly convex, we can recover the dual iterates of L-FCFW from the primal iterates of L-KM.\nRelated Work: The Original Simplicial Method was proposed by Bach (2013) [3]. 
As mentioned\nearlier, it converges \ufb01nitely but no known rate of convergence was known before the present work.\nIn 2015, Lacoste-Julien and Jaggi proved global linear convergence of variants of the Frank-Wolfe\nalgorithm, including the Fully Corrective Frank-Wolfe (FCFW) with approximate correction [15].\nL-FCFW, proposed in this paper, can be shown to be a limited memory special case of the latter,\nwhich proves linear convergence of both L-KM and OSM.\nMany authors have studied convergence guarantees and reduced memory requirements for variants\nof Kelley\u2019s method [11, 6]. These variants are computationally disadvantaged compared to OSM\nunless these variants allow approximation of only part of the objective. Among the earliest work on\nbounded storage in proximal level bundle methods is a paper by Kiwiel (1995) [12]. This method\nprojects iterates onto successive approximations of the level set of the objective; however, unlike our\nmethod, it is sensitive to the choice of parameters (level sets) and oblivious to any simplicial structure:\niterates are often not extreme points of the epigraph of the function. Subsequent work on the proximal\nsetup uses trust regions, penalty functions, level sets, and other more complex algorithmic tools;\nwe refer the reader to [18] for a survey on bundle methods. For the dual problem, a paper by Von\nHohenbalken (1977) [24] shares some elements of our proof techniques. However, their results\nonly apply to differentiable objectives and do not bound the memory. Another restricted simplicial\ndecomposition method was proposed by Hearn et. al. (1987) [10], which limits the constraint set\nby user-de\ufb01ned parameters (e.g., r = 1 reduces to the Frank-Wolfe algorithm [9]): it can replace an\natom with minimal weight in the current convex combination with a prior iterate of the algorithm,\n\n2\n\n\fwhich may be strictly inside the feasible region. 
In contrast, L-FCFW obeys a known upper bound (n + 1) on the number of vertices, and hence requires no parameter tuning.\nApplications: Composite convex and submodular objectives have gained popularity over the last few years in a large number of machine learning applications such as structured regularization or empirical risk minimization [4]: min_{w∈R^n} ∑_i l(y_i, w^⊤x_i) + Ω(w), where w are the model parameters and Ω : R^n → R is a regularizer. The Lovász extension can be used to obtain a convex relaxation of a regularizer that penalizes the support of the solution w to achieve structured sparsity, which improves model interpretability or encodes knowledge about the domain. For example, fused regularization uses Ω(w) = ∑_i |w_i − w_{i+1}|, which is the Lovász extension of the generalized cut function, and group regularization uses Ω(w) = ∑_g d_g ‖w_g‖_∞, which is the Lovász extension of the coverage submodular function. (See Appendix A, Table 1 for details on these and other submodular functions.)\nFurthermore, minimizing a composite convex and submodular objective is dual to minimizing a convex objective over a submodular polytope (under mild conditions). This duality is central to the present work. First-order projection-based methods like online stochastic mirror descent and its variants require computing a Bregman projection min{ω(x) − ∇ω(y)^⊤(x − y) : x ∈ P} to minimize a strictly convex function ω : R^n → R over the set P ⊆ R^n. Computing this projection is often difficult, and prevents practical application of these methods, though this class of algorithms is known to obtain near optimal convergence guarantees in various settings [20, 1]. 
Using L-FCFW to compute these projections can reduce the memory requirements in variants of online mirror descent used for learning over spanning trees (to reduce communication delays in networks [13]), permutations to model scheduling delays [26], and k-sets for principal component analysis [25], to give a few examples of submodular online learning problems. Other example applications of convex minimization over submodular polytopes include computation of densest subgraphs [19], computation of a lower bound for the partition function of log-submodular distributions [7] and distributed routing [14].\nSummary of contributions: We discuss background and the problem formulations in Section 2. Section 3 describes L-KM, our proposed limited memory version of OSM, and shows that L-KM converges and solves a problem over R^n using subproblems with at most n + 1 constraints. We introduce duality between our primal and dual problems in Section 4. Section 5 introduces a limited memory (and hence faster) version of Fully-Corrective Frank-Wolfe, L-FCFW, and proves linear convergence of L-FCFW. We establish the duality between L-KM and L-FCFW in Appendix E and thereby show L-KM achieves linear convergence and L-FCFW solves subproblems over no more than n + 1 vertices. We present preliminary experiments in Section 7 that highlight the reduced memory usage of both L-KM and L-FCFW and show that their performance compares favorably with OSM and FCFW respectively.\n\n2 Background and Notation\nConsider a ground set V of n elements on which the submodular function F : 2^V → R is defined. The function F is said to be submodular if F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B) for all A, B ⊆ V. This is equivalent to the diminishing marginal returns characterization mentioned before. Without loss of generality, we assume F(∅) = 0. 
For x ∈ R^n and A ⊆ V, we define x(A) = ∑_{k∈A} x(k) = 1_A^⊤x, where 1_A ∈ R^n is the indicator vector of A, and let both x(k) and x_k denote the kth element of x. Given a submodular set function F : 2^V → R, the submodular polyhedron and the base polytope are defined as P(F) = {w ∈ R^n : w(A) ≤ F(A), ∀ A ⊆ V} and B(F) = {w ∈ R^n : w(V) = F(V), w ∈ P(F)}, respectively. We use vert(B(F)) to denote the vertex set of B(F). The Lovász extension of F is the piecewise linear function [17]\n\nf(x) = max_{w∈B(F)} w^⊤x.\n\n(1)\n\nThe Lovász extension can be computed using Edmonds' greedy algorithm for maximizing linear functions over the base polytope (in O(n log n + nθ) time, where θ is the time required to compute the submodular function value). This extension can be defined for any set function; however, it is convex if and only if the set function is submodular [17]. We call a permutation π over [n] consistent¹ with x ∈ R^n if x_{π_i} ≥ x_{π_j} whenever i ≤ j. Each permutation π corresponds to an extreme point w^π of the base polytope, with w^π_{π_k} = F({π_1, π_2, . . . , π_k}) − F({π_1, π_2, . . . , π_{k−1}}). For x ∈ R^n, let V(x) be the set of vertices of B(F) that correspond to permutations consistent with x. Note that\n\n∂f(x) = conv(V(x)) = argmax_{w∈B(F)} w^⊤x,\n\n(2)\n\nwhere ∂f(x) is the subdifferential of f at x and conv(S) represents the convex hull of the set S.\nWe assume all convex functions in this paper are closed and proper [21]. Given a convex function g : R^n → R, its Fenchel conjugate g* : R^n → R is defined as\n\ng*(w) = max_{x∈R^n} w^⊤x − g(x).\n\n(3)\n\nNote that when g is strongly convex, the right hand side of (3) always has a unique solution, so g* is defined for all w ∈ R^n. Fenchel conjugates are always convex, regardless of the convexity of the original function. Since we assume g is closed, g** = g. Fenchel conjugates satisfy (∂g)^{−1} = ∂g* in the following sense:\n\nw ∈ ∂g(x) ⟺ g(x) + g*(w) = w^⊤x ⟺ x ∈ ∂g*(w),\n\n(4)\n\nwhere ∂g(x) is the subdifferential of g at x. When g is α-strongly convex and β-smooth, g* is 1/β-strongly convex and 1/α-smooth [21, Section 4.2]. (See Appendix A.2 for details.)\nProofs of all results that do not follow easily from the main text can be found in the appendix.\n\n¹Therefore, the Lovász extension can also be written as f(x) = ∑_k x_{π_k}[F({π_1, π_2, . . . , π_k}) − F({π_1, π_2, . . . , π_{k−1}})], where π is a permutation consistent with x and F(∅) = 0 by assumption.\n\n3 Limited Memory Kelley's Method\n\nWe now present our novel limited memory adaptation L-KM of the Original Simplicial Method (OSM). We first briefly review OSM as proposed by Bach [3, Section 7.7] and discuss problems of OSM with respect to memory requirements and the rate of convergence. We then highlight the changes in OSM, and verify that these changes will enable us to show a bound on the memory requirements while maintaining finite convergence. Proofs omitted from this section can be found in Appendix C.\nOriginal Simplicial Method: To solve the primal problem (P), it is natural to approximate the piecewise linear Lovász extension f with cutting planes derived from the function values and subgradients of the function at previous iterations, which results in piecewise linear lower approximations of f. This is the basic idea of OSM introduced by Bach in [3]. This approach contrasts with Kelley's Method, which approximates the entire objective function g + f. 
OSM adds a cutting plane to the approximation of f at each iteration, so the number of linear constraints in its subproblems grows linearly with the number of iterations.² Hence it becomes increasingly challenging to solve the subproblem as the number of iterations grows. Further, despite its finite convergence, as mentioned in the introduction there was no known rate of convergence for OSM or its dual method prior to this work.\nLimited Memory Kelley's Method: To address these two challenges - memory requirements and unknown convergence rate - we introduce and analyze a novel limited memory version L-KM of OSM which ensures that the number of cutting planes maintained by the algorithm at any iteration is bounded by n + 1. This thrift bounds the size of the subproblems at any iteration, thereby making L-KM cheaper to implement. We describe L-KM in detail in Algorithm 1.\nL-KM and OSM differ only in the set of vertices V^(i) considered at each step: L-KM keeps only those vectors w ∈ V^(i−1) that maximize w^⊤x^(i), whereas OSM keeps every vector in V^(i−1).\nWe state some properties of L-KM here with proofs in Appendix C. We will revisit many of these properties later via the lens of duality.\nThe sets V^(i) in L-KM are affinely independent, which shows the size of the subproblems is bounded.\nTheorem 1. For all i ≥ 0, the vectors in V^(i) are affinely independent. Moreover, |V^(i)| ≤ n + 1.\n\n²Concretely, we obtain OSM from Algorithm 1 by setting V^(i) = V^(i−1) ∪ {v^(i)} in step 6.\n\nAlgorithm 1 L-KM: The Limited Memory Kelley's Method for (P)\nRequire: strongly convex function g : R^n → R, submodular function F : 2^V → R, tolerance ε ≥ 0\nEnsure: ε-suboptimal solution x♯ to (P)\n1: initialize: choose x^(0) ∈ R^n, set ∅ ⊊ V^(0) ⊆ vert(B(F)) affinely independent\n2: for i = 1, 2, . . . do\n3: Convex subproblem. Define approximation f_(i)(x) = max{w^⊤x : w ∈ V^(i−1)} and solve\n\nx^(i) = argmin_x g(x) + f_(i)(x).\n\n4: Submodular subproblem. Compute the value and a subgradient of f at x^(i):\n\nf(x^(i)) = max_{w∈B(F)} w^⊤x^(i),  v^(i) ∈ ∂f(x^(i)) = argmax_{w∈B(F)} w^⊤x^(i).\n\n5: Stopping condition. Break if duality gap p^(i) − d^(i) ≤ ε, where\n\np^(i) = g(x^(i)) + f(x^(i)),  d^(i) = g(x^(i)) + f_(i)(x^(i)).\n\n6: Update memory. Identify active set A^(i) and update memory V^(i):\n\nA^(i) = {w ∈ V^(i−1) : w^⊤x^(i) = f_(i)(x^(i))},  V^(i) = A^(i) ∪ {v^(i)}.\n\n7: return x^(i)\n\nFigure 1: An illustration of L-KM (a)-(d) (left to right): the blue curve marked with × denotes the ith function approximation g + f_(i). In (d), note that the L-KM approximation g + f_(4) is obtained by dropping the leftmost constraint in g + f_(3) (in (c)), unlike OSM.\nL-KM may discard pieces of the lower approximation f_(i) at each iteration. However, it does so without any adverse effect on the solution to the current subproblem:\nLemma 1. The convex subproblem (in Step 3 of algorithm L-KM) has the same solution and optimal value over the new active set A^(i) as over the memory V^(i−1):\n\nx^(i) = argmin_x g(x) + max{w^⊤x : w ∈ V^(i−1)} = argmin_x g(x) + max{w^⊤x : w ∈ A^(i)}.\n\nLemma 1 shows that L-KM remembers the important information about the solution, i.e. only the tight subgradients, at each iteration. Note that at the ith iteration, the solution x^(i) is unique by the strong convexity of g, and thus we can improve the lower bound d^(i) since new information (i.e. v^(i)) is added:\nCorollary 1. The sequence {d^(i)} constructed by L-KM forms strictly increasing lower bounds on the value of (P): d^(1) < d^(2) < · · · ≤ p⋆ = min_{x∈R^n} f(x) + g(x).\nRemark. It is easy to see that the sequence {p^(i)} constructed by L-KM forms upper bounds on p⋆; hence by Corollary 1, {p^(i) − d^(i)} form valid optimality gaps for L-KM.\nCorollary 2. 
L-KM does not stall: for any iterations i₁ ≠ i₂, we solve subproblems over distinct sets of vertices, V^(i₁) ≠ V^(i₂).\nWe can strengthen Corollary 2 and show L-KM in fact converges to the exact solution in finitely many iterations:\nTheorem 2. L-KM (Algorithm 1) terminates after finitely many iterations. Moreover, for any given ε ≥ 0, suppose L-KM terminates when i = i_ε; then p⋆ + ε ≥ p^(i_ε) ≥ p⋆ and p⋆ ≥ d^(i_ε) ≥ p⋆ − ε. In particular, when we choose ε = 0, we have p^(i_ε) = p⋆ = d^(i_ε), and x^(i_ε) is the unique optimal solution to (P).\n\nIn this section, we have shown that L-KM solves a series of limited memory convex subproblems with no more than n + 1 linear constraints, and produces strictly increasing lower bounds that converge to the optimal value.\n\n4 Duality\nL-KM solves a series of subproblems parametrized by the sets V ⊆ vert(B(F)):\n\nminimize g(x) + t\nsubject to t ≥ v^⊤x, v ∈ V.\n\n(P_V)\n\nNotice that when V = vert(B(F)), we recover (P). We now analyze these subproblems via duality. The Lagrangian of this problem with dual variables λ_v for v ∈ V is\n\nL(x, t, λ) = g(x) + t + ∑_{v∈V} λ_v (v^⊤x − t).\n\nThe pair ((x, t), λ) is primal-dual optimal for this problem iff it satisfies the KKT conditions [5]:\n\n• Optimality. 0 ∈ ∂_x L(x, t, λ) ⟹ −∑_{v∈V} λ_v v ∈ ∂g(x), and 0 = (d/dt) L(x, t, λ) ⟹ ∑_{v∈V} λ_v = 1.\n• Primal feasibility. t ≥ v^⊤x for each v ∈ V.\n• Dual feasibility. λ_v ≥ 0 for each v ∈ V.\n• Complementary slackness. λ_v (v^⊤x − t) = 0 for each v ∈ V.\n\nThe requirement that λ lie in the simplex emerges naturally from the optimality conditions, and reduces the Lagrangian to L(x, λ) = g(x) + (∑_{v∈V} λ_v v)^⊤x. One can introduce the variable w = ∑_{v∈V} λ_v v, which is dual feasible so long as w ∈ conv(V). We can rewrite the Lagrangian in terms of x and w ∈ conv(V) as L(x, w) = g(x) + w^⊤x. 
Minimizing L(x, w) over x, we obtain the dual problem\n\nmaximize −g*(−w)\nsubject to w ∈ conv(V).\n\n(D_V)\n\nNote (D) is the same as (D_V) if V = vert(B(F)) and h(w) = −g*(−w), where g* is the Fenchel conjugate of g. Notice that g* is smooth if g is strongly convex (Lemma 4 in Appendix A.2).\nTheorem 3 (Strong Duality). The primal problem (P_V) and the dual problem (D_V) have the same finite optimal value.\n\nBy analyzing the KKT conditions, we obtain the following result, which we will use later in the design of our algorithms.\nLemma 2. Suppose (x, λ) solve ((P_V), (D_V)) and t = max_{v∈V} v^⊤x. By complementary slackness,\n\nλ_v > 0 ⟹ v^⊤x = t,\n\nand in particular, {v : λ_v > 0} ⊆ {v : v^⊤x = t}.\n\nNotice {v : v^⊤x = t} is the active set of L-KM. We will see {v : λ_v > 0} is the (minimal) active set of the dual method L-FCFW. (If strict complementary slackness holds, these sets are the same.)\nThe first KKT condition shows how to move between primal and dual optimal variables.\nTheorem 4. If g : R^n → R is strongly convex and w⋆ solves (D_V), then\n\nx⋆ = (∂g)^{−1}(−w⋆) = ∇g*(−w⋆)\n\n(5)\n\nsolves (P_V). If in addition g is smooth and x⋆ solves (P_V), then w⋆ = −∇g(x⋆) solves (D_V).\n\nAlgorithm 2 L-FCFW: Limited Memory Fully Corrective Frank-Wolfe for (D)\nRequire: smooth concave function h : R^n → R, submodular function F : 2^V → R, tolerance ε ≥ 0\nEnsure: ε-suboptimal solution w♯ to (D)\n1: initialize: set ∅ ⊊ V^(0) ⊆ vert(B(F))\n2: for i = 1, 2, . . . do\n3: Convex subproblem. Solve\n\nw^(i) = argmax{h(w) : w ∈ conv(V^(i−1))}.\n\nFor each v ∈ V^(i−1), define λ_v ≥ 0 so that w^(i) = ∑_{v∈V^(i−1)} λ_v v and ∑_{v∈V^(i−1)} λ_v = 1.\n4: Submodular subproblem. Compute the gradient x^(i) = ∇h(w^(i)) and solve\n\nv^(i) = argmax{w^⊤x^(i) : w ∈ B(F)}.\n\n5: Stopping condition. Break if duality gap p^(i) − d^(i) ≤ ε, where\n\np^(i) = (v^(i))^⊤x^(i),  d^(i) = (w^(i))^⊤x^(i).\n\n6: Update memory. 
Identify a superset of active vertices B^(i) and update memory V^(i):\n\nB^(i) ⊇ {w ∈ V^(i−1) : λ_w > 0},  V^(i) = B^(i) ∪ {v^(i)}.\n\n7: return w^(i)\n\nProof. Check the optimality conditions to prove the result. By definition, x⋆ satisfies the first optimality condition. To check complementary slackness, we rewrite the condition as\n\nλ_v (v^⊤x − t) = 0 ∀v ∈ V ⟺ (∑_{v∈V} λ_v v)^⊤x = t ⟺ w^⊤x = max_{v∈V} v^⊤x.\n\nNotice (w⋆)^⊤(∇g*(−w⋆)) = max_{v∈V} v^⊤(∇g*(−w⋆)) by optimality of w⋆, since v − w⋆ is a feasible direction for any v ∈ V, proving x⋆ = ∇g*(−w⋆) solves (P_V).\nThat the primal optimal variable yields a dual optimal variable via w⋆ = −∇g(x⋆) follows from a similar argument together with ideas from the proof of strong duality in Appendix D.\n\n5 Solving the dual problem\nLet's return to the dual problem (D): maximize a smooth concave function h(w) = −g*(−w) over the polytope B(F) ⊆ R^n. Linear optimization over this polytope is easy; hence a natural strategy is to use the Frank-Wolfe method or one of its variants [15]. However, since the cost of each linear minimization is not negligible, we will adopt a Frank-Wolfe variant that makes considerable progress at each iteration by solving a subproblem of moderate complexity: LIMITED MEMORY FULLY CORRECTIVE FRANK-WOLFE (L-FCFW, Algorithm 2), which at every iteration exactly minimizes the function g*(−w) over the convex hull of the current subset of vertices V^(i). Here we overload notation intentionally: when g is smooth and strongly convex, we will see that we can choose the set of vertices V^(i) in L-FCFW (Algorithm 2) so that the algorithm matches either L-KM or OSM depending on the choice of B^(i) (Line 6 of L-FCFW). For details of the duality between L-KM and L-FCFW see Section 6.\n\nLimited memory. In L-FCFW, we may choose any active set B^(i) ⊇ {w ∈ V^(i−1) : λ_w > 0}. When B^(i) = V^(i−1), we call the algorithm (vanilla) FCFW. When B^(i) is chosen to be small, we
When B(i) is chosen to be small, we\ncall the algorithm LIMITED MEMORY FCFW (L-FCFW). Standard FCFW increases the size of the\nactive set at each iteration, whereas the most limited memory variant of L-FCFW uses only those\nvertices needed to represent the iterate w(i).\nMoreover, recall Carath\u00e9odory\u2019s theorem (see e.g. [23]): for any set of vectors V, if x 2 conv(V) \u2713\nRn, then there exists a subset A \u2713V with |A|\uf8ff n + 1 such that x 2 conv(A). Hence we see we can\nchoose B(i) to contain at most n + 1 vertices at each iteration (hence n + 2 in V (i)), or even fewer if\nthe iterate lies on a low-dimensional face of B(F ). (The size of B(i) may depend on the solver used\n\n7\n\n\ffor (DV); to reduce the size of B(i), we can minimize a random linear objective over the optimal set\nof (DV) as in [22].)\nLinear convergence. Lacoste-Julien and Jaggi [15] show that FCFW converges linearly to an\n\u270f-suboptimal solution when g is smooth and strongly convex so long as the active set B(i) and iterate\nx(i) satisfy three conditions they call approximate correction(\u270f):\n\n1. Better than FW. h(y(i)) \uf8ff min 2[0,1] h((1 )w(i1) + v(i1))).\n2. Small away-step gap. max{(w(i) v)>x(i) : v 2V (w(i))}\uf8ff \u270f, where V(w(i)) = {v 2\n\nV (i1) : v > 0}.\n\n3. Representation. x(i) 2 conv(B(i)).\n\nBy construction, iterates of L-FCFW always satisfy these conditions with \u270f = 0:\n\n1. Better than FW. For any 2 [0, 1], w = (1 )w(i1) + v(i1) is feasible.\n2. Zero away-step gap. For each v 2V (i), if w(i) = v, then clearly (w(i) v)>(x(i)) = 0.\notherwise (if w(i) 6= v) v w(i) is a feasible direction, and so by optimality of w(i)\n(w(i) v)>(x(i)) \uf8ff 0.\n\n3. Representation. We have w(i) 2 conv(B(i)) by construction of B(i).\n\nHence we have proved Theorem 5:\nTheorem 5. Suppose g is \u21b5-smooth and -strongly convex. 
Let M be the diameter of B(F ) and\n be the pyramidal width3 of P , then the lower bounds d(i) in L-FCFW (Algorithm 2) converges\nlinearly at the rate of 1 \u21e2, i.e. p? d(i+1) \uf8ff (1 \u21e2)(p? d(i)), where \u21e2 = \nPrimal-from-dual algorithm. Recall that dual iterates yield primal iterates via Theorem 4. Hence\nthe gradients x(i) = rg\u21e4(w(i)) computed by L-FCFW converge linearly to the solution x? of\n(P ). However, it is dif\ufb01cult to run L-FCFW directly to solve (D) given only access to g, since in that\ncase computing g\u21e4 and its gradient requires solving another optimization problem; moreover, we will\nsee below that L-KM computes the same iterates. See Appendix E for more discussion.\n\n4\u21b5 ( \n\nM )2.\n\n6 L-KM (and OSM) converge linearly\n\nL-KM (Algorithm 1) and L-FCFW (Algorithm 2) are dual algorithms in the following strong sense:\nTheorem 6. Suppose g is \u21b5-smooth and -strongly convex. In L-FCFW (Algorithm 2), suppose we\nchoose B(i) = A(i) = {v 2V (i1) : v>x(i) = w(i)>x(i)}. Then\n1. The primal iterates x(i) of L-KM and L-FCFW match.\n2. The sets V (i) used at each iteration of L-KM and L-FCFW match.\n3. The upper and lower bounds p(i) and d(i) of L-KM and L-FCFW match.\nCorollary 3. The active sets of L-FCFW can be chosen to satisfy |B(i)|\uf8ff n + 1.\nTheorem 7. Suppose g is \u21b5-strongly convex and let M be the diameter of B(F ), the duality gap\np(i) d(i) in L-KM (Algorithm 1) converges linearly: p(i) d(i) \uf8ff (p? d(i)) + M 2/(2) when\n(p? d(i)) M 2/(2) and p(i) d(i) \uf8ff Mp2(p? d(i))/ otherwise. Note that p? d(i)\nconverges linearly by Theorem 5.\nWhen g is smooth and strongly convex, OSM and vanilla FCFW are dual algorithms in the same\nsense when we choose B(i) = V (i1). For details of the duality between OSM and L-FCFW see\nAppendix G. 
Hence we have a similar convergence result for OSM:\n\n³See Appendix H for definitions of the diameter and pyramidal width.\n\nFigure 2: Dimension n = 10 in (a), n = 100 in (b), (c) and (d). The methods converged in (a), (b), (c) and (d).\n\nFigure 3: L-FCFW and FCFW converged in all plots; FW with away steps has converged in (b), (c) and (d).\n\nTheorem 8. Suppose g is α-strongly convex and let M be the diameter of B(F). The duality gap p^(i) − d^(i) in OSM converges linearly: p^(i) − d^(i) ≤ (p⋆ − d^(i)) + M²/(2α) when p⋆ − d^(i) ≥ M²/(2α), and p^(i) − d^(i) ≤ M√(2(p⋆ − d^(i))/α) otherwise.\nRemark. Note that p⋆ − d^(i) converges linearly by Theorem 5; Theorems 7 and 8 then imply that L-KM and OSM converge linearly when g is smooth and strongly convex.\n\nMoreover, this connection generates a new way to prune the active set of L-KM even further using a primal-dual solver: we may use any active set B^(i) ⊇ {w ∈ V^(i−1) : λ_w > 0}, where λ ∈ R^{|V^(i−1)|} is a dual optimal solution to (P_V). When strict complementary slackness fails, we can have B^(i) ⊊ A^(i).\n\n7 Experiments and Conclusion\n\nWe present in this section a computational study: we minimize non-separable composite functions g + f where g(x) = x^⊤(A + nI_n)x + b^⊤x for x ∈ R^n, and f is the Lovász extension of the submodular function F(A) = |A|(2n − |A| + 1)/2 for A ⊆ [n]. To construct g(·), the entries of A ∈ R^{n×n} and b ∈ R^n were randomly sampled from U[−1, 1] and U[0, n] respectively; I_n is the n × n identity matrix. We remark that L-KM converges so quickly that the bound on the size of the active set is less important, in practice, than the fact that the active set need not grow at every iteration.\nPrimal convergence: We first solve a toy problem for n = 10 and show that the number of constraints does not exceed n + 1. Note that the number of constraints might oscillate before it reaches n + 1 (Fig. 2(a)). 
We next compare the memory used in each iteration (Fig. 2(b)), the optimality gap per iteration (Fig. 2(c)), and the running time (Fig. 2(d)) of L-KM and OSM by solving the problem for n = 100 up to an accuracy of 10⁻⁵ of the optimal value. Note that L-KM uses much less memory than OSM, converges at almost the same rate in iterations, and its running time per iteration improves as the iteration count increases.

Dual convergence: We compare the convergence of L-FCFW, FCFW, and Frank-Wolfe with away steps on the dual problem max_{w ∈ B(F)} −(w − b)⊤(A + nIn)⁻¹(w − b) for n = 100 up to a relative accuracy of 10⁻⁵. L-FCFW maintains smaller subproblems (Fig. 3(a)) and converges faster than FCFW as the number of iterations increases (Fig. 3(d)). Their provable duality gaps converge linearly in the number of iterations. Moreover, as shown in Figs. 3(b) and 3(c), L-FCFW and FCFW return better approximate solutions than Frank-Wolfe with away steps under the same optimality gap tolerance.

Conclusion: This paper defines a new limited memory version of Kelley's method adapted to composite convex and submodular objectives, and establishes the first convergence rate for such a method, solving the open problem posed in [2, 3]. We show bounds on the memory requirements and convergence rate, and demonstrate compelling performance in practice.

Acknowledgments

This work was supported in part by DARPA Award FA8750-17-2-0101. A part of this work was done while the first author was at the Department of Mathematical Sciences, Tsinghua University, and while the second author was visiting the Simons Institute, UC Berkeley. The authors would also like to thank Sebastian Pokutta for invaluable discussions on the Frank-Wolfe algorithm and its variants.

References

[1] J. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial optimization.
Mathematics of Operations Research, 39(1):31–45, 2013.

[2] Francis Bach. Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization, 25(1):115–129, 2015.

[3] Francis Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.

[4] Francis R. Bach. Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems, pages 118–126, 2010.

[5] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2009.

[6] E. Ward Cheney and Allen A. Goldstein. Newton's method for convex programming and Tchebycheff approximation. Numerische Mathematik, 1(1):253–268, 1959.

[7] Josip Djolonga and Andreas Krause. From MAP to marginals: Variational inference in Bayesian submodular models. In Advances in Neural Information Processing Systems, pages 244–252, 2014.

[8] Yoel Drori and Marc Teboulle. An optimal variant of Kelley's cutting-plane method. Mathematical Programming, 160(1-2):321–351, 2016.

[9] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

[10] D. Hearn, S. Lawphongpanich, and J. Ventura. Restricted simplicial decomposition: Computation and extensions. Mathematical Programming Study, 31:99–118, 1987.

[11] James E. Kelley, Jr. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8(4):703–712, 1960.

[12] Krzysztof C. Kiwiel. Proximal level bundle methods for convex nondifferentiable optimization, saddle-point problems and variational inequalities. Mathematical Programming, 69(1-3):89–109, 1995.

[13] W. M. Koolen, M. K. Warmuth, and J. Kivinen. Hedging structured concepts.
COLT, 2010.

[14] Walid Krichene, Syrine Krichene, and Alexandre Bayen. Convergence of mirror descent dynamics in the routing game. In European Control Conference (ECC), pages 569–574. IEEE, 2015.

[15] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pages 496–504, 2015.

[16] Claude Lemaréchal, Arkadii Nemirovskii, and Yurii Nesterov. New variants of bundle methods. Mathematical Programming, 69(1-3):111–147, 1995.

[17] L. Lovász. Submodular functions and convexity. Mathematical Programming: The State of the Art, 1983.

[18] Marko Mäkelä. Survey of bundle methods for nonsmooth optimization. Optimization Methods and Software, 17(1):1–29, 2002.

[19] Kiyohito Nagano, Yoshinobu Kawahara, and Kazuyuki Aihara. Size-constrained submodular minimization through minimum norm base. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 977–984, 2011.

[20] A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, New York, 1983.

[21] Ernest K. Ryu and Stephen Boyd. Primer on monotone operator methods. Journal of Applied and Computational Mathematics, 15(1):3–43, 2016.

[22] Madeleine Udell and Stephen Boyd. Bounding duality gap for separable problems with linear constraints. Computational Optimization and Applications, 64(2):355–378, 2016.

[23] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

[24] Balder Von Hohenbalken. Simplicial decomposition in nonlinear programming algorithms. Mathematical Programming, 13(1):49–68, 1977.

[25] M. K. Warmuth and D. Kuzmin. Randomized PCA algorithms with regret bounds that are logarithmic in the dimension.
In Advances in Neural Information Processing Systems, pages 1481–1488, 2006.

[26] S. Yasutake, K. Hatano, S. Kijima, E. Takimoto, and M. Takeda. Online linear optimization over permutations. In Algorithms and Computation, pages 534–543. Springer, 2011.