{"title": "Decomposition-Invariant Conditional Gradient for General Polytopes with Line Search", "book": "Advances in Neural Information Processing Systems", "page_first": 2690, "page_last": 2700, "abstract": "Frank-Wolfe (FW) algorithms with linear convergence rates have recently achieved great efficiency in many applications. Garber and Meshi (2016) designed a new decomposition-invariant pairwise FW variant with favorable dependency on the domain geometry. Unfortunately, it applies only to a restricted class of polytopes and cannot achieve theoretical and practical efficiency at the same time. In this paper, we show that by employing an away-step update, similar rates can be generalized to arbitrary polytopes with strong empirical performance. A new \"condition number\" of the domain is introduced which allows leveraging the sparsity of the solution. We applied the method to a reformulation of SVM, and the linear convergence rate depends, for the first time, on the number of support vectors.", "full_text": "Decomposition-Invariant Conditional Gradient for\n\nGeneral Polytopes with Line Search\n\nMohammad Ali Bashiri\n\nXinhua Zhang\n\nDepartment of Computer Science, University of Illinois at Chicago\n\nChicago, Illinois 60607\n\n{mbashi4,zhangx}@uic.edu\n\nAbstract\n\nFrank-Wolfe (FW) algorithms with linear convergence rates have recently achieved\ngreat ef\ufb01ciency in many applications. Garber and Meshi (2016) designed a new\ndecomposition-invariant pairwise FW variant with favorable dependency on the\ndomain geometry. Unfortunately it applies only to a restricted class of polytopes\nand cannot achieve theoretical and practical ef\ufb01ciency at the same time. In this\npaper, we show that by employing an away-step update, similar rates can be\ngeneralized to arbitrary polytopes with strong empirical performance. A new\n\u201ccondition number\u201d of the domain is introduced which allows leveraging the sparsity\nof the solution. 
We applied the method to a reformulation of SVM, and the linear\nconvergence rate depends, for the \ufb01rst time, on the number of support vectors.\n\n1\n\nIntroduction\n\nThe Frank-Wolfe algorithm [FW, 1] has recently gained revived popularity in constrained convex\noptimization, in part because linear optimization on many feasible domains of interest admits ef\ufb01cient\ncomputational solutions [2]. It has been well known that FW achieves O(1/\u0001) rate for smooth convex\noptimization on a compact domain [1, 3, 4]. Recently a number of works have focused on linearly\nconverging FW variants under various assumptions.\nIn the context of convex feasibility problem, [5] showed linear rates for FW where the condition\nnumber depends on the distance of the optimum to the relative boundary [6]. Similar dependency\nwas derived in the local linear rate on polytopes using the away-step [6, 7]. With a different analysis\napproach, [8\u201310] derived linear rates when the Robinson\u2019s condition is satis\ufb01ed at the optimal solution\n[11], but it was not made clear how the rate depends on the dimension and other problem parameters.\nTo avoid the dependency on the location of the optimum, [12] proposed a variant of FW whose\nrate depends on some geometric parameters of the feasible domain (a polytope). In a similar \ufb02avor,\n[13, 14] analyzed four versions of FW including away-steps [6], and their af\ufb01ne-invariant rates depend\non the pyramidal width (Pw) of the polytope, which is hard to compute and can still be ill-conditioned.\nMoreover, [15] recently gave a duality-based analysis for non-strongly convex functions. Some lower\nbounds on the dependency of problem parameters for linear rates of FW are given in [12, 16].\nTo get around the lower bound, one may tailor FW to speci\ufb01c objectives and domains (e.g. spectra-\nhedron in [17]). 
[18] specialized the pairwise FW (PFW) to simplex-like polytopes (SLPs), whose vertices are binary and which are defined by equality constraints and xi ≥ 0. The advantages include: a) the convergence rate depends linearly on the cardinality of the optimal solution and the squared domain diameter (D²), which can be much better than the pyramidal width; b) it is decomposition-invariant, meaning that it does not maintain a pool of accumulated atoms, and the away-step is performed on the face that the current iterate lies on. This results in considerable savings in computation and storage.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Table 1: Comparison of related methods. These numbers need to be multiplied with κ log(1/ε) to get the convergence rates, where κ is the condition number of the objective, D is the diameter of the domain, s is the cardinality of the optimum, and Pw is the pyramidal width. Our method is AFW. × means inapplicable or no rate known. PFW-1 [18] and AFW-1 apply only to SLPs, hence not covering Qk (k ≥ 2). [13] showed the pyramidal width for Pk only with k = 1.

                                  PFW-1 [18]  PFW-2 [18]  LJ [13]     AFW-1   AFW-2
                                  (SLP)       general     general     (SLP)   general
Unit cube [0,1]^n                 ns          ×           n²          ns      n²s
Pk = {x ∈ [0,1]^n : 1⊤x = k}      ks          ×           n (k = 1)   ks      k²s
Qk = {x ∈ [0,1]^n : 1⊤x ≤ k}      ×           ×           k·Pw⁻²      ×       k²·min(sk, n)
arbitrary polytope in R^n         ×           ×           D²·Pw⁻²     ×       D²nHs

However, [18] suffers from multiple inherent restrictions. First, it applies only to SLPs, which, although they encompass useful sets such as the k-simplex Pk, do not cover its convex hull with the origin (Qk):

Pk = {x ∈ [0,1]^n : 1⊤x = k},   Qk = {x ∈ [0,1]^n : 1⊤x ≤ k},   where k ∈ {1, . . . , n}.

Here 1 = (1, . . . , 1)⊤. 
Extending its analysis to general polytopes is not promising because it relies\nfundamentally on the integrality of the vertices. Second, its rate is derived from a delicately designed\nsequence of step size (PFW-1), which exhibits no empirical competency. In fact, the experiments in\n[18] resorted to line search (PFW-2). However no rate was proved for it. As shown in [13], dimension\nfriendly bounds are intrinsically hard for PFW, and they settled for the factorial of the vertex number.\nThe goal of this paper is to address these two issues while at the same time retaining the computational\nef\ufb01ciency of decomposition invariance. Our contributions are four folds. First we generalize the\ndimension friendly linear rates to arbitrary polytopes, and this is achieved by replacing the pairwise\nPFW in [18] with the away-step FW (AFW, \u00a72), and setting the step sizes by line search instead of a\npre-de\ufb01ned schedule. This allows us to avoid \u201cswapping atoms\u201d in PFW, and the resulting method\n(AFW-2) delivers not only strong empirical performance (\u00a75) but also strong theoretical guarantees\n(\u00a73.5), improving upon PFW-1 and PFW-2 which are strong in either theory or practice, but not both.\nSecond, a new condition number Hs is introduced in \u00a73.1 to characterize the dimension dependency of\nAFW-2. Compared with pyramidal width, it not only provides a more explicit form for computation,\nbut also leverages the cardinality (s) of the optimal solution. This may lead to much smaller constants\nconsidering the likely sparsity of the solution. Since pyramidal width is hard to compute [13], we\nleave the thorough comparison for future work, but they are comparable on simple polytopes. The\ndecomposition invariance of AFW-2 also makes each step much more ef\ufb01cient than [13].\nThird, when the domain is indeed an SLP, we provide a step size schedule (AFW-1, \u00a73.4) yielding the\nsame rate as PFW-1. 
This is in fact nontrivial because the price for replacing PFW with AFW is the much increased hardness of maintaining the integrality of iterates. The current iterate is scaled in AFW, while PFW simply adds (scaled) new atoms (which, on the other hand, complicates the analysis for line search [13]). Our solution relies on first running a constant number of FW-steps.
Finally, we applied AFW to a relaxed convex hull reformulation of binary kernel SVM with bias (§4), obtaining O(nκ(#SV)³ log(1/ε)) computational complexity for AFW-1 and O(nκ(#SV)⁴ log(1/ε)) for AFW-2. Here κ is the condition number of the objective, n is the number of training examples, and #SV is the number of support vectors in the optimal solution. This is much better than the best known result of O(n³κ log(1/ε)) based on sequential minimal optimization [SMO, 19, 20], because #SV is typically much smaller than n. To the best of our knowledge, this is the first linear convergence rate for hinge-loss SVMs with bias where the rate leverages dual sparsity.
A brief comparison of our method (AFW) with [18] and [13] is given in Table 1. AFW-1 matches the superior rates of PFW-1 on SLPs, and AFW-2 is more general, with a rate slightly worse than AFW-1 on SLPs. PFW-2 has no rates available, and pyramidal width is hard to compute in general.

2 Preliminaries and Algorithms

Our goal is to solve min_{x∈P} f(x), where P is a polytope and f is both strongly convex and smooth. A function f : P → R is α-strongly convex if f(y) ≥ f(x) + ⟨y − x, ∇f(x)⟩ + (α/2)‖y − x‖², ∀ x, y ∈ P. In this paper, all norms are Euclidean, and we write vectors in bold lowercase letters. f is β-smooth if f(y) ≤ f(x) + ⟨y − x, ∇f(x)⟩ + (β/2)‖y − x‖², ∀ x, y ∈ P. 

Algorithm 1: Decomposition-invariant Away-step Frank-Wolfe (AFW)
1  Initialize x1 by an arbitrary vertex of P. Set q0 = 1.
2  for t = 1, 2, . . . do
3    Choose the FW-direction via v+_t ← arg min_{v∈P} ⟨v, ∇f(xt)⟩, and set dFW_t ← v+_t − xt.
4    Choose the away-direction v−_t by calling the away-oracle in (3), and set dA_t ← xt − v−_t.
5    if ⟨dFW_t, −∇f(xt)⟩ ≥ ⟨dA_t, −∇f(xt)⟩ then dt ← dFW_t, else dt ← dA_t.   ▷ Choose a direction
6    Choose the step size ηt by using one of the following two options:
7    Option 1: Pre-defined step size:   ▷ This is for SLP only. Need input arguments n0, γt.
8      if t ≤ n0 then
9        Set qt = t, ηt = 1/t, and revert dt = dFW_t.   ▷ Perform FW-step for the first n0 steps
10     else
11       Find the smallest integer s ≥ 0 such that qt defined as follows satisfies qt ≥ ⌈1/γt⌉:
12         qt ← 2^s·qt−1 + 1 if line 5 adopts the FW-step,
           qt ← 2^s·qt−1 − 1 if line 5 adopts the away-step,   and ηt ← 1/qt.   (2)
13   Option 2: Line search: ηt ← arg min_{η≥0} f(xt + ηdt), s.t. xt + ηdt ∈ P.   ▷ General purpose
14   xt+1 ← xt + ηtdt. Return xt if ⟨−∇f(xt), dFW_t⟩ ≤ ε.

Algorithm 2: Decomposition-invariant Pairwise Frank-Wolfe (PFW) (exactly the same as [18])
1  ... as in Algorithm 1, except replacing a) line 5 by dt = dPFW_t := v+_t − v−_t, and b) lines 8-11 by
   Option 1: Pre-defined step size: Find the smallest integer s ≥ 0 such that 2^s·qt−1 ≥ 1/γt.
   Set qt ← 2^s·qt−1 and ηt ← 1/qt.   ▷ This option is for SLP only.

Denote the condition number as κ = β/α, and the diameter of the domain P as D. We require D < ∞, i.e., the domain is bounded. Let [m] := {1, . . . , m}. In general, a polytope P can be defined as

P = {x ∈ R^n : ⟨ak, x⟩ ≤ bk ∀ k ∈ [m], Cx = d}.   (1)

Here {ak} is a finite set of "directions" (m < ∞), and bk cannot be reduced without changing P. Although the equality constraints can be equivalently written as pairs of linear inequalities, we separate them out to improve the bounds below. Denoting A = (a1, . . . , am)⊤ and b = (b1, . . . , bm)⊤, we can simplify the representation into P = {x ∈ R^n : Ax ≤ b, Cx = d}.
In the sequel, we will find highly efficient solvers for a special class of polytopes that was also studied by [18]. We call a polytope a simplex-like polytope (SLP) if all vertices are binary (i.e., the set of extreme points ext(P) is contained in {0,1}^n) and the only inequality constraints are x ∈ [0,1]^n.¹
Our decomposition-invariant Frank-Wolfe (FW) method with away-steps is shown in Algorithm 1. There are two different schemes for choosing the step size: one with a fixed step size (AFW-1) and one with line search (AFW-2). Compared with [13], AFW-2 enjoys decomposition invariance. Like [13], we also present a pairwise version in Algorithm 2 (PFW), which is exactly the method given in [18].
The efficiency of the line search in step 13 of Algorithm 1 depends on the polytope. Although in general one needs a problem-specific procedure to compute the maximal step size, we will show in the experiments some examples where such procedures with high computational efficiency are available.
The idea of AFW is to compute a) the FW-direction in the conventional FW sense (call it the FW-oracle), and b) the away-direction (call it the away-oracle). Then pick the one that gives the steeper descent and take a step along it. 
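To make this loop concrete, here is a minimal sketch (ours, not the authors' code) of Algorithm 1 with line search (Option 2), instantiated on the probability simplex P1 = {x ≥ 0 : 1⊤x = 1}, where both oracles and the feasible step-size range have closed forms. The toy quadratic objective f(x) = ½‖x − c‖² is our own illustrative assumption.

```python
# Sketch of decomposition-invariant AFW with line search (Algorithm 1, Option 2)
# on the probability simplex.  Toy objective: f(x) = 0.5 * ||x - c||^2.

def afw_simplex(c, iters=5000, eps=1e-12):
    n = len(c)
    x = [0.0] * n
    x[0] = 1.0                                   # start at an arbitrary vertex
    for _ in range(iters):
        g = [x[i] - c[i] for i in range(n)]      # gradient of f at x
        # FW-oracle: vertex e_j minimizing <v, g>
        j = min(range(n), key=lambda i: g[i])
        d_fw = [-xi for xi in x]; d_fw[j] += 1.0
        # away-oracle in the spirit of (3): restrict to vertices e_a that keep
        # the tight constraints x_i = 0 tight, i.e. a with x_a > 0; maximize <v, g>
        a = max((i for i in range(n) if x[i] > 0.0), key=lambda i: g[i])
        d_aw = list(x); d_aw[a] -= 1.0
        gap_fw = -sum(di * gi for di, gi in zip(d_fw, g))
        if gap_fw <= eps:                        # FW gap certifies near-optimality
            break
        gap_aw = -sum(di * gi for di, gi in zip(d_aw, g))
        if gap_fw >= gap_aw:                     # line 5: pick the steeper direction
            d, eta_max = d_fw, 1.0
        else:                                    # away-step: feasibility requires
            d = d_aw                             # x_a * (1 + eta) - eta >= 0
            eta_max = x[a] / (1.0 - x[a]) if x[a] < 1.0 else 1.0
        dd = sum(di * di for di in d)
        # exact line search for the quadratic: minimize f(x + eta * d) over [0, eta_max]
        eta = min(eta_max, -sum(di * gi for di, gi in zip(d, g)) / dd)
        x = [x[i] + eta * d[i] for i in range(n)]
    return x
```

With c in the relative interior of the simplex the minimizer is x* = c, so the iterates should converge there while every step keeps only a feasibility-preserving clamp, matching the decomposition-invariant away-step described above.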
Our away-oracle adopts the decomposition-invariant approach in [18], which differs from [13] by saving the cost of maintaining a pool of atoms. To this end, our search space in the away-oracle is restricted to the vertices that satisfy with equality all the inequality constraints that the current xt does:

v−_t := arg max_v ⟨v, ∇f(xt)⟩,  s.t. Av ≤ b, Cv = d, and ⟨ai, xt⟩ = bi ⇒ ⟨ai, v⟩ = bi ∀ i.   (3)

¹Although [18] does not allow for x ≤ 1 constraints, we can add a slack variable yi: yi + xi = 1, yi ≥ 0.

Besides saving the space of atoms, this also dispenses with computing the inner product between the gradient and all existing atoms. Same as [18], it presumes efficient solutions to the away-oracle, which may preclude its applicability to problems where only the FW-oracle is efficiently solvable. We will show some examples that admit an efficient away-oracle.
Before moving on to the analysis, we here make a new, albeit quick, observation that this selection scheme is in fact decomposing xt implicitly. Specifically, it tries all possible decompositions of xt, and for each of them it finds the best away-direction in the traditional sense. Then it picks the best of the best over all proper convex decompositions of xt.
Property 1. Denote S(x) := {S ⊆ P : x is a proper convex combination of all elements in S}, where proper means that all elements in S have a strictly positive weight. Then the away-step in (3) is exactly equivalent to max_{S∈S(xt)} max_{v∈S} ⟨v, ∇f(xt)⟩. See the proof in Appendix A.

3 Analysis

We aim to analyze the rate by which the primal gap ht := f(xt) − f(x∗) decays. 
Here x∗ is the minimizer of f, and we assume it can be written as the convex combination of s vertices of P.

3.1 A New Geometric "Condition Number" of a Polytope

Underlying the analysis of linear convergence for FW-style algorithms is the following inequality that involves a geometric "condition number" Hs of the polytope (v+_t and v−_t are the FW and away-directions):

√(2Hs·ht/α) · ⟨v+_t − v−_t, ∇f(xt)⟩ ≤ ⟨x∗ − xt, ∇f(xt)⟩.   (4)

In Theorem 3 of [13], this Hs is essentially the inverse pyramidal width. In Lemma 3 of [18], it is the cardinality of the optimal solution, which, despite being better than the pyramidal width, is restricted to SLPs. Our first key step here is to relax this restriction to arbitrary polytopes and define our Hs.
Let {ui} be the set of vertices of the polytope P; this set must be finite. We do not assume ui is binary. The following "margin" for each separating hyperplane direction ak will be important:

gk := max_i ⟨ak, ui⟩ − second max_i ⟨ak, ui⟩ ≥ 0.   (5)

Here the second max is the second distinct max in {⟨ak, ui⟩ : i}. If ⟨ak, ui⟩ is invariant to i, then this inequality ⟨ak, x⟩ ≤ bk is indeed an equality constraint (⟨ak, x⟩ = max_{z∈P} ⟨ak, z⟩), hence it can be moved to Cx = d. So w.l.o.g. we assume gk > 0. Now we state the generalized result.
Lemma 1. Let P be defined as in (1). Suppose x can be written as a convex combination of s vertices of P: x = Σ_{i=1}^s γi·ui, where γi ≥ 0, 1⊤γ = 1. Then any y ∈ P can be written as y = Σ_{i=1}^s (γi − ∆i)ui + (1⊤∆)z, such that z ∈ P, ∆i ∈ [0, γi], and 1⊤∆ ≤ √Hs · ‖x − y‖, where

Hs := max_{S ⊆ [m], |S| = s}  Σ_{j=1}^n ( Σ_{k∈S} akj / gk )².   (6)

In addition, Equation (4) holds with this definition of Hs. Note our Hs is defined here, not in (4).
Some intuitive interpretations of Hs are in order. First, the definition in (6) admits a much more explicit characterization than pyramidal width. The maximization in (6) ranges over all possible subsets of constraints with cardinality s, and can hence be much lower than if s = m (taking all constraints). Recall that pyramidal width is oblivious to, hence does not benefit from, the sparsity of the optimal solution. More comparisons are hard to make because [13] only provided an existential proof of pyramidal width, along with its value for the simplex and hypercube only.²
²[21] showed pyramidal width is equivalent to a more interpretable quantity called "facial distance", and they derived its value for more examples. But the evaluation of its value remains challenging in general.
However, Hs is clearly not intrinsic to the polytope. For example, by definition Hs = n for Q2. By contrast, we can introduce a slack variable y to Q2, leading to a polytope over [x; y] (vertical concatenation), with x ≥ 0, y ≥ 0, y + 1⊤x = 2. The augmented polytope enjoys Hs = s. Nevertheless, adding slack variables increases the diameter of the space, and the vertices may no longer be binary. It also incurs more computation.
Second, gk may approach 0 (tending Hs to infinity) when more linear constraints are introduced and vertices get closer neighbors. 
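For small polytopes, (6) can be evaluated by brute force. The following sketch is our own helper (the function names are ours), assuming the vertex set is enumerable and no inequality is an implicit equality (so every gk > 0); on the unit cube it reproduces Hs = s as in Example 1.

```python
from itertools import combinations

def margin(a, vertices):
    # g_k of (5): max minus second *distinct* max of <a_k, u_i> over the vertices
    # (exact dedup assumes integral data; use a tolerance for general floats)
    vals = sorted({sum(aj * vj for aj, vj in zip(a, v)) for v in vertices},
                  reverse=True)
    return vals[0] - vals[1]

def H_s(A, vertices, s):
    # (6): maximize sum_j (sum_{k in S} a_kj / g_k)^2 over all size-s subsets S
    # of the inequality constraints (rows of A)
    g = [margin(a, vertices) for a in A]
    n, m = len(A[0]), len(A)
    return max(sum(sum(A[k][j] / g[k] for k in S) ** 2 for j in range(n))
               for S in combinations(range(m), min(s, m)))
```

For the unit cube in R³ the constraint directions are ±e_k with all gk = 1, and the maximum in (6) is attained by picking s distinct coordinates, giving Hs = s.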
Hs is in\ufb01nity if the domain is not a polytope, requiring an uncountable\nnumber of supporting hyperplanes. Third, due to the square in (6), Hs grows more rapidly as one\nvariable participates in a larger number of constraints, than as a constraint involves a larger number\nof variables. When all gk = 1 and all akj are nonnegative, Hs grows with the magnitude of akj.\nHowever this is not necessarily the case when akj elements have mixed sign. Finally, Hs is relative\nto the af\ufb01ne subspace that P lies in, and is independent of linear equality constraints.\nThe proof of Lemma 1 utlizes the fact that the lowest value of 1(cid:62)\u2206 is the optimal objective value of\n(7)\n\ns.t. 0 \u2264 \u2206 \u2264 \u03b3, y = x \u2212 (u1, . . . , us)\u2206 + (1(cid:62)\u2206)z,\n\nmin\u2206,z 1(cid:62)\u2206,\n\nwhere the inequalities are both elementwise. To ensure z \u2208 P, we require Az \u2264 b, i.e.\n\nz \u2208 P,\n\n(b1(cid:62) \u2212 AU )\u2206 \u2265 A(y \u2212 x), where U = (u1, . . . , us).\n\n(8)\nThe rest of the proof utilizes the optimality conditions of \u2206, and is relegated to Appendix A.\nCompared with Lemma 2 of [18], our Lemma 1 does not require ext(P) to be binary, and allows\narbitrary inequality constraints rather than only x \u2265 0. Note Hs depends on b indirectly, and employs\na more explicit form for computation than pyramidal width. Obviously Hs is non-decreasing in s.\nExample 1. To get some idea, consider the k-simplex Pk or more general polytopes {x \u2208 [0, 1]n :\nCx = d}. In this case, the inequality constraints are exclusively xi \u2208 [0, 1], meaning ak = \u00b1ek for\nall k \u2208 [2n] in (1). Here ek stands for a canonical vector of straight 0 except a single 1 in the k-th\ncoordinate. Obviously all gk = 1. Therefore by Lemma 1, one can derive Hs = s, \u2200 s \u2264 n.\nExample 2. To include inequality, let us consider Qk, the convex hull of a k-simplex. Lemma 1\nimplies its Hs = n + 3s \u2212 3, independent of k. 
One might hope to get better Hs when k = 1, since\nthe constraint x \u2264 1 can be dropped in this case. Unfortunately, still Hs = n.\nRemark 1. The L0 norm of the optimal x can be connected with s simply by Caratheodory\u2019s theorem.\nObviously s = (cid:107)x(cid:107)0 (L0 norm) for P1 and Q1. In general, an x in P may be decomposed in multiple\nways, and Lemma 1 immediately applies to the lowest (best) possible value of s (which we will refer\nto as the cardinality of x following [18]). For example, the smallest s for any x \u2208 Pk (or Qk) must\nbe at most (cid:107)x(cid:107)0 + 1, because x must be in the convex hull of V := {y \u2208 {0, 1}n : 1(cid:62)y = k, xi =\n0 \u21d2 yi = 0 \u2200 i}. Clearly its af\ufb01ne hull has dimension (cid:107)x(cid:107)0, and V is a subset of ext(Pk) = ext(Qk).\n3.2 Tightness of Hs under a Given Representation of the Polytope\nWe show some important examples that demonstrate the tightness of Lemma 1 with respect to the\ndimensionality (n) and the cardinality of x (s). Note the tightness is in the sense of satisfying the\nconditions in Lemma 1, not in the rate of convergence for the optimization algorithm.\nExample 3. Consider Q2. u1 = e1 is a vertex and let x = u1 (hence s = 1) and y = (1, \u0001, . . . , \u0001)(cid:62),\nwhere \u0001 > 0 is a small scalar. So in the necessary condition (8), the row corresponding to 1(cid:62)x \u2264 2\nbecomes \u22061 \u2265 (n \u2212 1)\u0001 =\nExample 4. Let us see another example that is not simplex-like. Let ak = \u2212ek + en+1 + en+2\nfor k \u2208 [n]. Let A = (a1, . . . , an)(cid:62) = (\u2212I, 1, 1) where I is the identity matrix. De\ufb01ne P as\ni=1 i\u0001ei + ren+1 + (1 \u2212 r\u0001)en+2, where r = n(n + 1)/2 and\n\u0001 > 0 is a small positive constant. x can be represented as the convex combination of n + 1 vertices\n(9)\nx =\nWith U = (u1, . . . , un+1), we have b1(cid:62) \u2212 AU = (I, 0). 
Let y = x + \u0001en+1, which is clearly in P.\nn2 (cid:107)y \u2212 x(cid:107). Applying Lemma 1 with s = n + 1 and\nThen (8) becomes \u2206 \u2265 \u00011, and so 1(cid:62)\u2206 \u2265\ngk = 1 for all k, we get Hs = 2n2 + n \u2212 1, which is of the same order of magnitude as n2.\n3.3 Analysis for Pairwise Frank-Wolfe (PFW-1) on SLPs\nEquipped with Lemma 1, we can now extend the analysis in [18] to SLPs where the constraint of\nx \u2264 1 can be explicitly accommodated without having to introduce a slack variable which increases\nthe diameter D and costs more computations.\n\nP =(cid:8)x \u2208 [0, 1]n+2 : Ax \u2264 1(cid:9) , i.e. b = 1. Since A is totally unimodular, all the vertices of P must\nbe binary. Let us consider x =(cid:80)n\n\ni\u0001ui + (1 \u2212 r\u0001)un+1, where ui = ei + en+1 for i \u2264 n, and un+1 = en+2.\n\nn \u2212 1 \u00b7 (cid:107)x \u2212 y(cid:107) . By Lemma 1, Hs = n which is almost n \u2212 1.\n\n(cid:88)n\n\n\u221a\n\n5\n\n\u221a\n\ni=1\n\n\f1\n\n\u03b1\n\n(1\u2212c1)\n\nt\u22121\n2 , where c1 =\n\n2 (1 \u2212 c1)t\u22121 if we\n16\u03b2HsD2 . The proof just replaces all card(x\u2217) in [18] with Hs.\n\nTheorem 1. Applying PFW-1 to SLP, all iterates must be feasible and ht \u2264 \u03b2D2\nset \u03b3t = c1/2\nSlight effort is needed to guarantee the feasibility and we show it as Lemma 6 in Appendix A.\nWhen P is not an SLP or general inequality constraints are present, we resort to line search (PFW-2),\nwhich is more ef\ufb01cient than PFW-1 in practice. However, the analysis becomes challenging [13, 18],\nbecause it is dif\ufb01cult to bound the number of steps where the step size is clamped due to the feasibility\nconstraint (the swap step in [13]). So [13] appealed to a bound that is the factorial of the number of\nvertices. 
Fortunately, we will show below that by switching to AFW, the line search version achieves\nlinear rates with improved dimension dependency for general polytopes, and the pre-de\ufb01ned step\nversion preserves the strong rates of PFW-1 on SLPs. These are all facilitated by the Hs in Lemma 1.\n\n3.4 Analysis for Away-step Frank-Wolfe with Pre-de\ufb01ned Step Size (AFW-1) on SLPs\n\nWe \ufb01rst show that AFW-1 achieves the same rate of convergence as PFW-1 on SLPs. Although this\ndoes not appear surprising and the proof architecture is similar to [18], we stress that the step size\nneeds delicate modi\ufb01cations because the descent direction dt in PFW does not rescale xt, while\nAFW does. Our key novelty is to \ufb01rst run a constant number of FW-steps (O( 1\nt ) rate), and start\naccepting away-steps when the step size is small enough to ensure feasibility and linear convergence.\nWe \ufb01rst establish the feasibility of iterates under the pre-de\ufb01ned step sizes. Proofs are in Appendix A.\nLemma 2 (Feasibility of iterates for AFW-1). Suppose P is an SLP and the reference step sizes\n{\u03b3t}t\u2265n0 are contained in [0, 1]. Then the iterates generated by AFW-1 are always feasible.\nChoosing the step size. Key to the AFW-1 algorithm is the delicately chosen sequence of step\nsizes. For AFW-1, de\ufb01ne (logarithms are natural basis)\nc0(1 \u2212 c1)(t\u22121)/2, where M1 =\n\n(cid:114) \u03b1\n\n, M2 =\n\n\u03b8 = 52\n\n\u03b3t =\n\n\u03b2D2\n\n(10)\n\n\u221a\n\n,\n\n8Hs\n\n2\n\n3M2 log n0\n\n(1 \u2212 c1)1\u2212n0 .\n\n(11)\n\n,\n\nc1\n\nn0\n\nc0 =\n\nc1 =\n\n, n0 =\nt M2 log t for all t \u2208 [2, n0]. Obviously n0 \u2265 200 by (11).\nLemma 3. In AFW-1, we have ht \u2264 3\nThis result is similar to Theorem 1 in [4]. However, their step size is 2/(t + 2) leading to a 2\nt+2 M2\nrate of convergence. 
Such a step size will break the integrality of the iterates, and hence we adjusted\nthe step size, at the cost of a log t term in the rates which can be easily handled in the sequel.\nThe condition number c1 gets better (bigger) when: the strongly convex parameter \u03b1 is larger, the\nsmoothness constant \u03b2 is smaller, the diameter D of the domain is smaller, and Hs is smaller.\nLemma 4. For all t \u2265 n0, AFW-1 satis\ufb01es a) \u03b3t \u2264 1, b) \u03b3\u22121\nBy Lemma 2 and Lemma 4a, we know that the iterates generated by AFW-1 are all feasible.\nTheorem 2. Applying AFW-1 to SLP, the gap decays as ht \u2264 c0(1 \u2212 c1)t\u22121 for all t \u2265 n0.\nProof. By Lemma 3, hn0\u22643M2\n\nlog n0 = c0(1 \u2212 c1)n0\u22121. Let the result hold for some t \u2265 n0. Then\n\nt \u2265 1, and c) \u03b7t \u2208 [ 1\n\nt+1 \u2212 \u03b3\u22121\n\n4 \u03b3t, \u03b3t].\n\nn0\n\nM1\n\u03b8M2\nM 2\n1\nM2\n\n\u03b8 \u2212 4\n4\u03b82 <\n\n1\n200\n\n(cid:24) 1\n\n(cid:25)\n\nht+1 \u2264 ht + \u03b7t (cid:104)dt,\u2207f (xt)(cid:105) +\n\n\u03b2\n2\n\n\u03b72\nt D2\n\n(cid:10)v+\n(cid:114) \u03b1\n\nt ,\u2207f (xt)(cid:11) +\nt \u2212 v\u2212\n(cid:112)\n\nht +\n\n\u03b72\nt D2\n\n\u03b2\n2\n\n2Hs\n\n(smoothness of f)\n\n(by step 5 of Algorithm 1)\n\n\u03b2\n\u03b72\nt D2\n2\n(by (4) and the fact (cid:104)x\u2217 \u2212 xt,\u2207f (xt)(cid:105) \u2264 \u2212ht)\n\nM1\u03b3th1/2\n\n(Lemma 4c and the defn. of M1)\n\n(12)\n\n(13)\n\n(14)\n\n(15)\n\n(16)\n\n(17)\n\n\u2264 ht +\n\u03b7t\n2\n\u2264 ht \u2212 \u03b7t\n2\n\u2264 ht \u2212 1\n4\n= ht \u2212 M 2\n4\u03b8M2\n\u2264 c0(1 \u2212 c1)t\u22121\n\n\u221a\n\n1\n\n\u03b2\n2\n\n\u03b32\nt D2\n\n(cid:18)\n\nt +\nc0(1 \u2212 c1)(t\u22121)/2h1/2\nM 2\n1 \u2212 M 2\n1\n\u03b82M2\n4\u03b8M2\n\n+\n\n1\n\nt +\n\n(cid:19)\n\n6\n\nc0(1 \u2212 c1)t\u22121\n\nM 2\n1\n\u03b82M2\n= c0(1 \u2212 c1)t\n\n(by defn. of c1).\n\n(by defn. 
of \u03b3t)\n\n\fHere the inequality in step (17) is by treating (16) as a quadratic of h1/2\nassumption on ht. The last step completes the induction: the conclusion also holds for step t + 1.\n\nand applying the induction\n\nt\n\n3.5 Analysis for Away-step Frank-Wolfe with Line Search (AFW-2)\nWe \ufb01nally analyze AFW-2 on general polytopes with line search. Noting that f (xt + \u03b7dt)\u2212 f (x\u2217) \u2264\n(14) (with \u03b7t in (14) replaced by \u03b7), we minimize both sides over \u03b7 : xt + \u03b7dt \u2208 P. If none of the\ninequality constraints are satis\ufb01ed as equality at the optimal \u03b7t of line search, then we call it a good\nstep and in this case\n\n(cid:18)\n\n(cid:19)\n\nht+1 \u2264\n\n1 \u2212\n\n\u03b1\n\n256\u03b2D2Hs\n\nht,\n\n(Eq 14 in \u03b7 is minimized at \u03b7\u2217\n\nt :=\n\n1\n\u03b2D2 M1h1/2\n\nt\n\n).\n\n(18)\n\nThe only task left is to bound the number of bad steps (i.e. \u03b7t clamped by its upper bound). In [13]\nwhere the set of atoms is maintained, it is easily shown that up to step t there can be only at most t/2\nbad steps, and so the overall rate of convergence is slowed down by at most a factor of two. This\nfavorable result no longer holds in our decomposition-invariant AFW. However, thanks to the special\nproperty of AFW, it is still not hard to bound the number of bad steps between two good steps.\nt \u2264 1 and for FW-steps,\nFirst we notice that such clamping never happens for FW-steps, because \u03b7\u2217\nxt + \u03b7tdt \u2208 P implicitly enforces \u03b7t \u2264 1 only (after \u03b7t \u2265 0 is imposed). For an away-step, if the\nline search is blocked by some constraint, then at least one inequality constraint will turn into an\nequality constraint if the next step is still away. Since AFW selects the away-direction by respecting\nall equality constraints, the succession of away-steps (called an away epoch) must terminate when the\nset of equalities de\ufb01ne a singleton. 
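When f is quadratic, f(xt + η·dt) is a parabola in η, so the line search in step 13 of Algorithm 1 reduces to clamping the parabola's unconstrained minimizer to the feasible range [0, η_max]; a good step is exactly one where this clamp is inactive. A minimal sketch (the helper name is ours; Qd is assumed precomputed, as in the gradient-update trick of §4):

```python
def exact_step(grad, d, Qd, eta_max):
    # For quadratic f: f(x + eta*d) = f(x) + eta*<grad, d> + 0.5*eta^2*<d, Qd>,
    # minimized at eta = -<grad, d> / <d, Qd>, then clamped to [0, eta_max].
    num = -sum(g * di for g, di in zip(grad, d))
    den = sum(di * q for di, q in zip(d, Qd))
    if den <= 0.0:                 # flat (or non-convex) along d: take an endpoint
        return eta_max if num > 0.0 else 0.0
    return max(0.0, min(eta_max, num / den))
```

Whether the returned η equals η_max is precisely the good-step / bad-step distinction used in the rate analysis above.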
For any index set of inequality constraints S ⊆ [m], let P(S) := {x ∈ P : ⟨aj, x⟩ = bj, ∀ j ∈ S} be the set of points that satisfy these inequalities with equality. Let

n(P) := max{|S| : S ⊆ [m], |P(S)| = 1, |P(S′)| = ∞ for all S′ ⊊ S}   (19)

be the maxi-min number of constraints needed to define a singleton. Then obviously n(P) ≤ n, and so
Theorem 3. To find an ε-accurate solution, AFW-2 requires at most O((nβD²Hs/α) log(1/ε)) steps.
Example 5. Suppose f(x) = ½‖x + 1‖² with P = [0,1]^n. Clearly n(P) = n. Unfortunately, we can construct an initial x1 as a convex combination of only O(log n) vertices, yet AFW-2 will then run O(n) away-steps consecutively. Hence our above analysis on the maximum length of an away epoch seems tight, although having n consecutive away-steps between two good steps once is different from this happening multiple times. See the construction of x1 in Appendix A.
Tighter bounds. By refining the analysis of the polytopes, we may improve upon the n(P) bound. For example, it is not hard to show that n(Pk) = n(Qk) = n. Let us consider the number of non-zeros in the iterates xt. A bad step (which must be an away-step) will either a) set an entry to 1, which forces the corresponding entry of v−_t to be 1 in the future steps of the away epoch, hence can happen at most k times; or b) set at least one nonzero entry of xt to 0, and it never switches a zero entry to nonzero. But each FW-step may introduce at most k nonzeros. So the number of bad steps cannot exceed 2k times the number of FW-steps, and the overall iteration complexity is at most O((kβD²Hs/α) log(1/ε)).
We can now revisit Table 1 and observe the generality and efficiency of AFW-2. It is noteworthy that on SLPs, we are not yet able to establish the same rate as AFW-1. 
We believe that the vertices being binary is very special, making it hard to generalize the analysis.

4 Application to Kernel Binary SVM

As a concrete example, we apply AFW to the dual objective of a binary SVM with bias:

min_x f(x) := ½ xᵀQx − (1/C) 1ᵀx,  s.t. x ∈ [0, 1]ⁿ, yᵀx = 0.    (SVM-Dual)    (20)

Here y = (y₁, . . . , yₙ)ᵀ is the label vector with y_i ∈ {−1, 1}, and Q is the signed kernel matrix with Q_ij = y_i y_j k(x_i, x_j). Since the feasible region is an SLP with diameter O(√n), we can use both AFW-1 and PFW-1 to solve it with O(#SV · nκ log(1/ε)) iterations, where κ is the ratio between the maximum and minimum eigenvalues of Q (assuming Q is positive definite), and #SV stands for the number of support vectors in the optimal solution.

Computational efficiency per iteration. The key technique for computational efficiency is to keep updating the gradient ∇f(x) over the iterations, exploiting the fact that v_t^+ and v_t^− might be sparse and ∇f(x) = Qx − (1/C)1 is affine in x. In particular, when AFW takes a FW-step in line 5, we have

Qd_t = Qd_t^FW = Q(v_t^+ − x_t) = −∇f(x_t) − (1/C)1 + Qv_t^+.    (21)

Similar update formulas can be shown for the away-step d_t^A and the PFW-step d_t^PFW. So if v_t^+ (or v_t^−) has k non-zeros, all these three updates can be performed in O(kn) time. Based on them, we can update the gradient by ∇f(x_{t+1}) = ∇f(x_t) + η_t Qd_t. The FW-oracle and away-oracle cost O(n) time given the gradient, and the line search has a closed-form solution. See more details in Appendix B.

Major drawback. This approach unfortunately provides no control over the sparseness of v_t^+ and v_t^−. As a result, each iteration may require evaluating the entire kernel matrix (O(n²) kernel evaluations), leading to an overall computational cost of O(#SV · n³κ log(1/ε)). This can be prohibitive.

4.1 Reformulation by Reduced Convex Hull

To ensure the sparsity of each update, we reformulate the SVM dual objective (20) by using the reduced convex hull (RC-Hull, [22]). Let P and N be the sets of positive and negative examples, resp.

min_{θ, ξ⁺ ∈ R^|P|, ξ⁻ ∈ R^|N|, α, β}  (1/K)(1ᵀξ⁺ + 1ᵀξ⁻) + ½ ∥θ∥² − α + β,    (RC-Margin)
s.t. Aᵀθ − α1 + ξ⁺ ≥ 0,  −Bᵀθ + β1 + ξ⁻ ≥ 0,  ξ⁺ ≥ 0, ξ⁻ ≥ 0.    (22)

min_{u ∈ R^|P|, v ∈ R^|N|}  ½ ∥Au − Bv∥²,  s.t. u ∈ P_K, v ∈ P_K.    (RC-Hull)    (23)

Here A (or B) is a matrix whose i-th column is the (implicit) feature representation of the i-th positive (or negative) example. RC-Margin resembles the primal SVM formulation, except that the bias term is split into two terms α and β. RC-Hull is the dual problem of RC-Margin, and it has a very intuitive geometric meaning. When K = 1, RC-Hull finds the distance between the convex hulls of P and N. When the integer K is greater than 1, (1/K)Au ranges over a reduced convex hull of the positive examples, and the objective finds the distance between the reduced convex hulls of P and N.

Since the feasible region of RC-Hull is a simplex, d_t in AFW and PFW has at most 2K and 4K nonzeros respectively, and it costs O(nK) time to update the gradient (see Appendix B.1). Given K, Appendix B.2 shows how to recover the corresponding C in (20), and how to translate the optimal solutions.
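Both the SVM-dual updates and the RC-Hull updates rely on the same trick: since the gradient is affine in x, a step along a sparse direction d touches only the columns of Q indexed by the nonzeros of d. A minimal numpy sketch of this incremental update (helper name and calling convention are ours; the paper's full details are in Appendix B):

```python
import numpy as np

def update_gradient(grad, Q, d_idx, d_val, eta):
    """Return grad(x + eta*d) = grad(x) + eta * Q d, for a sparse direction d.

    d is given by its nonzero indices d_idx and values d_val, so only
    k = len(d_idx) columns of Q are read: O(k n) time instead of O(n^2).
    """
    return grad + eta * (Q[:, d_idx] @ d_val)
```

Each iteration then needs only the touched kernel columns, which is exactly what keeps the per-iteration cost at O(kn) (or O(nK) for RC-Hull).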
Although solving RC-Hull requires the knowledge of K (which is unknown a priori if we are only given C), in practice it is equally justified to tune the value of K via model selection tools in the first place, which approximately tunes the number of support vectors.

4.2 Discussion and Comparison of Rates of Convergence

Clearly, the feasible region of RC-Hull is an SLP, allowing us to apply AFW-1 and PFW-1 with optimal linear convergence: O(#SV · κK log(1/ε)) ≤ O(κ(#SV)² log(1/ε)), because K = 1ᵀu ≤ #SV. So overall, the computational cost is O(nκ(#SV)³ log(1/ε)).

[20] shows that sequential minimal optimization (SMO) [19, 23] costs O(n³κ log(1/ε)) computations. This is greater than O(nκ(#SV)³ log(1/ε)) when #SV ≤ n^{2/3}. [24] requires O(κ²n∥Q∥_sp log(1/ε)) iterations, and each iteration costs O(n). SVRG [25], SAGA [26], and SDCA [27] require losses to be decomposable and smooth, which does not hold for the hinge loss with a bias. SDCA can be extended to almost-smooth losses such as the hinge loss, but the dimension dependency is still unclear and it cannot handle the bias.

As a final remark, despite the superior rates of AFW-1 and PFW-1, their pre-defined step size makes them impractical. With line search, AFW-2 is much more efficient in practice, and at the same time provides a theoretical guarantee of O(nκ(#SV)⁴ log(1/ε)) computational cost, just slightly worse by a factor of #SV. Such an advantage in both theory and practice by one method is not available in PFW [18].

5 Experiments and Future Work

In this section we compare the empirical performance of AFW-2 against related methods.
We \ufb01rst\nillustrate the performance on kernel binary SVM, then we investigate a problem whose domain is not\nan SLP, and \ufb01nally we demonstrate the scalability of AFW-2 on a large scale dataset.\n\n8\n\n\f(a) Breast-cancer (K = 10)\n\n(b) a1a (K = 30)\n\n(c) ijcnn1 (K = 20)\n\nFigure 1: Comparison of SMO and AFW-2 on three different datasets\n\nBinary SVM Our \ufb01rst comparison is on solving kernel binary SVMs with bias. Three datasets are\nused. breast-cancer and a1a are obtained from the UCI repository [28] with n = 568 and 1, 605\ntraining examples respectively, and ijcnn1 is from [29] with a subset of 5, 000 examples.\nAs a competitor, we adopted the well established Sequential Minimal Optimization (SMO) algorithm\n[19]. The implementation updates all cached errors corresponding to each examples if any variable is\nbeing updated at each step. Using these cached error, the algorithm heuristically picks the best subset\nof variable to update at each iteration.\nWe \ufb01rst run AFW-2 on the RC-Hull objective in (23), with the value of K set to optimize the test\naccuracy (K shown in Figure 1). After obtaining the optimal solution, we compute the equivalent C\nvalue based on the conversion rule in Appendix B.2, and then run SMO on the dual objective (20).\nFigure 1 shows the decay of the primal SVM objective (hence\n\ufb02uctuation) as a function of (the number of kernel evaluations\ndivided by n). This avoids the complication of CPU frequency\nand kernel caching. Clearly, AFW-2 outperforms SMO on breast-\ncancer and ijcnn1, and overtakes SMO on a1a after a few iterations.\nPFW-1 and PFW-2 are also applicable to the RC-Hull formulation.\nAlthough the rate of PFW-1 is better than AFW-2, it is much slower\nin practice. 
Although empirically we observed that PFW-2 performs similarly to our AFW-2, unfortunately PFW-2 has no theoretical guarantee.

General Polytope. Our next comparison uses Q_k as the domain. Since it is not an SLP, neither PFW-1 nor PFW-2 provides a bound. Here we aim to show that AFW-2 is not only advantageous in providing a good rate of convergence, it is also comparable to (or better than) PFW-2 in terms of practical efficiency. Our objective is a least squares (akin to lasso):

min_x f(x) = ∥Ax − b∥²,  0 ≤ x ≤ 1,  1ᵀx ≤ 375.

Here A ∈ R^{100×1000}, and both A and b were generated randomly. Both the FW-oracle and away-oracle are simply based on sorting the gradient. As shown in Figure 2, AFW-2 is indeed slightly faster than PFW-2.

Figure 2: Least squares with Q_375.

Scalability. To demonstrate the scalability of AFW-2, we plot its convergence curve (K = 100) along with SMO on the full ijcnn1 dataset with 49,990 examples. In Figure 3, AFW-2 starts with a higher primal objective value, but after a while it outperforms SMO near the optimum. In this problem, kernel evaluation is the major computational bottleneck, hence it is used as the horizontal axis. This also helps avoid the complication of CPU speed (e.g., when wall-clock time is used).

Figure 3: Full ijcnn1.

6 Future work

We will extend the decomposition-invariant method to gauge-regularized problems [30–32], and derive comparable linear convergence rates. Moreover, although it is hard to evaluate the pyramidal width, it will be valuable to compare it with H_s, even in terms of upper/lower bounds.

Acknowledgements.
We thank Dan Garber for very helpful discussions and clarifications on [18]. Mohammad Ali Bashiri is supported in part by NSF grant RI-1526379.

References

[1] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

[2] Z. Harchaoui, M. Douze, M. Paulin, M. Dudik, and J. Malick. Large-scale image classification with trace-norm regularization. In Proc. IEEE Conf. Computer Vision and Pattern Recognition. 2012.

[3] E. S. Levitin and B. T. Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):787–823, 1966.

[4] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of International Conference on Machine Learning. 2013.

[5] A. Beck and M. Teboulle. A conditional gradient method with linear rate of convergence for solving convex linear systems. Mathematical Methods of Operations Research, 59(2):235–247, 2004.

[6] J. Guélat and P. Marcotte. Some comments on Wolfe's 'away step'. Mathematical Programming, 35(1):110–119, 1986.

[7] P. Wolfe. Convergence theory in nonlinear programming. In Integer and Nonlinear Programming. North-Holland, 1970.

[8] S. D. Ahipasaoglu, P. Sun, and M. J. Todd. Linear convergence of a modified Frank-Wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optimization Methods and Software, 23(1):5–19, 2008.

[9] R. Ñanculef, E. Frandi, C. Sartori, and H. Allende. A novel Frank-Wolfe algorithm:
analysis and applications to large-scale SVM training. Information Sciences, 285(C):66–99, 2014.

[10] P. Kumar and E. A. Yildirim. A linearly convergent linear-time first-order algorithm for support vector classification with a core set result. INFORMS Journal on Computing, 23(3):377–391, 2011.

[11] S. M. Robinson. Generalized equations and their solutions, part II: Applications to nonlinear programming. Springer Berlin Heidelberg, 1982.

[12] D. Garber and E. Hazan. A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization, 26(3):1493–1528, 2016.

[13] S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Neural Information Processing Systems. 2015.

[14] S. Lacoste-Julien and M. Jaggi. An affine invariant linear convergence analysis for Frank-Wolfe algorithms. In NIPS 2013 Workshop on Greedy Algorithms, Frank-Wolfe and Friends. 2013.

[15] A. Beck and S. Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming, pp. 1–27, 2016.

[16] G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. Technical report, University of Florida, 2014.

[17] D. Garber. Faster projection-free convex optimization over the spectrahedron. In Neural Information Processing Systems. 2016.

[18] D. Garber and O. Meshi. Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In Neural Information Processing Systems. 2016.

[19] J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Tech. Rep. MSR-TR-98-14, Microsoft Research, 1998.

[20] N. List and H. U. Simon. SVM-optimization and steepest-descent line search. In S. Dasgupta and A.
Klivans, eds., Proc. Annual Conf. Computational Learning Theory. Springer, 2009.

[21] J. Pena and D. Rodriguez.

[22] K. P. Bennett and E. J. Bredensteiner. Duality and geometry in SVM classifiers. In Proceedings of International Conference on Machine Learning. 2000.

[23] S. S. Keerthi, O. Chapelle, and D. DeCoste. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 7:1493–1515, 2006.

[24] Y. You, X. Lian, J. Liu, H.-F. Yu, I. S. Dhillon, J. Demmel, and C.-J. Hsieh. Asynchronous parallel greedy coordinate descent. In Neural Information Processing Systems. 2016.

[25] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Neural Information Processing Systems. 2013.

[26] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Neural Information Processing Systems. 2014.

[27] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.

[28] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

[29] D. Prokhorov. IJCNN 2001 neural network competition. Slide presentation in IJCNN, 1:97, 2001.

[30] M. Jaggi and M. Sulovsky. A simple algorithm for nuclear norm regularized problems. In Proceedings of International Conference on Machine Learning. 2010.

[31] X. Zhang, Y. Yu, and D. Schuurmans. Accelerated training for matrix-norm regularization: A boosting approach. In Neural Information Processing Systems. 2012.

[32] Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization.
Mathematical Programming, 152:75–112, 2015.