{"title": "Lower Bounds on Rate of Convergence of Cutting Plane Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 2541, "page_last": 2549, "abstract": "In a recent paper Joachims (2006) presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an $\\epsilon$ accurate solution in $O(1/\\epsilon^{2})$ iterations. By tightening the analysis, Teo et al. (2010) showed that $O(1/\\epsilon)$ iterations suffice. Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. In this paper we disprove this conjecture. We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a \\emph{multivariate} performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algorithm that converges in $O(1/\\sqrt{\\epsilon})$ iterations.", "full_text": "Lower Bounds on Rate of Convergence of Cutting\n\nPlane Methods\n\nXinhua Zhang\n\nDept. of Computing Science\n\nUniversity of Alberta\nxinhua2@ualberta.ca\n\nAnkan Saha\n\nDept. of Computer Science\n\nUniversity of Chicago\n\nankans@cs.uchicago.edu\n\nS.V. N. Vishwanathan\nDept. of Statistics and\n\nDept. of Computer Science\n\nPurdue University\n\nvishy@stat.purdue.edu\n\nAbstract\n\nIn a recent paper Joachims [1] presented SVM-Perf, a cutting plane method\n(CPM) for training linear Support Vector Machines (SVMs) which converges to\nan \u0001 accurate solution in O(1/\u00012) iterations. By tightening the analysis, Teo et al.\n[2] showed that O(1/\u0001) iterations suf\ufb01ce. Given the impressive convergence speed\nof CPM on a number of practical problems, it was conjectured that these rates\ncould be further improved. In this paper we disprove this conjecture. 
We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algorithm that converges in O(1/√ε) iterations.

1 Introduction

There has been an explosion of interest in machine learning over the past decade, much of which has been fueled by the phenomenal success of binary Support Vector Machines (SVMs). Driven by numerous applications, recently, there has been increasing interest in support vector learning with linear models. At the heart of SVMs is the following regularized risk minimization problem:

min_w J(w) := (λ/2)‖w‖² + Remp(w), with Remp(w) := (1/n) Σ_{i=1}^n max(0, 1 − y_i⟨w, x_i⟩),   (1)

where the first term is the regularizer and the second the empirical risk. Here we assume access to a training set of n labeled examples {(x_i, y_i)}_{i=1}^n where x_i ∈ R^d and y_i ∈ {−1, +1}, and use the square Euclidean norm ‖w‖² = Σ_i w_i² as the regularizer. The parameter λ controls the trade-off between the empirical risk and the regularizer.

There has been significant research devoted to developing specialized optimizers which minimize J(w) efficiently. In an award winning paper, Joachims [1] presented a cutting plane method (CPM)¹, SVM-Perf, which was shown to converge to an ε accurate solution of (1) in O(1/ε²) iterations, with each iteration requiring O(nd) effort. This was improved by Teo et al. 
[2], who showed that their Bundle Method for Regularized Risk Minimization (BMRM) (which encompasses SVM-Perf as a special case) converges to an ε accurate solution in O(nd/ε) time.

While online learning methods are becoming increasingly popular for solving (1), a key advantage of CPM such as SVM-Perf and BMRM is their ability to directly optimize nonlinear multivariate performance measures such as F1-score, ordinal regression loss, and ROCArea, which are widely used in some application areas. In this case Remp does not decompose into a sum of losses over individual data points like in (1), and hence one has to employ batch algorithms. Letting Δ(y, ȳ) denote the multivariate discrepancy between the correct labels y := (y_1, …, y_n)⊤ and a candidate labeling ȳ (to be concretized later), the Remp for the multivariate measure is formulated by [3] as

Remp(w) = max_{ȳ∈{−1,1}^n} [Δ(y, ȳ) + (1/n) Σ_{i=1}^n ⟨w, x_i⟩(ȳ_i − y_i)].   (2)

¹In this paper we use the term cutting plane methods to denote specialized solvers employed in machine learning. While clearly related, they must not be confused with cutting plane methods used in optimization.

In another award winning paper by Joachims [3], the regularized risk minimization problems corresponding to these measures are optimized by using a CPM.

Given the widespread use of CPM in machine learning, it is important to understand their convergence guarantees in terms of the upper and lower bounds on the number of iterations needed to converge to an ε accurate solution. The tightest, O(1/ε), upper bound on the convergence speed of CPM is due to Teo et al. [2], who analyzed a restricted version of BMRM which only optimizes over one dual variable per iteration. 
However, on practical problems the observed rate of convergence is significantly faster than predicted by theory. Therefore, it had been conjectured that the upper bounds might be further tightened via a more refined analysis. In this paper we construct counter examples for both decomposable Remp like in equation (1) and non-decomposable Remp like in equation (2), on which CPM requires Ω(1/ε) iterations to converge, thus disproving this conjecture². We will work with BMRM as our prototypical CPM. As Teo et al. [2] point out, BMRM includes many other CPM such as SVM-Perf as special cases.

Our results lead to the following natural question: Do the lower bounds hold because regularized risk minimization problems are fundamentally hard, or is it an inherent limitation of CPM? In other words, to solve problems such as (1), does there exist a solver which requires less than O(nd/ε) effort (better in n, d and ε)? We provide partial answers. To understand our contribution one needs to understand the two standard assumptions that are made when proving convergence rates:

• A1: The data points x_i lie inside a L2 (Euclidean) ball of radius R, that is, ‖x_i‖ ≤ R.
• A2: The subgradient of Remp is bounded, i.e., at any point w, there exists a subgradient g of Remp such that ‖g‖ ≤ G < ∞.

Clearly assumption A1 is more restrictive than A2. By adapting a result due to [6] we show that one can devise an O(nd/√ε) algorithm for the case when assumption A1 holds. Finding a fast optimizer under assumption A2 remains an open problem.

Notation: Lower bold case letters (e.g., w, µ) denote vectors, w_i denotes the i-th component of w, 0 refers to the vector with all zero components, e_i is the i-th coordinate vector (all 0's except 1 at the i-th coordinate) and Δ_k refers to the k dimensional simplex. Unless specified otherwise, ⟨·,·⟩ denotes the Euclidean dot product ⟨x, w⟩ = Σ_i x_i w_i, and ‖·‖ refers to the Euclidean norm ‖w‖ := (⟨w, w⟩)^{1/2}. We denote R̄ := R ∪ {∞}, and [t] := {1, …, t}.

Our paper is structured as follows. We briefly review BMRM in Section 2. Two types of lower bounds are subsequently defined in Section 3, and Section 4 contains descriptions of various counter examples that we construct. In Section 5 we describe an algorithm which provably converges to an ε accurate solution of (1) in O(1/√ε) iterations under assumption A1. The paper concludes with a discussion and outlook in Section 6. Technical proofs and a ready reckoner of the convex analysis concepts used in the paper can be found in [7, Appendix A].

2 BMRM

At every iteration, BMRM replaces Remp by a piecewise linear lower bound Rcp_k and optimizes [2]

min_w J_k(w) := (λ/2)‖w‖² + Rcp_k(w), where Rcp_k(w) := max_{1≤i≤k} ⟨w, a_i⟩ + b_i,   (3)

to obtain the next iterate w_k. Here a_i ∈ ∂Remp(w_{i−1}) denotes an arbitrary subgradient of Remp at w_{i−1} and b_i = Remp(w_{i−1}) − ⟨w_{i−1}, a_i⟩. The piecewise linear lower bound is successively tightened until the gap

ε_k := min_{0≤t≤k} J(w_t) − J_k(w_k)   (4)

falls below a predefined tolerance ε.

Since J_k in (3) is a convex objective function, one can compute its dual. 
Instead of minimizing J_k with respect to w one can equivalently maximize the dual [2] over the k dimensional simplex:

D_k(α) = −(1/2λ)‖A_k α‖² + ⟨b_k, α⟩, where α ∈ Δ_k,   (5)

²Because of the specialized nature of these solvers, lower bounds for general convex optimizers such as those studied by Nesterov [4] and Nemirovski and Yudin [5] do not apply.

Algorithm 1: qp-bmrm: solving the inner loop of BMRM exactly via full QP.
Require: Previous subgradients {a_i}_{i=1}^k and intercepts {b_i}_{i=1}^k.
1: Set A_k := (a_1, …, a_k), b_k := (b_1, …, b_k)⊤.
2: α_k ← argmax_{α∈Δ_k} {−(1/2λ)‖A_k α‖² + ⟨α, b_k⟩}.
3: return w_k = −λ⁻¹ A_k α_k.

Algorithm 2: ls-bmrm: solving the inner loop of BMRM approximately via line search.
Require: Previous subgradients {a_i}_{i=1}^k and intercepts {b_i}_{i=1}^k.
1: Set A_k := (a_1, …, a_k), b_k := (b_1, …, b_k)⊤.
2: Set α(η) := (η α_{k−1}⊤, 1 − η)⊤.
3: η_k ← argmax_{η∈[0,1]} {−(1/2λ)‖A_k α(η)‖² + ⟨α(η), b_k⟩}.
4: α_k ← (η_k α_{k−1}⊤, 1 − η_k)⊤.
5: return w_k = −λ⁻¹ A_k α_k.

and set α_k = argmax_{α∈Δ_k} D_k(α). Note that A_k and b_k in (5) are defined in Algorithm 1. Since maximizing D_k(α) is a quadratic programming (QP) problem, we call this algorithm qp-bmrm. Pseudo-code can be found in Algorithm 1.

Note that at iteration k the dual D_k(α) is a QP with k variables. As the number of iterations increases the size of the QP also increases. 
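To make the inner loop concrete, here is a minimal NumPy sketch of ls-bmrm for the hinge-loss objective (1). It is our own illustration, not code from the paper: the cutting planes are aggregated into a single vector u = A_k α_k and scalar c = ⟨b_k, α_k⟩, and the 1-D line search of Algorithm 2 is solved in closed form.

```python
import numpy as np

def hinge_risk_and_subgrad(w, X, y):
    """Average hinge loss Remp(w) and one subgradient at w."""
    margins = 1.0 - y * (X @ w)
    active = margins > 0
    risk = np.maximum(margins, 0.0).mean()
    a = -(X[active].T @ y[active]) / len(y)   # a in dRemp(w)
    return risk, a

def ls_bmrm(X, y, lam, iters=100):
    """Line-search BMRM (Algorithm 2 sketch): keep only the aggregated plane."""
    n, d = X.shape
    w = np.zeros(d)
    u, c = np.zeros(d), 0.0      # u = A_k alpha_k, c = <b_k, alpha_k>
    J_best, gaps = np.inf, []
    for k in range(iters):
        risk, a = hinge_risk_and_subgrad(w, X, y)
        J_best = min(J_best, 0.5 * lam * (w @ w) + risk)
        b = risk - a @ w         # intercept of the new cutting plane
        if k == 0:
            u, c = a, b
        else:
            # maximize D(eta) = -||eta*u + (1-eta)*a||^2/(2*lam) + eta*c + (1-eta)*b,
            # a concave 1-D quadratic with a closed-form maximizer clipped to [0, 1]
            diff = u - a
            den = diff @ diff
            eta = 0.0 if den < 1e-12 else float(
                np.clip((lam * (c - b) - a @ diff) / den, 0.0, 1.0))
            u, c = eta * u + (1 - eta) * a, eta * c + (1 - eta) * b
        w = -u / lam             # w_k = -A_k alpha_k / lam
        D = -(u @ u) / (2 * lam) + c   # dual value, a lower bound on min_w J(w)
        gaps.append(J_best - D)        # upper bound on the primal gap, cf. (4)
    return w, gaps
```

Because the line search may always pick η = 1 (keep the old aggregate) or η = 0 (take only the new plane), the dual value never decreases and the recorded gap shrinks monotonically.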
In order to avoid the growing cost of the dual optimization at each iteration, [2] proposed using a one-dimensional line search to calculate an approximate maximizer α_k on the line segment {(η α_{k−1}⊤, (1 − η))⊤ : η ∈ [0, 1]}, and we call this variant ls-bmrm. Pseudo-code can be found in Algorithm 2. We refer the reader to [2] for details.

Even though qp-bmrm solves a more expensive optimization problem D_k(α) per iteration, Teo et al. [2] could only show that both variants of BMRM converge at O(1/ε) rates:

Theorem 1 ([2]) Suppose assumption A2 holds. Then for any ε < 4G²/λ, both ls-bmrm and qp-bmrm converge to an ε accurate solution of (1) as measured by (4) after at most the following number of steps:

log₂(λJ(0)/G²) + 8G²/(λε) − 1.

Generality of BMRM: Thanks to the formulation in (3), which only uses Remp, BMRM is applicable to a wide variety of Remp. For example, when used to train binary SVMs with Remp specified by (1), it yields exactly the SVM-Perf algorithm [1]. When applied to optimize a multivariate score, e.g. F1-score with Remp specified by (2), it immediately leads to the optimizer given by [3].

3 Upper and Lower Bounds

Since most rates of convergence discussed in the machine learning community are upper bounds, it is important to rigorously define the meaning of a lower bound with respect to ε, and to study its relationship with the upper bounds. At this juncture it is also important to clarify an important technical point. Instead of minimizing the objective function J(w) defined in (1), if we minimize a scaled version cJ(w) this scales the approximation gap (4) by c. 
Assumptions such as A1 and A2 fix this degree of freedom by bounding the scale of the objective function.

Given a function f ∈ F and an optimization algorithm A, suppose {w_k} are the iterates produced by the algorithm A when minimizing f. Define T(ε; f, A) as the first step index k when w_k becomes an ε accurate solution³:

T(ε; f, A) = min{k : f(w_k) − min_w f(w) ≤ ε}.   (6)

Upper and lower bounds are both properties for a pair of F and A. A function g(ε) is called an upper bound of (F, A) if for all functions f ∈ F and all ε > 0, it takes at most order g(ε) steps for A to reduce the gap to less than ε, i.e.,

(UB) ∀ ε > 0, ∀ f ∈ F, T(ε; f, A) ≤ g(ε).   (7)

On the other hand, lower bounds can be defined in two different ways depending on how the above two universal qualifiers are flipped to existential qualifiers.

³The initial point also matters, as in the best case we can just start from the optimal solution. Thus the quantity of interest is actually T(ε; f, A) := max_{w0} min{k : f(w_k) − min_w f(w) ≤ ε, starting point being w0}. However, without loss of generality we assume some pre-specified way of initialization.

Algorithms | Assuming A1: UB, SLB, WLB | Assuming A2: UB, SLB, WLB
ls-bmrm | O(1/ε), Ω(1/ε), Ω(1/ε) | O(1/ε), Ω(1/ε), Ω(1/ε)
qp-bmrm | O(1/ε), open, open | O(1/ε), open, Ω(1/ε)
Nesterov | O(1/√ε), Ω(1/√ε), Ω(1/√ε) | n/a, n/a, n/a

Table 1: Summary of the known upper bounds and our lower bounds. Note: A1 ⇒ A2, but not vice versa. SLB ⇒ WLB, but not vice versa. UB is tight, if it matches WLB.

• Strong lower bounds (SLB): h(ε) is called a SLB of (F, A) if there exists a function f̃ ∈ F, such that for all ε > 0 it takes at least h(ε) steps for A to find an ε accurate solution of f̃:

(SLB) ∃ f̃ ∈ F, s.t. ∀ ε > 0, T(ε; f̃, A) ≥ h(ε).   (8)

• Weak lower bound (WLB): h(ε) is called a WLB of (F, A) if for any ε > 0, there exists a function f_ε ∈ F depending on ε, such that it takes at least h(ε) steps for A to find an ε accurate solution of f_ε:

(WLB) ∀ ε > 0, ∃ f_ε ∈ F, s.t. T(ε; f_ε, A) ≥ h(ε).   (9)

Clearly, the existence of a SLB implies a WLB. However, it is usually much harder to establish SLB than WLB. Fortunately, WLBs are sufficient to refute upper bounds or to establish their tightness. The size of the function class F affects the upper and lower bounds in opposite ways. Suppose F′ ⊂ F. Proving upper (resp. lower) bounds on (F′, A) is usually easier (resp. harder) than proving upper (resp. lower) bounds for (F, A).

4 Constructing Lower Bounds

Letting the minimizer of J(w) be w*, we are interested in bounding the primal gap of the iterates w_k: J(w_k) − J(w*). Datasets will be constructed explicitly whose resulting objective J(w) will be shown to attain the lower bounds of the algorithms. The Remp for both the hinge loss in (1) and the F1-score in (2) will be covered, and our results are summarized in Table 1. Note that as assumption A1 implies A2 and SLB implies WLB, some entries of the table imply others.

4.1 Strong Lower Bounds for Solving Linear SVMs using ls-bmrm

We first prove the Ω(1/ε) lower bound for ls-bmrm on SVM problems under assumption A1. 
Consider a one dimensional training set with four examples: (x_1, y_1) = (−1, −1), (x_2, y_2) = (−1/2, −1), (x_3, y_3) = (1/2, 1), (x_4, y_4) = (1, 1). Setting λ = 1/16, the regularized risk (1) can be written as (using the scalar w instead of the vector w):

min_{w∈R} J(w) = (1/32)w² + (1/2)[1 − w/2]₊ + (1/2)[1 − w]₊.   (10)

The minimizer of J(w) is w* = 2, which can be verified by the fact that 0 is in the subdifferential of J at w*: 0 ∈ ∂J(2) = {2/16 − (1/2)(1/2)α : α ∈ [0, 1]}. So J(w*) = 1/8. Choosing w_0 = 0, we have

Theorem 2 lim_{k→∞} k (J(w_k) − J(w*)) = 1/4, i.e. J(w_k) converges to J(w*) at 1/k rate.

The proof relies on two lemmata. The first shows that the iterates generated by ls-bmrm on J(w) satisfy the following recursive relations.

Lemma 3 For k ≥ 1, the following recursive relations hold true

w_{2k+1} = 2 + 8α_{2k−1,1}(w_{2k−1} − 4α_{2k−1,1}) / (w_{2k−1}(w_{2k−1} + 4α_{2k−1,1})) > 2, and w_{2k} = 2 − 8α_{2k−1,1}/w_{2k−1} ∈ (1, 2),   (11)

α_{2k+1,1} = ((w_{2k−1}² + 16α_{2k−1,1}²) / (w_{2k−1} + 4α_{2k−1,1})²) α_{2k−1,1}, where α_{2k+1,1} is the first coordinate of α_{2k+1}.   (12)

The proof is lengthy and is available at [7, Appendix B]. These recursive relations allow us to derive the convergence rate of α_{2k−1,1} and w_k (see proof in [7, Appendix C]):

Lemma 4 lim_{k→∞} k α_{2k−1,1} = 1/4. Combining with (11), we get lim_{k→∞} k|2 − w_k| = 2.

Now that w_k approaches 2 at the rate of O(1/k), it is finally straightforward to translate it into the rate at which J(w_k) approaches J(w*). 
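The stated properties of (10) are easy to verify numerically; the following snippet (ours, for illustration only) confirms the minimizer w* = 2 and the optimal value J(w*) = 1/8 by a dense grid search:

```python
import numpy as np

def J(w):
    # Objective (10): lambda = 1/16 with the four 1-D training examples above
    return w * w / 32.0 + 0.5 * max(0.0, 1.0 - w / 2.0) + 0.5 * max(0.0, 1.0 - w)

grid = np.linspace(-4.0, 6.0, 100001)      # step 1e-4; contains w = 2
vals = np.array([J(w) for w in grid])
w_star = float(grid[int(vals.argmin())])   # grid minimizer, close to 2
```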
See the proof of Theorem 2 in [7, Appendix D].

4.2 Weak Lower Bounds for Solving Linear SVMs using qp-bmrm

Theorem 1 gives an upper bound on the convergence rate of qp-bmrm, assuming that Remp satisfies the assumption A2. In this section we further demonstrate that this O(1/ε) rate is also a WLB (hence tight) even when the Remp is specialized to SVM objectives satisfying A2.

Given ε > 0, define n = ⌈1/ε⌉ and construct a dataset {(x_i, y_i)}_{i=1}^n as y_i = (−1)^i and x_i = (−1)^i (n e_{i+1} + √n e_1) ∈ R^{n+1}. Then the corresponding objective function (1) is

J(w) = ‖w‖²/2 + Remp(w), where Remp(w) = (1/n) Σ_{i=1}^n [1 − y_i⟨w, x_i⟩]₊ = (1/n) Σ_{i=1}^n [1 − √n w_1 − n w_{i+1}]₊.   (13)

It is easy to see that the minimizer is w* = (1/2)(1/√n, 1/n, …, 1/n)⊤ and J(w*) = 1/(4n). In fact, simply check that y_i⟨w*, x_i⟩ = 1, so ∂J(w*) = {w* − ((1/√n) Σ_{i=1}^n α_i, α_1, …, α_n)⊤ : α_i ∈ [0, 1]}, and setting all α_i = 1/(2n) yields the subgradient 0. Our key result is the following theorem.

Theorem 5 Let w_0 = (1/√n, 0, 0, …)⊤. Suppose running qp-bmrm on the objective function (13) produces iterates w_1, …, w_k, …. Then it takes qp-bmrm at least ⌊2/(3ε)⌋ steps to find an ε accurate solution. Formally,

min_{i∈[k]} J(w_i) − J(w*) = 1/(2k) + 1/(4n) for all k ∈ [n], hence min_{i∈[k]} J(w_i) − J(w*) > ε for all k < 2/(3ε).

Proof. Since Remp(w_0) = 0 and ∂Remp(w_0) = {−(1/n) Σ_{i=1}^n α_i y_i x_i : α_i ∈ [0, 1]}, we can choose

a_1 = −(1/n) y_1 x_1 = (−1/√n, −1, 0, …)⊤, and b_1 = Remp(w_0) − ⟨a_1, w_0⟩ = 0 + 1/n = 1/n, so

w_1 = argmin_w {(1/2)‖w‖² − (1/√n)w_1 − w_2 + 1/n} = (1/√n, 1, 0, …)⊤.

In general, we claim that the k-th iterate w_k produced by qp-bmrm is given by

w_k = (1/√n, 1/k, …, 1/k, 0, …)⊤ (k copies of 1/k).

We prove this claim by induction on k. Assume the claim holds true for steps 1, …, k; then it is easy to check that Remp(w_k) = 0 and ∂Remp(w_k) = {−(1/n) Σ_{i=k+1}^n α_i y_i x_i : α_i ∈ [0, 1]}. Thus we can again choose

a_{k+1} = −(1/n) y_{k+1} x_{k+1}, and b_{k+1} = Remp(w_k) − ⟨a_{k+1}, w_k⟩ = 1/n, so

w_{k+1} = argmin_w {(1/2)‖w‖² + max_{1≤i≤k+1} {⟨a_i, w⟩ + b_i}} = (1/√n, 1/(k+1), …, 1/(k+1), 0, …)⊤ (k+1 copies of 1/(k+1)),

which can be verified by checking that ∂J_{k+1}(w_{k+1}) = {w_{k+1} + Σ_{i∈[k+1]} α_i a_i : α ∈ Δ_{k+1}} ∋ 0. All that remains is to observe that J(w_k) = 1/(2k) + 1/(2n) while J(w*) = 1/(4n), from which it follows that J(w_k) − J(w*) = 1/(2k) + 1/(4n) as claimed.

Indeed, after taking n steps, w_n will cut a subgradient a_{n+1} = 0 and b_{n+1} = 0, and then the minimizer of J_{n+1}(w) gives exactly w*.

As an aside, the subgradient of the Remp in (13) does have Euclidean norm √(2n) at w = 0. However, in the above run of qp-bmrm, ∂Remp(w_0), …, ∂Remp(w_n) always contains a subgradient with norm 1. So if we restrict the feasible region to {n^{−1/2}} × [0, ∞]^n, then J(w) does satisfy the assumption A2 and the optimal solution does not change. This is essentially a local satisfaction of A2. In fact, having a bounded subgradient of Remp at all w_k is sufficient for qp-bmrm to converge at the rate in Theorem 1.

However, when we assume A1, which is more restrictive than A2, it remains an open question to determine whether the O(1/ε) rates are optimal for qp-bmrm on SVM objectives. Also left open is the SLB for qp-bmrm on SVMs.

4.3 Weak Lower Bounds for Optimizing F1-score using qp-bmrm

F1-score is defined by using the contingency table: F1(ȳ, y) := 2a/(2a + b + c), where a counts the examples with ȳ = 1, y = 1; b those with ȳ = 1, y = −1; c those with ȳ = −1, y = 1; and d those with ȳ = −1, y = −1.

Given ε > 0, define n = ⌈1/ε⌉ + 1 and construct a dataset {(x_i, y_i)}_{i=1}^n as follows: x_i = −(n/(2√3)) e_1 − (n/2) e_{i+1} ∈ R^{n+1} with y_i = −1 for all i ∈ [n−1], and x_n = (√3 n/2) e_1 + (n/2) e_{n+1} ∈ R^{n+1} with y_n = +1. So there is only one positive training example. Then the corresponding objective function is

J(w) = (1/2)‖w‖² + max_ȳ [1 − F1(y, ȳ) + (1/n) Σ_{i=1}^n y_i⟨w, x_i⟩(y_i ȳ_i − 1)].   (14)

Theorem 6 Let w_0 = (1/√3) e_1. Then qp-bmrm takes at least ⌊1/(3ε)⌋ steps to find an ε accurate solution:

min_{i∈[k]} J(w_i) − min_w J(w) ≥ (1/2)(1/k − 1/(n−1)) for all k ∈ [n−1], hence min_{i∈[k]} J(w_i) − min_w J(w) > ε for all k < 1/(3ε).

Proof. A rigorous proof can be found in [7, Appendix E]; we provide a sketch here. 
The crux is to show

w_k = (1/√3, 1/k, …, 1/k, 0, …)⊤ (k copies of 1/k) for all k ∈ [n−1].   (15)

We prove (15) by induction. Assume it holds for steps 1, …, k. Then at step k + 1 we have

(1/n) y_i⟨w_k, x_i⟩ = 1/6 + 1/(2k) if i ∈ [k]; 1/6 if k+1 ≤ i ≤ n−1; 1/2 if i = n.   (16)

For convenience, define the term in the max in (14) as

Υ_k(ȳ) := 1 − F1(y, ȳ) + (1/n) Σ_{i=1}^n y_i⟨w_k, x_i⟩(y_i ȳ_i − 1).

Then it is not hard to see that the following assignments of ȳ (among others) maximize Υ_k: a) the correct labeling, b) only misclassify the positive training example x_n (i.e., ȳ_n = −1), c) only misclassify one negative training example in x_{k+1}, …, x_{n−1} into positive. And Υ_k equals 0 at all these assignments. For a proof, consider two cases. If ȳ misclassifies the positive training example, then F1(y, ȳ) = 0 and by (16) we have

Υ_k(ȳ) = 1 − 0 + ((k+3)/(6k)) Σ_{i=1}^k (y_i ȳ_i − 1) + (1/6) Σ_{i=k+1}^{n−1} (y_i ȳ_i − 1) + (1/2)(−1 − 1) = ((k+3)/(6k)) Σ_{i=1}^k (y_i ȳ_i − 1) + (1/6) Σ_{i=k+1}^{n−1} (y_i ȳ_i − 1) ≤ 0,

where (k+3)/(6k) = 1/6 + 1/(2k). Suppose ȳ correctly labels the positive example, but misclassifies t_1 examples in x_1, …, x_k and t_2 examples in x_{k+1}, …, x_{n−1} (into positive). Then F1(y, ȳ) = 2/(2 + t_1 + t_2), and

Υ_k(ȳ) = 1 − 2/(2 + t_1 + t_2) + (1/6 + 1/(2k)) Σ_{i=1}^k (y_i ȳ_i − 1) + (1/6) Σ_{i=k+1}^{n−1} (y_i ȳ_i − 1) = (t_1 + t_2)/(2 + t_1 + t_2) − (1/3 + 1/k) t_1 − (1/3) t_2 ≤ t/(2 + t) − t/3 = (t − t²)/(3(2 + t)) ≤ 0 (t := t_1 + t_2).

So we can pick ȳ = (−1, …, −1, +1, −1, …, −1, +1)⊤ (k copies of −1, then +1 in position k+1, then −1's, then +1 in position n), which only misclassifies x_{k+1}, and get

a_{k+1} = −(2/n) y_{k+1} x_{k+1} = −(1/√3) e_1 − e_{k+2}, and b_{k+1} = Remp(w_k) − ⟨a_{k+1}, w_k⟩ = 0 + 1/3 = 1/3, so

w_{k+1} = argmin_w {(1/2)‖w‖² + max_{i∈[k+1]} {⟨a_i, w⟩ + b_i}} = (1/√3, 1/(k+1), …, 1/(k+1), 0, …)⊤ (k+1 copies of 1/(k+1)),

which can be verified by ∂J_{k+1}(w_{k+1}) = {w_{k+1} + Σ_{i=1}^{k+1} α_i a_i : α ∈ Δ_{k+1}} ∋ 0 (just set all α_i = 1/(k+1)). So (15) holds for step k + 1. End of induction.

All that remains is to observe that J(w_k) = (1/2)(1/3 + 1/k) while min_w J(w) ≤ J(w_{n−1}) = (1/2)(1/3 + 1/(n−1)), from which it follows that J(w_k) − min_w J(w) ≥ (1/2)(1/k − 1/(n−1)) as claimed in Theorem 6.

5 An O(nd/√ε) Algorithm for Training Binary Linear SVMs

The lower bounds we proved above show that CPM such as BMRM require Ω(1/ε) iterations to converge. 
We now show that this is an inherent limitation of CPM and not an artifact of the problem. To demonstrate this, we will show that one can devise an algorithm for problems (1) and (2) which will converge in O(1/√ε) iterations. The key difficulty stems from the non-smoothness of the objective function, which renders second and higher order algorithms such as L-BFGS inapplicable. However, thanks to [7, Theorem 7 in Appendix A], the Fenchel dual of (1) is a convex smooth function with a Lipschitz continuous gradient, which is easy to optimize.

To formalize the idea of using the Fenchel dual, we can abstract from the objectives (1) and (2) a composite form of objective functions used in machine learning with linear models:

min_{w∈Q1} J(w) = f(w) + g⋆(Aw), where Q1 is a closed convex set.   (17)

Here, f(w) is a strongly convex function corresponding to the regularizer, Aw stands for the output of a linear model, and g⋆ encodes the empirical risk measuring the discrepancy between the correct labels and the output of the linear model. Let the domain of g be Q2. It is well known that [e.g. 8, Theorem 3.3.5] under some mild constraint qualifications, the adjoint form of J(w):

D(α) = −g(α) − f⋆(−A⊤α), α ∈ Q2   (18)

satisfies J(w) ≥ D(α) and inf_{w∈Q1} J(w) = sup_{α∈Q2} D(α).

Example 1: binary SVMs with bias. Let A := −Y X⊤ where Y := diag(y_1, …, y_n) and X := (x_1, …, x_n), f(w) = (λ/2)‖w‖², and g⋆(u) = min_{b∈R} (1/n) Σ_{i=1}^n [1 + u_i − y_i b]₊, which corresponds to g(α) = −Σ_i α_i. Then the adjoint form turns out to be the well known SVM dual objective function:

D(α) = Σ_i α_i − (1/2λ) α⊤ Y X⊤ X Y α, α ∈ Q2 = {α ∈ [0, n⁻¹]^n : Σ_i y_i α_i = 0}.   (19)

Example 2: multivariate scores. 
Denote A as a 2^n-by-d matrix where the ȳ-th row is (1/n) Σ_{i=1}^n x_i⊤(ȳ_i − y_i) for each ȳ ∈ {−1, +1}^n, f(w) = (λ/2)‖w‖², and g⋆(u) = max_ȳ [Δ(y, ȳ) + u_ȳ], which corresponds to g(α) = −n Σ_ȳ Δ(y, ȳ)α_ȳ; then we recover the primal objective (2) for multivariate performance measures. Its adjoint form is

D(α) = −(1/2λ) α⊤ A A⊤ α + n Σ_ȳ Δ(y, ȳ)α_ȳ, α ∈ Q2 = {α ∈ [0, n⁻¹]^{2^n} : Σ_ȳ α_ȳ = 1/n}.   (20)

In a series of papers [6, 9, 10], Nesterov developed optimal gradient based methods for minimizing the composite objectives with primal (17) and adjoint (18). A sequence of w_k and α_k is produced such that under assumption A1 the duality gap J(w_k) − D(α_k) is reduced to less than ε after at most k = O(1/√ε) steps. We refer the readers to [9, 11] for details.

These methods rely on a projection step. Given a differentiable convex function F on Q2, a point α, and a direction g, the Bregman projection used below is defined as

V(α, g) := argmin_{ᾱ∈Q2} F(ᾱ) − ⟨∇F(α) − g, ᾱ⟩,   (21)

and for multivariate scores the algorithm maintains a distribution over all possible labelings, p(ȳ; w) ∝ exp(cΔ(ȳ, y) + Σ_i a_i ⟨x_i, w⟩ ȳ_i), where c and a_i are constant scalars.

5.1 Efficient Projections in Training SV Models with Optimal Gradient Methods

However, applying Nesterov's algorithm is challenging, because it requires an efficient subroutine for computing projections onto the set of constraints Q2. 
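For the SVM dual (19), Q2 is a box intersected with a single hyperplane. As an illustration (our own sketch, not the exact O(n) method of [11]), the Euclidean projection onto such a set can be computed by bisection on the Lagrange multiplier θ of the equality constraint, using the KKT stationarity condition α = clip(z − θy, 0, C):

```python
import numpy as np

def project_box_hyperplane(z, y, C, tol=1e-10):
    """Euclidean projection of z onto {a in [0, C]^n : <y, a> = 0}, y_i = +-1."""
    def alpha(theta):
        # KKT stationarity of min ||a - z||^2/2 s.t. box + hyperplane constraints
        return np.clip(z - theta * y, 0.0, C)
    def h(theta):
        # residual of the equality constraint; nonincreasing in theta
        return y @ alpha(theta)
    lo = -(np.abs(z).max() + C + 1.0)   # h(lo) >= 0
    hi = np.abs(z).max() + C + 1.0      # h(hi) <= 0
    while hi - lo > tol:                # bisect for the root of h
        mid = 0.5 * (lo + hi)
        if h(mid) > 0:
            lo = mid
        else:
            hi = mid
    return alpha(0.5 * (lo + hi))
```

Each evaluation of h costs O(n), so this sketch runs in O(n log((‖z‖ + C)/tol)) time; the exact O(n) algorithm of [11, Section 5.5.1] removes the logarithmic factor.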
This projection can be either an Euclidean projection or a Bregman projection.

Example 1: binary SVMs with bias. In this case we need to compute the Euclidean projection onto Q2 defined by (19), which entails solving a Quadratic Programming problem with a diagonal Hessian, many box constraints, and a single equality constraint. We present an O(n) algorithm for this task in [11, Section 5.5.1]. Plugging this into the algorithm described in [9], and noting that all intermediate steps of the algorithm can be computed in O(nd) time, directly yields an O(nd/√ε) algorithm. A more detailed description of the algorithm is available in [11].

Example 2: multivariate scores. Since the dimension of Q2 in (20) is exponentially large in n, Euclidean projection is intractable and we resort to the Bregman projection V(α, g) defined in (21). Scaling up α by a factor of n, we can choose F(α) in (21) as the negative entropy F(α) = Σ_i α_i log α_i. Then the application of the algorithm in [9] will endow the distribution p(ȳ; w) over all possible labelings defined earlier. The solver will request the expectation E_ȳ[Σ_i a_i x_i ȳ_i], which in turn requires the marginal distributions p(ȳ_i). This is not as straightforward as in graphical models because Δ(ȳ, y) may not decompose. Fortunately, for multivariate scores defined by contingency tables, it is possible to compute the marginals in O(n²) time by using dynamic programming, and this cost is similar to that of the algorithm proposed by [3]. The detail of the dynamic programming is given in [11, Section 5.4].

6 Outlook and Conclusion

CPM are widely employed in machine learning, especially in the context of structured prediction [12]. 
While upper bounds on their rates of convergence were known, lower bounds were not studied before. In this paper we set out to fill this gap by exhibiting counter examples in binary classification on which CPM require $\Omega(1/\epsilon)$ iterations. Our examples are substantially different from the one in [13], which requires an increasing number of classes. The $\Omega(1/\epsilon)$ lower bound is a fundamental limitation of these algorithms and not an artifact of the problem. We show this by devising an $O(1/\sqrt{\epsilon})$ algorithm borrowing techniques from [9]. However, this algorithm assumes that the dataset is contained in a ball of bounded radius (assumption A1, Section 1). Devising an $O(1/\sqrt{\epsilon})$ algorithm under the less restrictive assumption A2 remains an open problem.

It is important to note that the linear time algorithm in [11, Section 5.5.1] is the key to obtaining the $O(nd/\sqrt{\epsilon})$ computational complexity for binary SVMs with bias mentioned in Section 5.1. However, this method has been rediscovered independently by many authors (including us), the earliest known reference to the best of our knowledge being [14] in 1990. Some recent work in optimization [15] has focused on improving its practical performance, while in machine learning [16] gave an expected linear time algorithm via randomized median finding.

Choosing an optimizer for a given machine learning task is a trade-off between a number of potentially conflicting requirements. CPM are one popular choice, but there are others. If one is interested in classification accuracy alone, without requiring deterministic guarantees, then online-to-batch conversion techniques combined with stochastic subgradient descent are a good choice [17]. While the dependence on $\epsilon$ is still $\Omega(1/\epsilon)$ or worse [18], one obtains bounds independent of $n$. However, as we pointed out earlier, these algorithms are applicable only when the empirical risk decomposes over the examples.

On the other hand, one can employ coordinate descent in the dual, as is done in the Sequential Minimal Optimization (SMO) algorithm of [19]. However, as [20] show, if the kernel matrix $X^\top X$, obtained by stacking the $x_i$ into a matrix $X$, is not strictly positive definite, then SMO requires $O(n/\epsilon)$ iterations, with each iteration costing $O(nd)$ effort. When the kernel matrix is strictly positive definite, one can obtain an $O(n^2 \log(1/\epsilon))$ bound on the number of iterations, which has better dependence on $\epsilon$ but is prohibitively expensive for large $n$. Even better dependence on $\epsilon$ can be achieved with interior point methods [21], which require only $O(\log \log(1/\epsilon))$ iterations, but their time complexity per iteration is $O(\min\{n^2 d, d^2 n\})$.

References

[1] T. Joachims. Training linear SVMs in linear time. In Proc. ACM Conf. Knowledge Discovery and Data Mining (KDD), pages 217–226, 2006.

[2] C. H. Teo, S. V. N. Vishwanathan, A. J. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. J. Mach. Learn. Res., 11:311–365, January 2010.

[3] T. Joachims. A support vector method for multivariate performance measures. In Proc. Intl. Conf. Machine Learning, pages 377–384, 2005.

[4] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2003.

[5] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley and Sons, 1983.

[6] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Soviet Math. Dokl., 269:543–547, 1983.

[7] Xinhua Zhang, Ankan Saha, and S. V. N.
Vishwanathan. Lower bounds on rate of convergence of cutting plane methods (long version). Technical report, 2010. http://www.stat.purdue.edu/~vishy/papers/ZhaSahVis10_long.pdf.

[8] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and Examples. CMS Books in Mathematics. Canadian Mathematical Society, 2000.

[9] Y. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1):235–249, 2005. ISSN 1052-6234.

[10] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, CORE Discussion Paper, UCL, 2007.

[11] Xinhua Zhang, Ankan Saha, and S. V. N. Vishwanathan. Regularized risk minimization by Nesterov's accelerated gradient methods: Algorithmic extensions and empirical studies. Technical report arXiv:1011.0472, 2010. http://arxiv.org/abs/1011.0472.

[12] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484, 2005.

[13] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting-plane training of structural SVMs. Machine Learning Journal, 77(1):27–59, 2009.

[14] P. M. Pardalos and N. Kovoor. An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds. Mathematical Programming, 46:321–328, 1990.

[15] Y.-H. Dai and R. Fletcher. New algorithms for singly linearly constrained quadratic programs subject to lower and upper bounds. Mathematical Programming: Series A and B, 106(3):403–421, 2006.

[16] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proc. Intl. Conf. Machine Learning, pages 272–279, 2008.

[17] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. Intl. Conf. Machine Learning, pages 807–814, 2007.

[18] A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. Wainwright. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Neural Information Processing Systems, pages 1–9, 2009.

[19] J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998.

[20] N. List and H. U. Simon. SVM-optimization and steepest-descent line search. In S. Dasgupta and A. Klivans, editors, Proc. Annual Conf. Computational Learning Theory, 2009.

[21] M. C. Ferris and T. S. Munson. Interior-point methods for massive support vector machines. SIAM Journal on Optimization, 13(3):783–804, 2002.