{"title": "A Polylog Pivot Steps Simplex Algorithm for Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 629, "page_last": 637, "abstract": "We present a simplex algorithm for linear programming in a linear classification formulation. The paramount complexity parameter in linear classification problems is called the margin. We prove that for margin values of practical interest our simplex variant performs a polylogarithmic number of pivot steps in the worst case, and its overall running time is near linear. This is in contrast to general linear programming, for which no sub-polynomial pivot rule is known.", "full_text": "A provably ef\ufb01cient simplex algorithm for\n\nclassi\ufb01cation\n\nElad Hazan \u2217\n\nHaifa, 32000\n\nTechnion - Israel Inst. of Tech.\n\nehazan@ie.technion.ac.il\n\nZohar Karnin\nYahoo! Research\n\nHaifa\n\nzkarnin@ymail.com\n\nAbstract\n\nWe present a simplex algorithm for linear programming in a linear classi\ufb01cation\nformulation. The paramount complexity parameter in linear classi\ufb01cation prob-\nlems is called the margin. We prove that for margin values of practical interest\nour simplex variant performs a polylogarithmic number of pivot steps in the worst\ncase, and its overall running time is near linear. This is in contrast to general linear\nprogramming, for which no sub-polynomial pivot rule is known.\n\n1\n\nIntroduction\n\nLinear programming is a fundamental mathematical model with numerous applications in both com-\nbinatorial and continuous optimization. The simplex algorithm for linear programming is a corner-\nstone of operations research. Despite being one of the most useful algorithms ever designed, not\nmuch is known about its theoretical properties.\nAs of today, it is unknown whether a variant of the simplex algorithm (de\ufb01ned by a pivot rule) exists\nwhich makes it run in strongly polynomial time. 
Further, the simplex algorithm, being a geometrical algorithm that is applied to polytopes defined by linear programs, relates to deep questions in geometry. Perhaps the most famous of these is the “polynomial Hirsch conjecture”, which states that the diameter of a polytope is polynomial in its dimension and the number of its facets.\nIn this paper we analyze a simplex-based algorithm which is guaranteed to run in worst-case polynomial time for a large class of practically interesting linear programs that arise in machine learning, namely linear classification problems. Further, our simplex algorithm performs only a polylogarithmic number of pivot steps and has an overall near-linear running time. The only previously known polynomial-time simplex algorithm performs a polynomial number of pivot steps [KS06].\n\n1.1 Related work\n\nThe simplex algorithm for linear programming was invented by Dantzig [Dan51]. In the sixty years that have passed, numerous attempts have been made to devise a polynomial time simplex algorithm. Various authors have proved polynomial bounds on the number of pivot steps required by simplex variants for inputs that are generated by various distributions, see e.g. [Meg86] as well as the articles referenced therein. However, worst-case bounds have eluded researchers for many years.\nA breakthrough in the theoretical analysis of the simplex algorithm was obtained by Spielman and Teng [ST04], who have shown that its smoothed complexity is polynomial, i.e. that the expected running time under a polynomially small perturbation of an arbitrary instance is polynomial. 
Kelner and Spielman [KS06] have used similar techniques to provide a worst-case polynomial time simplex algorithm.\n\n∗Work conducted at and funded by the Technion-Microsoft Electronic Commerce Research Center\n\nIn this paper we take another step toward explaining the success of the simplex algorithm: we show that for one of the most important and widely used classes of linear programs, a simplex algorithm runs in near linear time.\nWe note that more efficient algorithms for linear classification exist, e.g. the optimal algorithm of [CHW10]. The purpose of this paper is to expand our understanding of the simplex method, rather than to obtain a more efficient algorithm for classification.\n\n2 Preliminaries\n\n2.1 Linear classification\n\nLinear classification is a fundamental primitive of machine learning, and is ubiquitous in applications. Formally, we are given a set of vector-label pairs {(Ai, yi) | i ∈ [n]}, such that each Ai ∈ Rd has ℓ2 (Euclidean) norm at most one and yi ∈ {−1, +1}. The goal is to find a hyperplane x ∈ Rd that partitions the vectors into two disjoint subsets according to their sign, i.e. sign(⟨Ai, x⟩) = yi. W.l.o.g. we can assume that all labels are positive, by negating the vectors that carry negative labels, i.e. ∀i, yi = 1.\nLinear classification can be written as a linear program as follows:\n\nfind x ∈ Rd s.t. ∀i ∈ [n], ⟨Ai, x⟩ > 0\n\n(1)\n\nThe original linear classification problem is separable, i.e. there exists a separating hyperplane, if and only if the above linear program has a feasible solution. Further, any linear program in standard form can be written in linear classification form (1) by elementary manipulations and the addition of a single variable (see [DV08] for more details).\nHenceforth we refer to a linear program in format (1) by its coefficient matrix A. 
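To make the reduction to format (1) concrete, here is a minimal Python sketch; the toy data and the candidate separator x are hypothetical illustrations, not taken from the paper. It folds the labels into the vectors so every constraint reads ⟨Ai, x⟩ > 0, then checks a candidate x:\n
```python
import numpy as np

# Hypothetical toy data: rows with Euclidean norm at most one.
A = np.array([[0.6, 0.2],
              [0.1, 0.7],
              [-0.5, -0.3]])
y = np.array([1, 1, -1])

# Reduction to format (1): negate the vectors with negative labels,
# so the feasibility problem becomes: find x with <A_i, x> > 0 for all i.
A_pos = A * y[:, None]

# A candidate separator; normalizing ||x|| = 1 makes the margin well defined.
x = np.array([1.0, 1.0])
x /= np.linalg.norm(x)

# min_i <A_i, x> is the margin this x attains; positive means it separates.
margins = A_pos @ x
print(margins.min())
print(bool(np.all(np.sign(A @ x) == y)))
```
\nThe second check is the original sign condition sign(⟨Ai, x⟩) = yi; it holds exactly when every folded inner product is positive.\n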
All vectors are column vectors, and we denote inner products by ⟨x, y⟩. A parameter of paramount importance to linear classification is the margin, defined as follows.\nDefinition 1. The margin of a linear program in format (1), such that ∀i, ‖Ai‖ ≤ 1, is defined as\n\nλ = λ(A) = max_{‖x‖ ≤ 1} min_{i ∈ [n]} ⟨Ai, x⟩\n\nWe say that the instance A is a λ-margin LP.\n\nNotice that we have restricted x as well as the rows of A to have bounded norm, since otherwise the margin is ill-defined, as it can change by scaling of x. Intuitively, the larger the margin, the easier the linear program is to solve.\nWhile any linear program can be converted to an equivalent one in form (1), the margin can be exponentially small in the representation. However, in practical applications the margin is usually a constant independent of the problem dimensions; a justification is given next. Therefore we henceforth treat the margin as a separate parameter of the linear program, and devise efficient algorithms for solving it when the margin is a constant independent of the problem dimensions.\n\nSupport vector machines - why is the margin large? In real-world problems the data is seldom separable. This is due to many reasons, most prominently noise and modeling errors. Hence practitioners settle for approximate linear classifiers. Finding a linear classifier that minimizes the number of classification errors is NP-hard, and inapproximable [FGKP06]. The relaxation of choice is to minimize the sum of errors; it is called “soft-margin SVM” (Support Vector Machine) [CV95], and is one of the most widely used algorithms in machine learning. Formally, a soft-margin SVM instance is given by the following mathematical program:\n\nmin Σi ξi\ns.t. ∀i ∈ [n], yi(⟨x, Ai⟩ + b) + ξi ≥ 0\n‖x‖ ≤ 1\n\n(2)\n\nThe norm constraint on x is usually taken to be the Euclidean norm, but other norms are also common, such as ℓ1 or ℓ∞ constraints, which give rise to linear programs.\nIn this paper we discuss the separable case (formulation (1)) alone. The non-separable case turns out to be much easier when we allow an additive loss of a small constant to the margin. We elaborate on this point in Section 6.1. We will restrict our attention to the case where the bounding norm of x is the ℓ2 norm, as it is the most common case.\n\n2.2 Linear Programming and Smoothed analysis\n\nSmoothed analysis was introduced in [ST04] to explain the excellent performance of the simplex algorithm in practice. A σ-smooth LP is an LP in which each coefficient is perturbed by Gaussian noise of variance σ2.\nIn their seminal paper, Spielman and Teng proved the existence of a simplex algorithm that solves a σ-smooth LP in polynomial time (polynomial also in σ−1). Subsequently, Vershynin [Ver09] presented a simpler algorithm and significantly improved the running time. In the next sections we will compare our results to the mentioned papers and point out a crucial lemma, used in both papers, that will also be used here.\n\n2.3 Statement of our results\n\nFor a separable SVM instance of n examples in a space of d dimensions and margin λ, we provide a simplex algorithm with at most poly(log(n), λ−1) many pivot steps. 
Our statement is given for the ℓ2-SVM case, that is, the case where the vector x (see Definition 1) has a bounded ℓ2 norm. The algorithm achieves a solution with margin O(√(log(n)/d)) when viewed as a separator in the d-dimensional space. However, in an alternative yet (practically) equivalent view, the margin of the solution is in fact arbitrarily close to λ.\nTheorem 1. Let L be a separable ℓ2-SVM instance of dimension d with n examples and margin λ. Assume that λ > c1·√(log(n)/d), where c1 is some sufficiently large universal constant. Let 0 < ε < λ be a parameter. The simplex algorithm presented in this paper requires Õ(nd) preprocessing time and poly(ε−1, log(n)) pivot steps. The algorithm outputs a subspace V ⊆ Rd of dimension k = Θ(log(n)/ε2) and a hyperplane within it. The margin of the solution when viewed as a hyperplane in Rd is O(√(log(n)/d)). When projecting the data points onto V, the margin of the solution is λ − ε.\n\nIn words, the above theorem states that when viewed as a classification problem, the obtained margin is almost optimal. We note that when classifying a new point one does not have to project it onto the subspace V, but rather assigns a sign according to the classifying hyperplane in Rd.\n\nTightness of the Generalization Bound At first sight it may seem that our result gives a weak generalization bound, since the margin obtained in the original dimension is low. However, the margin of the found solution in the reduced dimension (i.e., within V) is almost optimal (i.e., λ − ε, where λ is the optimal margin). It follows that the generalization bound is essentially the same as the one obtained by an exact solution.\n\nLP perspective and the smoothed analysis framework As mentioned earlier, any linear program can be viewed as a classification LP by introducing a single new variable. 
Furthermore, any solution with a positive margin translates into an optimal solution to the original LP. Our algorithm solves the classification LP in a sub-optimal manner, in the sense that it does not find a separator with an optimal margin. However, from the perspective of a general LP solver1, the solution is optimal, as any positive margin suffices. It stands to reason that in many practical settings the margin of the solution is constant, or polylogarithmically small at worst. In such cases, our simplex algorithm solves the LP using at most a polylogarithmic number of pivot steps. We further mention that without the large margin assumption, in the smoothed analysis framework it is known ([BD02], Lemma 6.2) that the margin is w.h.p. polynomially bounded by the parameters. Hence, our algorithm runs in polynomial time in the smoothed analysis framework as well.\n\n1The statement is true only for feasibility LPs. However, any LP can be transformed into a feasibility LP by performing a binary search for its solution value.\n\n3 Our Techniques\n\nThe process involves five preliminary steps: reducing the dimension, adding artificial constraints to bound the norm of the solution, perturbing the low dimensional LP, finding a feasible point, and shifting the polytope. The process of reducing the dimension is standard. We use the Johnson-Lindenstrauss Lemma [JL84] to reduce the dimension of the data points from d to k = O(log(n)/ε2), where ε is an error parameter that can be considered a constant. This step reduces the time complexity by reducing both the number and the running time of the pivot steps. In order to bound the ℓ2 norm of the original vector, we bound the ℓ∞ norm of the low dimensional vector. This will eventually result in a multiplicative loss of √(log k) to the margin. 
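The dimension reduction step can be sketched as follows; this is a minimal illustration that uses a plain scaled Gaussian matrix as a stand-in for the projection onto a uniformly chosen subspace, and the sizes d, k, n are arbitrary illustrative values:\n
```python
import numpy as np

rng = np.random.default_rng(0)

d, k, n = 1000, 100, 200

# Unit-norm data points in R^d.
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)

# Johnson-Lindenstrauss style map: a k x d Gaussian matrix scaled by
# 1/sqrt(k) preserves norms (and hence margins) up to small distortion w.h.p.
M = rng.standard_normal((k, d)) / np.sqrt(k)
A_low = A @ M.T

# Every projected point keeps norm close to 1.
distortion = np.abs(np.linalg.norm(A_low, axis=1) - 1.0)
print(distortion.max())
```
\nIn the paper the projection is an orthonormal one followed by a √(d/k) rescaling, which has the same norm-preserving effect; the Gaussian matrix here is only a convenient stand-in.\n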
We note that we could have avoided this loss by bounding the ℓ1 norm of the vector, at the cost of a more technically involved proof. Specifically, one should bound the ℓ1 norm of the embedding of the vector into a space where the ℓ1 and ℓ2 norms behave similarly, up to a multiplicative distortion of 1 ± ε. Such an embedding of ℓ2^k into ℓ1^K exists for K = O(k/ε2) [Ind00]. Another side effect is a larger constant in the polynomial dependence on ε in the running time.\nThe perturbation step involves adding a random Gaussian noise vector to the matrix of constraints, where the amplitude of each row is determined by the norm of the corresponding constraint vector. This step ensures the bound on the number of pivot steps performed by the simplex algorithm. In order to find a feasible point we exploit the fact that when the margin is allowed to be negative, there is always a feasible solution. We prove, for a fixed set of constraints, one of which is a negative lower bound on the margin, that the corresponding point v0 is not only feasible but is the unique optimal solution for a fixed direction. The direction is independent of the added noise, which is a necessary property when bounding the number of pivot steps.\nOur final step is a shift of the polytope. Since we use the shadow vertex pivot rule, we must have an LP instance for which 0 is an interior point of the polytope. This property does not hold for our polytope, as the LP contains inequalities of the form ⟨a, x⟩ ≥ 0. However, we prove that both 0 and v0 are feasible solutions to the LP that do not share a common facet. 
Hence, their average is an interior point of the polytope, and a shift by −v0/2 ensures that 0 is an interior point as required.\nOnce the preprocessing is done we solve the LP via the shadow vertex method, which is guaranteed to finish after a polylogarithmic number of pivot steps. Given a sufficiently small additive noise and a sufficiently large target dimension, we are guaranteed that the obtained solution is an almost optimal solution to the unperturbed low dimensional problem, and an Õ(√(k/d)) approximation to the higher dimensional problem.\n\n4 Tool Set\n\n4.1 Dimension reduction\n\nThe Johnson-Lindenstrauss Lemma [JL84] asserts that one can project vectors onto a lower dimensional space and roughly preserve their norms, pairwise distances and inner products. The following is an immediate consequence of Theorem 2.1 and Lemma 2.2 of [DG03].\nTheorem 2. Let ε > 0 and let k, d be integers where d > k > 9/ε2. Consider a linear projection M : Rd → Rk onto a uniformly chosen subspace2. For any pair of fixed vectors u, v ∈ Rd where ‖u‖, ‖v‖ ≤ 1, it holds that\n\nPr[ |‖u‖2 − ‖Mu‖2| > ε ] < exp(−kε2/9)\n\nPr[ |⟨u, v⟩ − ⟨Mu, Mv⟩| > 3ε ] < 3 exp(−kε2/9)\n\n4.2 The number of vertices in the shadow of a perturbed polytope\n\nA key lemma in the papers of [ST04, Ver09] is a bound on the expected number of vertices in the projection of a perturbed polytope onto a plane. The following geometric theorem will be used in our paper:\n\n2Alternatively, M can be viewed as the composition of a random rotation U followed by taking the first k coordinates\n\nTheorem 3 ([Ver09] Theorem 6.2). 
Let A1, ..., An be independent Gaussian vectors in Rd with centers of norm at most 1, and whose variance satisfies\n\nσ2 ≤ 1/(36 d log n)\n\nLet E be a fixed plane in Rd. Then the random polytope P = conv(0, A1, ..., An) satisfies\n\nE[ |edges(P ∩ E)| ] = O(d3 σ−4)\n\n4.3 The shadow vertex method\n\nThe shadow vertex method is a pivot rule used to solve LPs. In order to apply it, the polytope of the LP must have 0 as an interior point. Algebraically, all the inequalities must be of the form ⟨a, x⟩ ≤ 1 (or alternatively ⟨a, x⟩ ≤ b where b > 0). The input consists of a feasible point v in the polytope and a direction u in which it is farthest, compared to all other feasible points. In a nutshell, the method involves gradually turning the vector u towards the target direction c, while traversing through the optimal solutions for the intermediate directions at every stage. For more on the shadow vertex method we refer the reader to [ST04], Section 3.2.\nThe manner in which Theorem 3 is used, both in the above mentioned papers and the current one, is the following. Consider an LP of the form\n\nmax c⊤x s.t. ∀i ∈ [n], ⟨Ai, x⟩ ≤ 1\n\nWhen solving the LP via the shadow vertex method, the number of pivot steps is upper bounded by the number of edges in P ∩ E, where P = conv(0, A1, ..., An) and E is the plane spanned by the target direction c and the initial direction u obtained in the phase-1 step.\n\n5 Algorithm and Analysis\n\nOur simplex variant is defined in Algorithm 1 below. It is composed of projecting the polytope onto a lower dimension, adding noise, finding an initial vertex (Phase 1), shifting, and applying the shadow vertex simplex algorithm [GS55].\nTheorem 4. 
Algorithm 1 performs an expected number of O(poly(log n, 1/λ)) pivot steps. On an instance A with λ-margin it returns, with probability at least 1 − O(1/k + 1/n), a feasible solution ¯x with margin Ω(λ√k / √(d log k)).\n\nNote that the algorithm requires knowledge of λ. This can be overcome with a simple binary search.\nTo prove Theorem 4, we first prove several auxiliary lemmas. Due to space restrictions, some of the proofs are replaced with a brief sketch.\nLemma 5. With probability at least 1 − 1/k there exists a feasible solution to LPbounded, denoted (ˆx, τ), that satisfies τ ≥ λ − ε and ‖ˆx‖∞ ≤ 5√(log k)/√k.\n\nProof Sketch. Since A has margin λ, there exists x∗ ∈ Rd such that ∀i, ⟨Ai, x∗⟩ ≥ λ and ‖x∗‖2 = 1. We use Theorem 2 to show that the projection of x∗ has, w.h.p., both a large margin and a small ℓ∞ norm.\n\nDenote the (k+1)-dimensional noise vectors that were added in step 3 by err1, ..., errn+2k. The following lemma provides some basic facts that hold w.h.p. for the noise vectors. The proof is an easy consequence of the 2-stability of Gaussians and standard tail bounds of the Chi-Squared distribution, and is thus omitted.\nLemma 6. Let err1, ..., errn+2k be defined as above:\n\n1. w.p. at least 1 − 1/n, ∀i, ‖erri‖2 ≤ O(σ√(k log n)) ≤ 1/(20√k)\n\nAlgorithm 1 large margin simplex\n1: Input: a λ-margin LP instance A.\n2: Let ε = λ/6, k = 9·162·log(n/ε)/ε2, σ2 = 1/(100 k log k log n).\n3: (step 1: dimension reduction) Generate M ∈ Rk×d, a projection onto a random k-dimensional subspace. 
Let ˆA ∈ Rn×(k+1) be given by ˆAi = (√(d/k)·MAi, −1).\n4: (step 2: bounding ‖x‖) Add the k constraints ⟨ei, x⟩ ≥ −6√(log k)/√k, the k constraints ⟨−ei, x⟩ ≥ −6√(log k)/√k, and one additional constraint τ ≥ −8√(log k). Denote the coefficient vectors ˆAn+1, ..., ˆAn+2k and ˆA0 correspondingly. We obtain the following LP, denoted by LPbounded:\n\nmax ⟨ek+1, (x, τ)⟩ s.t. ∀i ∈ [0, ..., n+2k], ⟨ˆAi, (x, τ)⟩ ≥ bi\n\n(3)\n\n5: (step 3: adding noise) Add random, independently distributed Gaussian noise to every entry of every constraint vector except ˆA0, according to N(0, σ2·‖ˆAi‖2^2). Denote the resulting constraint vectors by ˜Ai, and the resulting LP by LPnoise.\n6: (step 4: phase-1) Let v0 ∈ Rk+1 be the vertex for which inequalities 0, n+k+1, ..., n+2k hold as equalities. Define u0 ∈ Rk+1 as u0 ≜ (1, ..., 1, −1).\n7: (step 5: shifting the polytope) For all i ∈ [0, ..., n+2k], change the value of bi to ˆbi ≜ bi + ⟨˜Ai, v0/2⟩.\n8: (step 6: shadow vertex simplex) Let E = span(u0, ek+1). Apply the shadow vertex simplex algorithm on the polygon which is the projection of conv{V}, where V = {0, ˆA0/ˆb0, ˜A1/ˆb1, ..., ˜An+2k/ˆbn+2k}, onto E. Let the solution be ˜x.\n9: return ¯x/‖¯x‖2 where ¯x = M⊤(˜x + v0/2)\n\n2. Fix some I ⊂ [n+2k] of size |I| = k and define BI to be the (k+1)×(k+1) matrix whose first k columns consist of {erri}i∈I and whose (k+1)-th column is the 0 vector. W.p. 
at least 1 − 1/n it holds that the top singular value of BI is at most 1/2. Furthermore, w.p. at least 1 − 1/n the 2-norms of the rows of BI are upper bounded by 4/√(k+1).\n\nLemma 7. Let ˆA, ˜A, and ˆx be as above. Then with probability at least 1 − O(1/k):\n\n1. for τ = λ − 2ε, the point (ˆx, τ) ∈ Rk+1 is a feasible solution of LPnoise.\n2. for every x ∈ Rk, τ ∈ R where (x, τ) is a feasible solution of LPnoise it holds that\n\n‖x‖∞ ≤ 7√(log k)/√k,  ∀i, |⟨˜Ai, (x, τ)⟩ − ⟨ˆAi, (x, τ)⟩| ≤ √(log k)/√k\n\nThe proof of this lemma is deferred to the full version of this paper.\nLemma 8. With probability 1 − O(1/k), the vector v0 is a basic feasible solution (vertex) of LPnoise.\n\nProof Sketch. The vector v0 is a basic solution as it is defined by k+1 equalities. To prove that it is feasible we exploit the fact that the last entry, corresponding to τ, is sufficiently small, and that all of the constraints are of the form ⟨a, x⟩ ≥ τ.\n\nThe next lemma provides us with a direction u0 for which v0 is the unique optimal solution w.r.t. the objective maxx∈P ⟨u0, x⟩, where P is the polytope of LPnoise. The vector u0 is independent of the added noise. This is crucial for the following steps.\nLemma 9. Let u0 = (1, ..., 1, −1). With probability at least 1 − O(1/n), the point v0 is the optimal solution w.r.t. the objective maxx∈P ⟨u0, x⟩, where P is the polytope of LPnoise.\n\nProof Sketch. The set of points u for which v0 is the optimal solution is defined by a (blunt) cone {Σi αi·ai | ∀i, αi > 0}, where ai = −˜An+k+i for i ∈ [k] and ak+1 = −ˆA0. Consider the cone corresponding to the constraints ˆA; u0 resides in its interior, far away from its borders. Specifically, u0 = Σ_{i=1}^{k} (−ˆAn+k+i) + (−ˆA0). Since the difference between ˆAi and ˜Ai is small w.h.p., we get that u0 resides, w.h.p., in the cone of points for which v0 is optimal, as required.\nLemma 10. The point v0/2 is a feasible interior point of the polytope with probability at least 1 − O(1/n).\n\nProof. By Lemma 9, v0 is a feasible point. Also, according to its definition it is clear that, w.p. 1, it lies on k+1 facets of the polytope, none of which contains the point 0. In other words, no facet contains both v0 and 0. Since 0 is clearly a feasible point of the polytope, we get that v0/2 is a feasible interior point as claimed.\n\nProof of Theorem 4. We first note that in order to use the shadow vertex method, 0 must be an interior point of the polytope. This does not happen in the original polytope, hence the shift of step 5. Indeed, according to Lemma 10, v0/2 is an interior point of the polytope, and by shifting it to 0, the shadow vertex method can indeed be implemented.\nWe will assume that the statements of the auxiliary lemmas hold. This happens with probability at least 1 − O(1/k + 1/n), which is the stated success probability of the algorithm. By Lemma 7, LPnoise has a basic feasible solution with τ ≥ λ − 2ε. The vertex v0, along with the direction u0 which it optimizes, is a feasible starting vector for the shadow vertex simplex algorithm on the plane E, and hence applying the simplex algorithm with the shadow vertex pivot rule will return a basic feasible solution in dimension k+1, denoted (˜x, τ′), for which ∀i ∈ [n], ⟨˜Ai, (˜x, τ′)⟩ ≥ 0 and τ′ ≥ λ − 2ε. Using Lemma 7 part two, we have that for all i ∈ [n],\n\n⟨ˆAi, (˜x, τ′)⟩ ≥ ⟨˜Ai, (˜x, τ′)⟩ − √(log k)/√k ≥ −ε ⇒ ⟨MAi, ˜x⟩ ≥ λ − 3ε.\n\n(4)\n\nSince ¯x = √(d/k)·M⊤˜x, we get that for all i ∈ [n], ⟨Ai, ¯x⟩ = √(d/k)·Ai⊤M⊤˜x = ⟨f(Ai), ˜x⟩ ≥ λ − 3ε, where f(Ai) = √(d/k)·MAi, and this provides a solution to the original LP.\nTo compute the margin of this solution, note that the rows of M consist of an orthonormal set. Hence, by Lemma 7, ‖M⊤˜x‖2 = ‖˜x‖2 ≤ 7√(log k), meaning that ‖¯x‖2 ≤ 7√(log(k)·d/k). It follows that the margin of the solution is at least (λ − 3ε)·√k/(7√(log(k)·d)).\nRunning time: The number of steps in this simplex stage is bounded by the number of vertices in the polygon which is the projection of the polytope of LPnoise onto the plane E = span{u0, vT}. Let V = {˜Ai}_{i=1}^{n+2k}. Since all of the points in V are perturbed, the number of vertices in the polygon conv(V) ∩ E is bounded w.h.p., as in Theorem 3, by O(k3σ−4) = Õ(log11(n)/λ14). Since the points 0, ˆA0 reside in the plane E, the number of vertices of (conv(V ∪ {0, ˆA0})) ∩ E is at most the number of vertices in conv(V) ∩ E plus 4, which is asymptotically the same. 
Each pivot step in the shadow vertex simplex method can be implemented to run in time O(nk) = Õ(n/λ14) for n constraints in dimension k. The dimension reduction step required Õ(nd) time. All other operations, including adding noise and shifting the polytope, are faster than the shadow vertex simplex procedure, leading to an overall running time of Õ(nd) (assuming λ is a constant or sub-polynomial in d).\n\nProof of Theorem 1. The statement regarding the margin of the solution, viewed as a point in Rd, is immediate from Theorem 4. To prove the claim regarding the view in the low dimensional space, consider Equation 4 in the above proof. Put in words, it states the following: consider the projection M of the algorithm (or alternatively its image V) and the classification problem of the points projected onto V. The margin of the solution produced by the algorithm (i.e., of ˜x) is at least λ − 3ε. The ℓ∞-norm of ˜x is clearly bounded by O(√(log(k)/k)). Hence, the margin of the normalized point ˜x/‖˜x‖2 is Ω(λ/√(log k)). In order to achieve a margin of λ − O(ε), one should replace the ℓ∞ bound in the LP with an approximate ℓ2 bound. This can be done via linear constraints by bounding the ℓ1 norm of Fx, where F : Rk → RK, K = O(k/ε2), and F has the property that for every x ∈ Rk, | ‖Fx‖1/‖x‖2 − 1 | < ε. A properly scaled matrix of i.i.d. Gaussians has this property [Ind00]. This step would eliminate the need for the extra √(log k) factor. The other multiplicative constants can be reduced to 1 + O(ε), thus ensuring the norm of ˜x is at most 1 + O(ε), by assigning a slightly smaller value to σ; specifically, σ/ε would do. 
Once the 2-norm of ˜x is bounded by 1 + O(ε), the margin of the normalized point is λ − O(ε).\n\n6 Discussion\n\nThe simplex algorithm for linear programming is a cornerstone of operations research whose computational complexity remains elusive despite decades of research. In this paper we examine the simplex algorithm through the lens of machine learning, and in particular via linear classification, which is equivalent to linear programming. We show that in the cases where the margin parameter is large, say a small constant, we can construct a simplex algorithm whose worst case complexity is (quasi) linear. Indeed, in many practical problems the margin parameter is a constant unrelated to the other parameters. For example, in cases where a constant inherent noise exists, the margin must be large; otherwise the problem is simply unsolvable.\n\n6.1 Soft margin SVM\n\nIn the setting of this paper, the case of soft margin SVM turns out to be algorithmically easier to solve than the separable case. In a nutshell, the main hardship in the separable case is that a large number of data points may be problematic. This is because the separating hyperplane must separate all of the points and not most of them, meaning that every one of the data points must be taken into consideration. A more formal statement is the following. In our setting we have three parameters: the number of points n, the dimension d, and the ‘sub-optimality’ ε. In the soft margin (e.g. hinge loss) case, the number of points may be reduced to poly(ε−1) by elementary methods. Specifically, it is an easy task to prove that if we omit all but a random subset of log(ε−1)/ε2 data points, the hinge loss corresponding to the obtained separator w.r.t. the full set of points will be O(ε). 
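The subsampling claim can be illustrated with a small hedged sketch: for a fixed separator, the hinge loss measured on a random subset of roughly log(1/ε)/ε2 points stays close to the hinge loss on the full set. All sizes, and the random separator, are illustrative assumptions rather than the paper's construction:\n
```python
import numpy as np

rng = np.random.default_rng(1)

# Full data set, labels already folded into the vectors as in format (1).
n, d = 10000, 20
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)

def hinge_loss(points, x):
    # Average hinge loss of the separator x over the given points.
    return np.maximum(0.0, 1.0 - points @ x).mean()

# Random subset of size ~ log(1/eps)/eps^2 for an illustrative eps.
eps = 0.2
m = int(np.log(1.0 / eps) / eps ** 2)
subset = rng.choice(n, size=m, replace=False)

# A fixed separator: its loss on the subset tracks its loss on the full set.
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
print(abs(hinge_loss(A, x) - hinge_loss(A[subset], x)))
```
\nThe full argument in the text is stronger: a separator trained on the subset also has O(ε) hinge loss on the full set; the sketch only shows the concentration effect that makes such a reduction plausible.\n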
In fact, it suffices to solve the problem with the reduced number of points, up to an additive loss of ε to the margin, to obtain the same result. As a consequence of the reduced number of points, the dimension can be reduced, analogously to the separable case, to d′ = O(log(ε−1)/ε2).\nThe above essentially states that the original problem can be reduced, by performing a single pass over the input (perhaps even less than that), to one where the only parameter is ε. From this point, the only challenge is to solve the resulting LP, up to an ε additive loss to the optimum, in time polynomial in its size. There are many methods available for this problem.\nTo conclude, the soft margin SVM problem is much easier than the separable case; hence we do not analyze it in this paper.\n\nReferences\n\n[BD02] A. Blum and J. Dunagan. Smoothed analysis of the perceptron algorithm for linear programming. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 905-914. Society for Industrial and Applied Mathematics, 2002.\n[CHW10] Kenneth L. Clarkson, Elad Hazan, and David P. Woodruff. Sublinear optimization for machine learning. In FOCS, pages 449-457. IEEE Computer Society, 2010.\n[CV95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. In Machine Learning, pages 273-297, 1995.\n[Dan51] G. B. Dantzig. Maximization of a linear function of variables subject to linear inequalities. Activity Analysis of Production and Allocation, pages 339-347, 1951.\n[DG03] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22:60-65, January 2003.\n[DV08] John Dunagan and Santosh Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. Math. Program., 114(1):101-114, 2008.\n[FGKP06] Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and Ashok Kumar Ponnuswami. New results for learning noisy parities and halfspaces. In FOCS, pages 563-574. IEEE Computer Society, 2006.\n[GS55] S. Gass and T. Saaty. The computational algorithm for the parametric objective function. Naval Research Logistics Quarterly, 2:39-45, 1955.\n[Ind00] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 189-197. IEEE, 2000.\n[JL84] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.\n[KS06] Jonathan A. Kelner and Daniel A. Spielman. A randomized polynomial-time simplex algorithm for linear programming. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, STOC '06, pages 51-60, New York, NY, USA, 2006. ACM.\n[Meg86] Nimrod Megiddo. Improved asymptotic analysis of the average number of steps performed by the self-dual simplex algorithm. Math. Program., 35:140-172, June 1986.\n[ST04] Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. J. ACM, 51:385-463, May 2004.\n[Ver09] Roman Vershynin. Beyond Hirsch conjecture: Walks on random polytopes and smoothed complexity of the simplex method. SIAM J. Comput., 39(2):646-678, 2009.\n", "award": [], "sourceid": 296, "authors": [{"given_name": "Elad", "family_name": "Hazan", "institution": null}, {"given_name": "Zohar", "family_name": "Karnin", "institution": null}]}