{"title": "Approximating Concavely Parameterized Optimization Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 2105, "page_last": 2113, "abstract": "We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy $\\varepsilon >0$ by a set of size $O(1/\\sqrt{\\varepsilon})$. A lower bound of size $\\Omega (1/\\sqrt{\\varepsilon})$ shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an approximate path of size $O(1/\\sqrt{\\varepsilon})$. Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion.", "full_text": "Approximating Concavely Parameterized Optimization Problems\n\nJoachim Giesen, Friedrich-Schiller-Universit\u00e4t Jena, Germany, joachim.giesen@uni-jena.de\nS\u00f6ren Laue, Friedrich-Schiller-Universit\u00e4t Jena, Germany, soeren.laue@uni-jena.de\nJens K. Mueller, Friedrich-Schiller-Universit\u00e4t Jena, Germany, jkm@informatik.uni-jena.de\nSascha Swiercy, Friedrich-Schiller-Universit\u00e4t Jena, Germany, sascha.swiercy@googlemail.com\n\nAbstract\n\nWe consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy \u03b5 > 0 by a set of size O(1/\u221a\u03b5). A lower bound of size \u2126(1/\u221a\u03b5) shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an approximate path of size O(1/\u221a\u03b5). 
Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion.\n\n1 Introduction\n\nProblem description. Let D be a set, I \u2286 R an interval, and f : I \u00d7 D \u2192 R such that\n\n(1) f(t, \u00b7) is bounded from below for every t \u2208 I, and\n(2) f(\u00b7, x) is concave for every x \u2208 D.\n\nWe study the parameterized optimization problem h(t) = min_{x \u2208 D} f(t, x). A solution x*_t \u2208 D is called optimal at parameter value t if f(t, x*_t) = h(t), and x \u2208 D is called an \u03b5-approximation at t if \u03b5(t, x) := f(t, x) \u2212 h(t) \u2264 \u03b5. Of course it holds that \u03b5(t, x*_t) = 0. A subset P \u2286 D is called an \u03b5-path if P contains an \u03b5-approximation for every t \u2208 I. The size of a smallest \u03b5-path is called the \u03b5-path complexity of the parameterized optimization problem. The aim of this paper is to derive upper and lower bounds on the path complexity, and to provide efficient algorithms to compute \u03b5-paths.\n\nMotivation. The rather abstract problem from above is motivated by regularized optimization problems that are abundant in machine learning, i.e., by problems of the form\n\nmin_{x \u2208 D} f(t, x) := r(x) + t \u00b7 l(x),\n\nwhere r(x) is a regularization term and l(x) a loss term. The parameter t controls the trade-off between regularization and loss. Note that here f(\u00b7, x) is always linear and hence concave in the parameter t.\n\nPrevious work. Due to the widespread use of regularized optimization methods in machine learning, regularization path following algorithms have become an active area of research. Initially, exact path tracking methods were developed for many machine learning problems [16, 18, 3, 9], starting with the algorithm for SVMs by Hastie et al. [10]. 
Exact tracking algorithms tend to be slow and numerically unstable as they need to invert large matrices. Also, the exact regularization path can be exponentially large in the input size [5, 14]. Approximation algorithms can overcome these problems [4]. Approximation path algorithms with approximation guarantees have been developed for SVMs with square loss [6], the LASSO [14], and matrix completion and factorization problems [8, 7].\n\nContributions. We provide a structural upper bound in O(1/\u221a\u03b5) for the \u03b5-path complexity of the abstract problem class described above. We show that this bound is tight up to a multiplicative constant by constructing a lower bound in \u2126(1/\u221a\u03b5). We also devise a generic algorithm to compute \u03b5-paths that calls a problem-specific oracle providing a step-size certificate. If such a certificate exists, then the algorithm computes a path of complexity in O(1/\u221a\u03b5). Finally, we demonstrate the implementation of the oracle for standard SVMs and a matrix completion problem, resulting in the first algorithms for both problems that compute \u03b5-paths of complexity in O(1/\u221a\u03b5). Previously, no approximation path algorithms were known for standard SVMs, but only a heuristic [12] and an approximation algorithm for square loss SVMs [6] with complexity in O(1/\u03b5). The best approximation path algorithm for matrix completion also has complexity in O(1/\u03b5). To our knowledge, the only known approximation path algorithm with complexity in O(1/\u221a\u03b5) is [14] for the LASSO.\n\n2 Upper Bound\n\nHere we show that any problem that fits the problem definition from the introduction for a compact interval I = [a, b] has an \u03b5-path with complexity in O(1/\u221a\u03b5). Let (a, b) be the interior of [a, b] and let g : (a, b) \u2192 R be concave; then g is continuous and has a left and a right derivative g'_-(t) and g'_+(t), respectively, at every point t \u2208 I (see for example [15]). Note that f(\u00b7, x) is concave by assumption and h is concave as the minimum over a family of concave functions.\n\nLemma 1. For all t \u2208 (a, b), h'_-(t) \u2265 f'_-(t, x*_t) \u2265 f'_+(t, x*_t) \u2265 h'_+(t).\n\nProof. For all t' < t it holds that h(t') \u2264 f(t', x*_t) and hence h(t) \u2212 h(t') \u2265 f(t, x*_t) \u2212 f(t', x*_t), which implies\n\nh'_-(t) := lim_{t'\u2191t} (h(t) \u2212 h(t'))/(t \u2212 t') \u2265 lim_{t'\u2191t} (f(t, x*_t) \u2212 f(t', x*_t))/(t \u2212 t') =: f'_-(t, x*_t).\n\nThe inequality f'_+(t, x*_t) \u2265 h'_+(t) follows analogously, and f'_-(t, x*_t) \u2265 f'_+(t, x*_t) follows after some algebra from the concavity of f(\u00b7, x*_t) and the definition of the derivatives (see [15]).\n\nDefinition 2. Let I = [a, b] be a compact interval, \u03b5 > 0, and t_0 = a. Let\n\nT_k = {t \u2208 (t_{k-1}, b] such that \u03b5(t, x*_{t_{k-1}}) := f(t, x*_{t_{k-1}}) \u2212 h(t) = \u03b5},\n\nand t_k = min T_k for all integral k > 0 such that T_k \u2260 \u2205. Finally, let\n\nP* = {x*_{t_k} | k \u2208 N such that T_k \u2260 \u2205}.\n\nLemma 3. Let s_1, ..., s_n \u2208 R_{>0}, then (s_1 + ... + s_n)(s_1^{-1} + ... + s_n^{-1}) \u2265 n^2.\n\nProof. The claim holds for n = 1 as s_1 s_1^{-1} = 1 = 1^2. Assume the claim holds for n \u2212 1, and let a = s_1 + ... + s_{n-1} and b = s_1^{-1} + ... + s_{n-1}^{-1}. The rectangle with side lengths a s_n^{-1} and b s_n has circumference 2(a s_n^{-1} + b s_n) and area a s_n^{-1} \u00b7 b s_n = ab. Since the square minimizes the circumference for a given area we have 2(a s_n^{-1} + b s_n) \u2265 4\u221a(ab). The claim for n now follows from\n\n(a + s_n)(b + s_n^{-1}) = ab + a s_n^{-1} + b s_n + 1 \u2265 ab + 2\u221a(ab) + 1 = (\u221a(ab) + 1)^2 \u2265 ((n \u2212 1) + 1)^2 = n^2.\n\nLemma 4. The size of P* is at most \u221a(((b \u2212 a)(h'_-(a) \u2212 h'_-(b)))/\u03b5) \u2208 O(1/\u221a\u03b5).\n\nProof. Let a = t_0 \u2264 t_1 \u2264 ... be the sequence from Definition 2. Define \u03b4_k = t_{k+1} \u2212 t_k and \u2206_k = h'_-(t_k) \u2212 h'_-(t_{k+1}). We have\n\n\u2206_k \u03b4_k \u2265 (f'_-(t_k, x*_{t_k}) \u2212 h'_-(t_{k+1}))(t_{k+1} \u2212 t_k)\n\u2265 ((f(t_{k+1}, x*_{t_k}) \u2212 f(t_k, x*_{t_k}))/(t_{k+1} \u2212 t_k) \u2212 (h(t_{k+1}) \u2212 h(t_k))/(t_{k+1} \u2212 t_k)) (t_{k+1} \u2212 t_k)\n= f(t_{k+1}, x*_{t_k}) \u2212 h(t_{k+1}) = \u03b5(t_{k+1}, x*_{t_k}) = \u03b5,\n\nwhere the first inequality follows from Lemma 1 and the second inequality follows from concavity and the definition of derivatives (see [15]). Thus, there exists s_k > 0 such that \u03b4_k \u2265 \u03b5 s_k and \u2206_k \u2265 s_k^{-1}. It follows from Lemma 3 that\n\n\u03b5 n^2 \u2264 \u03b5 (s_1 + ... + s_n)(s_1^{-1} + ... + s_n^{-1}) \u2264 (\u03b4_1 + ... + \u03b4_n)(\u2206_1 + ... + \u2206_n) \u2264 (b \u2212 a)(\u2206_1 + ... + \u2206_n) \u2264 (b \u2212 a)(h'_-(a) \u2212 h'_-(b)),\n\nwhere the last inequality follows from h'_-(b) \u2264 h'_-(t) for t \u2264 b (which can be proved from concavity, see again [15]). Hence, the sequence (t_k) and thus the size of P* must be finite; more specifically, n is bounded by \u221a(((b \u2212 a)(h'_-(a) \u2212 h'_-(b)))/\u03b5).\n\nTheorem 5. P* is an \u03b5-path for I = [a, b].\n\nProof. For any x \u2208 D, \u03b5(\u00b7, x) is a continuous function. Hence, x*_{t_k} is an \u03b5-approximation for all t \u2208 [t_k, t_{k+1}], because if there were t \u2208 (t_k, t_{k+1}] with \u03b5(t, x*_{t_k}) > \u03b5, then by continuity there would also be t' \u2208 (t_k, t_{k+1}) with \u03b5(t', x*_{t_k}) = \u03b5, which contradicts the minimality of t_{k+1}. The claim of the theorem follows since the proof of Lemma 4 shows that the sequence (t_k) is finite and hence the intervals [t_k, t_{k+1}] cover the whole interval [a, b].\n\n3 Lower Bound\n\nHere we show that there exists a problem that fits the problem description from the introduction whose \u03b5-path complexity is in \u2126(1/\u221a\u03b5). This shows that the upper bound from the previous section is tight up to a constant. Let I = [a, b], D = R, and f(t, x) = (1/2) x^2 \u2212 t x, and thus\n\nh(t) = min_{x \u2208 R} ((1/2) x^2 \u2212 t x) = (1/2) (x*_t)^2 \u2212 t x*_t = \u2212(1/2) t^2,\n\nwhere the last equality follows from the convexity and differentiability of f(t, x) in x, which together imply \u2202f/\u2202x (t, x*_t) = x*_t \u2212 t = 0, i.e., x*_t = t. For \u03b5 > 0 and x \u2208 R let I_x = {t \u2208 [a, b] | \u03b5(t, x) := (1/2) x^2 \u2212 t x + (1/2) t^2 \u2264 \u03b5}, which is an interval since (1/2) x^2 \u2212 t x + (1/2) t^2 is a quadratic function in t. The length of this interval is 2\u221a(2\u03b5), independent of x. Hence, the \u03b5-path complexity for the problem is at least (b \u2212 a)/(2\u221a(2\u03b5)). Let us compare this lower bound with the upper bound from the previous section, which gives for the specific problem at hand \u221a(((b \u2212 a)(h'_-(a) \u2212 h'_-(b)))/\u03b5) = \u221a((b \u2212 a)^2/\u03b5) = (b \u2212 a)/\u221a\u03b5. Hence the upper bound is tight up to a constant of at most 2\u221a2.\n\n4 Generic Algorithm\n\nSo far we have only discussed structural complexity bounds for \u03b5-paths. Now we give a generic algorithm to compute an \u03b5-path of complexity in O(1/\u221a\u03b5). When applying the generic algorithm to a specific problem, a plugin subroutine PATHPOLYNOMIAL needs to be implemented for the specific problem. The generic algorithm builds on the simple idea, introduced in [6], of computing an (\u03b5/\u03b3)-approximation (for \u03b3 > 1) and only updating this approximation along the parameter interval I = [a, b] when it fails to be an \u03b5-approximation. The plugin subroutine PATHPOLYNOMIAL provides a bound on the step-size for the algorithm, i.e., a certificate for how long the approximation stays valid along the interval I. Hence we describe the idea behind the construction of this certificate first.\n\n4.1 Step-size certificate and algorithm\n\nWe always consider a problem that fits the problem description from the introduction.\n\nDefinition 6. Let P be the set of all concave polynomials p : I \u2192 R of degree at most 2. 
For t \u2208 I, x \u2208 D and \u03b5 > 0 let\n\nP_t(x, \u03b5) := {p \u2208 P | p \u2264 h, f(t, x) \u2212 p(t) \u2264 \u03b5},\n\nwhere p \u2264 h means p(t') \u2264 h(t') for all t' \u2208 I. Note that P contains constant and linear polynomials with second derivative p'' = 0 and quadratic polynomials with constant second derivative p'' < 0. If P_t(x, \u03b5) \u2260 \u2205, then x is an \u03b5-approximation at parameter value t, because there exists p \u2208 P such that \u03b5(t, x) \u2264 f(t, x) \u2212 p(t) \u2264 \u03b5.\n\nDefinition 7 (Step-size). For t \u2208 I = [a, b], p \u2208 P, \u03b5 > 0, and \u03b3 > 1, let \u03b4_t := t \u2212 a and\n\n\u03c1_t(p, \u03b5) = \u03b5/(\u03b3 \u03b4_t^2 |p''|), if p'' < 0 and \u03b4_t > 0.\n\nThe step-size is given as\n\n\u2206_t(p, \u03b5) = \u2206_t^(1)(p) if p'' = 0; \u2206_t^(2)(p, \u03b5) if p'' < 0 and \u03c1_t(p, \u03b5) \u2265 1/2; \u2206_t^(3)(p, \u03b5) if p'' < 0 and \u03c1_t(p, \u03b5) \u2264 1/2,\n\nwhere\n\n\u2206_t^(1)(p) = \u03b4_t (\u03b3 \u2212 1),\n\u2206_t^(2)(p, \u03b5) = \u221a(2\u03b5/|p''| + \u03b4_t^2 (\u03c1_t(p, \u03b5) \u2212 1/2)^2) \u2212 \u03b4_t (\u03c1_t(p, \u03b5) + 1/2),\n\u2206_t^(3)(p, \u03b5) = \u221a(2\u03b5/|p''|) (1 \u2212 1/\u221a\u03b3).\n\nTo simplify the notation we will skip the argument \u03b5 of the step-size \u2206_t whenever the value of \u03b5 is obvious from the context.\n\nObservation 8. If \u03c1_t(p, \u03b5) = 1/2, then \u2206_t^(2)(p) = \u2206_t^(3)(p), because \u03c1_t(p, \u03b5) = 1/2 implies \u03b4_t = \u221a(2\u03b5/(\u03b3 |p''|)).\n\nLemma 9. For t \u2208 (a, b), x \u2208 D, \u03b5 > 0 and \u03b3 > 1: if there exists p \u2208 P_t(x, \u03b5/\u03b3), then x is an \u03b5-approximation for all t' \u2208 [t, b] with t' \u2264 t + \u2206_t(p).\n\nProof. Let g : [a, b] \u2192 R be the following linear function,\n\ng(t') = (t' \u2212 t) (p(t) + \u03b5/\u03b3 \u2212 p(a))/(t \u2212 a) + p(t) + \u03b5/\u03b3.\n\nThen, for all t' \u2208 [t, b],\n\nf(t', x) \u2264 (t' \u2212 t) (f(t, x) \u2212 f(a, x))/(t \u2212 a) + f(t, x) \u2264 (t' \u2212 t) (p(t) + \u03b5/\u03b3 \u2212 p(a))/(t \u2212 a) + p(t) + \u03b5/\u03b3 = g(t'),\n\nwhere the first inequality follows from the concavity of f(\u00b7, x), and the second inequality follows from f(t, x) \u2212 p(t) \u2264 \u03b5/\u03b3 and from p(a) \u2264 h(a) \u2264 f(a, x). Thus, x is an \u03b5-approximation for all t' \u2208 [t, b] that satisfy g(t') \u2212 p(t') \u2264 \u03b5 because\n\n\u03b5(t', x) = f(t', x) \u2212 h(t') \u2264 f(t', x) \u2212 p(t') \u2264 g(t') \u2212 p(t') \u2264 \u03b5.\n\nWe finish the proof by considering three cases. (i) If p'' = 0, then g(t') \u2212 p(t') is a linear function in t', and g(t') \u2212 p(t') \u2264 \u03b5 solves to t' \u2212 t \u2264 \u03b4_t(\u03b3 \u2212 1) = \u2206_t^(1)(p) = \u2206_t(p). (ii) If p'' < 0, then g(t') \u2212 p(t') is a quadratic polynomial in t' with second derivative \u2212p'' > 0, and the inequality g(t') \u2212 p(t') \u2264 \u03b5 solves to t' \u2212 t \u2264 \u2206_t^(2)(p). Note that we do not need the condition \u03c1_t(p) \u2265 1/2 here. (iii) The case p'' < 0 and \u03c1_t(p) \u2264 1/2 can be reduced to Case (ii). From \u03c1_t(p) \u2264 1/2 we obtain t \u2212 a = \u03b4_t \u2265 \u221a(2\u03b5/(\u03b3 |p''|)) and thus a \u2264 t \u2212 \u221a(2\u03b5/(\u03b3 |p''|)) =: \u02c6a. Let \u02c6p be the restriction of p onto the interval [\u02c6a, b] and \u02c6\u03b4_t = t \u2212 \u02c6a; then \u02c6p'' = p'', and thus \u03c1_t(\u02c6p) = \u03b5/(\u03b3 \u02c6\u03b4_t^2 |\u02c6p''|) = 1/2. Hence by Observation 8, \u2206_t^(3)(p) = \u2206_t^(3)(\u02c6p) = \u2206_t^(2)(\u02c6p). The claim follows from Case (ii).\n\nAssume now that we have an oracle PATHPOLYNOMIAL available that on input t \u2208 (a, b) and \u03b5/\u03b3 > 0 returns x \u2208 D and p \u2208 P_t(x, \u03b5/\u03b3); then the following algorithm GENERICPATH returns an \u03b5-path if it terminates.\n\nAlgorithm 1 GENERICPATH\nInput: f : [a, b] \u00d7 D \u2192 R that fits the problem description, and \u03b5 > 0\nOutput: \u03b5-path for the interval [a, b]\n  choose \u02c6t \u2208 (a, b)\n  P := COMPUTEPATH(f, \u02c6t, \u03b5)\n  define \u02c6f : [a, b] \u00d7 D \u2192 R, (t, x) \u21a6 f(a + b \u2212 t, x) [then \u02c6f also fits the problem description]\n  P := P \u222a COMPUTEPATH(\u02c6f, a + b \u2212 \u02c6t, \u03b5)\n  return P\n\nAlgorithm 2 COMPUTEPATH\nInput: f : [a, b] \u00d7 D \u2192 R that fits the problem description, \u02c6t \u2208 (a, b) and \u03b5 > 0\nOutput: \u03b5-path for the interval [\u02c6t, b]\n  t := \u02c6t and P := \u2205\n  while t \u2264 b do\n    (x, p) := PATHPOLYNOMIAL(t, \u03b5/\u03b3)\n    P := P \u222a {x}\n    t := min{b, t + \u2206_t(p)}\n  end while\n  return P\n\n4.2 Analysis of the generic algorithm\n\nThe running time of the algorithm GENERICPATH is essentially determined by the complexity of the computed path times the cost of the oracle PATHPOLYNOMIAL. 
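To make the generic scheme concrete, here is a minimal Python sketch of COMPUTEPATH (Algorithm 2) together with a toy oracle for the lower-bound example f(t, x) = x^2/2 \u2212 t x from Section 3, where p = h (so |p''| = 1) is an exact certificate. The sketch is ours, not from the paper; `toy_oracle`, the `(x, |p''|)` return convention, and the default \u03b3 = 2 are assumptions for illustration.

```python
import math

def compute_path(oracle, a, b, t_hat, eps, gamma=2.0):
    """Sketch of COMPUTEPATH (Algorithm 2): starting at t_hat > a, repeatedly query
    the oracle for an (eps/gamma)-certificate and advance by the step size of
    Definition 7. The oracle returns (x, p2) with p2 = |p''| of the certificate."""
    t, path = t_hat, []
    while True:
        x, p2 = oracle(t, eps / gamma)
        path.append(x)
        if t >= b:
            return path
        dt = t - a                                  # delta_t of Definition 7
        if p2 == 0.0:
            step = dt * (gamma - 1.0)               # Delta^(1)
        else:
            rho = eps / (gamma * dt * dt * p2)      # rho_t(p, eps)
            if rho >= 0.5:                          # Delta^(2)
                step = math.sqrt(2.0 * eps / p2 + (dt * (rho - 0.5)) ** 2) \
                       - dt * (rho + 0.5)
            else:                                   # Delta^(3)
                step = math.sqrt(2.0 * eps / p2) * (1.0 - 1.0 / math.sqrt(gamma))
        t = min(b, t + step)

def toy_oracle(t, _eps_over_gamma):
    # Exact oracle for f(t, x) = x**2 / 2 - t * x: the minimizer is x*_t = t,
    # and p = h (h(t') = -t'**2 / 2, hence |p''| = 1) certifies it exactly.
    return t, 1.0
```

Running `compute_path(toy_oracle, a=0.0, b=2.0, t_hat=0.1, eps=0.01)` covers [0.1, 2.0]; halving \u03b5 repeatedly grows the path size roughly like 1/\u221a\u03b5, in line with Theorem 14.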
In the following we show that the complexity of the computed path is at most O(1/\u221a\u03b5).\n\nObservation 10. For c \u2208 R let \u03c6_c : R_{>\u221a|c|} \u2192 R, x \u21a6 \u221a(x^2 + c) \u2212 x. Then we have\n\n1. lim_{x\u2192\u221e} \u03c6_c(x) = 0, and\n2. \u03c6_c'(x) = x/\u221a(x^2 + c) \u2212 1 for the derivative of \u03c6_c. Thus, \u03c6_c'(x) > 0 for c < 0 and \u03c6_c is monotonously increasing.\n\nFurthermore,\n\n\u2206_t^(2)(p) = \u221a(2\u03b5/|p''| + \u03b4_t^2 (\u03c1_t(p) \u2212 1/2)^2) \u2212 \u03b4_t (\u03c1_t(p) + 1/2)\n= \u221a(\u03b4_t^2 (\u03c1_t(p) + \u03b3 \u2212 1/2)^2 + \u03b4_t^2 \u03b3(1 \u2212 \u03b3)) \u2212 \u03b4_t (\u03c1_t(p) + \u03b3 \u2212 1/2) + \u03b4_t (\u03b3 \u2212 1)\n= \u03c6_{\u03b4_t^2 \u03b3(1\u2212\u03b3)}(\u03b4_t (\u03c1_t(p) + \u03b3 \u2212 1/2)) + \u03b4_t (\u03b3 \u2212 1).\n\nLemma 11. Given t \u2208 I and p \u2208 P, then \u2206_t(p) is continuous in |p''|.\n\nProof. The continuity for |p''| > 0 follows from the definitions of \u2206_t^(2)(p) and \u2206_t^(3)(p), and from Observation 8. Since \u03c1_t(p) > 1/2 for small |p''|, the continuity at |p''| = 0 follows from Observation 10, because\n\nlim_{|p''|\u21920} \u2206_t^(2)(p) = lim_{|p''|\u21920} \u03c6_{\u03b4_t^2 \u03b3(1\u2212\u03b3)}(\u03b4_t (\u03c1_t(p) + \u03b3 \u2212 1/2)) + \u03b4_t (\u03b3 \u2212 1) = \u03b4_t (\u03b3 \u2212 1) = \u2206_t^(1)(p),\n\nwhere we have used \u03c1_t(p) \u2192 \u221e as |p''| \u2192 0.\n\nLemma 12. Given t \u2208 I and p_1, p_2 \u2208 P, then \u2206_t(p_1) \u2265 \u2206_t(p_2) if |p_1''| \u2264 |p_2''|.\n\nProof. The claim is that \u2206_t(p) is monotonously decreasing in |p''|. Since \u2206_t is continuous in |p''| by Lemma 11, it is enough to check the monotonicity of \u2206_t^(1)(p), \u2206_t^(2)(p), and \u2206_t^(3)(p). The monotonicity of \u2206_t^(1)(p) and \u2206_t^(3)(p) follows directly from the definitions of the latter. The monotonicity of \u2206_t^(2)(p) follows from Observation 10 since we have\n\n\u2206_t^(2)(p) = \u03c6_{\u03b4_t^2 \u03b3(1\u2212\u03b3)}(\u03b4_t (\u03c1_t(p) + \u03b3 \u2212 1/2)) + \u03b4_t (\u03b3 \u2212 1),\n\nand thus \u2206_t^(2)(p) is monotonously decreasing in |p''| because \u03b4_t^2 \u03b3(1 \u2212 \u03b3) < 0 and \u03c1_t(p) is monotonously decreasing in |p''|.\n\nLemma 13. Given t \u2208 I and p \u2208 P, then \u2206_t(p) is monotonously increasing in \u03b4_t and hence in t.\n\nProof. Since \u2206_t(p) is continuous in \u03b4_t by Observation 8, it is enough to check the monotonicity of \u2206_t^(1)(p), \u2206_t^(2)(p), and \u2206_t^(3)(p). The monotonicity of \u2206_t^(1)(p) and \u2206_t^(3)(p) follows directly from the definitions of the latter. It remains to show the monotonicity of \u2206_t^(2)(p) for \u03c1_t(p) \u2265 1/2. For c \u2265 0 let \u03c6_c^{-1} : R_{>0} \u2192 R, y \u21a6 (1/2)(c/y \u2212 y). The notation is justified because for \u03c6_c^{-1}(y) > 0 we have \u03c6_c(\u03c6_c^{-1}(y)) = y. Apparently, \u03c6_c^{-1} is monotonously decreasing, and we have\n\n\u2206_t^(2)(p) = \u03c6_{c_1}(\u03c6_{c_2}^{-1}(\u03b4_t)) \u2212 \u03b4_t = \u03c6_{c_1}(\u03c6_{c_2}^{-1}(\u03b4_t)) \u2212 \u03c6_{c_2}(\u03c6_{c_2}^{-1}(\u03b4_t)),\n\nwith c_1 = 2\u03b5/|p''| and c_2 = c_1/\u03b3. Note that \u03c6_{c_2}^{-1}(\u03b4_t) > 0 since \u03c1_t(p) \u2265 1/2, and c_2 < c_1 since \u03b3 > 1. Because \u03c6_{c_1}' \u2212 \u03c6_{c_2}' < 0 for c_1 > c_2, both \u03c6_{c_2}^{-1} and \u03c6_{c_1} \u2212 \u03c6_{c_2} are monotonously decreasing in their respective arguments. Hence, \u2206_t^(2)(p) is monotonously increasing in \u03b4_t.\n\nTheorem 14. If there exist p \u2208 P and \u02c6\u03b5 > 0 such that |q''| \u2264 |p''| for all q that are returned by the oracle PATHPOLYNOMIAL on input t \u2208 [a, b] and \u03b5 \u2264 \u02c6\u03b5, then Algorithm 1 terminates after at most O(1/\u221a\u03b5) steps, and thus returns an \u03b5-path of complexity in O(1/\u221a\u03b5).\n\nProof. For all t \u2208 [\u02c6t, b], where \u02c6t \u2208 (a, b) is chosen in algorithm GENERICPATH, we have \u2206_t(q) \u2265 \u2206_t(p) \u2265 \u2206_{\u02c6t}(p). Here the first inequality is due to Lemma 12 and the second inequality is due to Lemma 13. Hence, the number of steps in the first call of COMPUTEPATH is upper bounded by (b \u2212 \u02c6t)/min{\u2206_{\u02c6t}(p), b \u2212 \u02c6t} + 1. Similarly, the number of steps in the second call of COMPUTEPATH is upper bounded by (\u02c6t \u2212 a)/min{\u2206_{a+b\u2212\u02c6t}(p), \u02c6t \u2212 a} + 1. For the asymptotic behavior, observe that \u2206_{\u02c6t}(p) = \u2206_{\u02c6t}^(1)(p) does not depend on \u03b5 for p'' = 0. For |p''| > 0 observe that lim_{\u03b5\u21920} \u03c1_{\u02c6t}(p, \u03b5) = 0. Hence, there exists \u02c6\u03b5 > 0 such that \u03c1_{\u02c6t}(p, \u03b5) < 1/2 and \u2206_{\u02c6t}^(3)(p, \u03b5) \u2264 b \u2212 \u02c6t for all \u03b5 < \u02c6\u03b5, and thus\n\n(b \u2212 \u02c6t)/min{\u2206_{\u02c6t}(p), b \u2212 \u02c6t} + 1 = (b \u2212 \u02c6t)/\u2206_{\u02c6t}^(3)(p) + 1 = \u221a(|p''|/(2\u03b5)) \u00b7 (\u221a\u03b3/(\u221a\u03b3 \u2212 1)) \u00b7 (b \u2212 \u02c6t) + 1 \u2208 O(1/\u221a\u03b5).\n\nAnalogously, (\u02c6t \u2212 a)/min{\u2206_{a+b\u2212\u02c6t}(p), \u02c6t \u2212 a} + 1 \u2208 O(1/\u221a\u03b5), which completes the proof.\n\n5 Applications\n\nHere we demonstrate on two examples that Lagrange duality can be a tool for implementing the oracle PATHPOLYNOMIAL in the generic path algorithm. This approach obtains the step-size certificate from an approximate solution that has to be computed anyway.\n\n5.1 Support vector machines\n\nGiven data points x_i \u2208 R^d together with labels y_i \u2208 {\u00b11} for i = 1, ..., n, a support vector machine (SVM) is the following parameterized optimization problem,\n\nmin_{w \u2208 R^d, b \u2208 R} ((1/2)||w||^2 + t \u00b7 \u03a3_{i=1}^n max{0, 1 \u2212 y_i(w^T x_i + b)}) =: f(t, w),\n\nparameterized in the regularization parameter t \u2208 [0, \u221e). The Lagrangian dual of the SVM is given as\n\nmax_{\u03b1 \u2208 R^n} (\u2212(1/2) \u03b1^T K \u03b1 + 1^T \u03b1) =: d(\u03b1) s.t. 0 \u2264 \u03b1_i \u2264 t, y^T \u03b1 = 0,\n\nwhere K = A^T A, A = (y_1 x_1, ..., y_n x_n) \u2208 R^{d\u00d7n} and y = (y_1, ..., y_n) \u2208 R^n.\n\nAlgorithm 3 PATHPOLYNOMIALSVM\nInput: t \u2208 (0, \u221e) and \u03b5 > 0\nOutput: w \u2208 R^d and p \u2208 P_t(w, \u03b5)\n  compute a primal solution w \u2208 R^d and a dual solution \u03b1 \u2208 R^n such that f(t, w) \u2212 d(\u03b1) < \u03b5\n  define p : I \u2192 R, t' \u21a6 d(\u03b1 t'/t)\n  return (w, p)\n\nLemma 15. 
Let (w, p) be the output of PATHPOLYNOMIALSVM on input t > 0 and \u03b5 > 0, then p \u2208 P_t(w, \u03b5) and |p''| \u2264 max_{0 \u2264 \u02c6\u03b1 \u2264 1} \u02c6\u03b1^T K \u02c6\u03b1. [Hence, Theorem 14 applies here.]\n\nProof. Let \u03b1 be the dual solution computed by PATHPOLYNOMIALSVM and p be the polynomial defined in PATHPOLYNOMIALSVM. Then,\n\np(t') = \u2212(t'^2/t^2) (1/2) \u03b1^T K \u03b1 + (t'/t) 1^T \u03b1, and thus p''(t') = \u2212(1/t^2) \u03b1^T K \u03b1 \u2264 0,\n\nsince K is positive semidefinite. Hence, p \u2208 P. For p \u2208 P_t(w, \u03b5), it remains to show that p \u2264 h = min_{w \u2208 R^d} f(\u00b7, w) and f(t, w) \u2212 p(t) \u2264 \u03b5. The latter follows immediately from p(t) = d(\u03b1). For t' > 0 let \u03b1' = \u03b1 t'/t; then \u03b1' is feasible for the dual SVM at parameter value t' since \u03b1 is feasible for the dual SVM at t. It follows that p(t') = d(\u03b1') \u2264 h(t'). Finally, observe that \u03b1_i \u2264 t implies |p''| = (1/t^2) \u03b1^T K \u03b1 \u2264 max_{0 \u2264 \u02c6\u03b1 \u2264 1} \u02c6\u03b1^T K \u02c6\u03b1.\n\nThe same results hold when using any positive kernel K. In the kernel case one has the following primal SVM (see [2]),\n\nmin_{\u03b2 \u2208 R^n, b} ((1/2) \u03b2^T K \u03b2 + t \u00b7 \u03a3_{i=1}^n max{0, 1 \u2212 y_i(\u03a3_{j=1}^n \u03b2_j y_j K_ij + b)}) =: f(t, \u03b2).\n\nWe have implemented the algorithm GENERICPATH for SVMs in Matlab using LIBSVM [1] as the SVM solver. To assess the practicability of the proposed algorithm we ran it on several datasets taken from the LIBSVM website. For each dataset we have measured the size of the computed \u03b5-path (number of nodes) for t \u2208 [0.1, 10] and \u03b5 \u2208 {2^{-i} | i = 2, ..., 10}. Figure 5.1 shows the size of the paths as a function of \u03b5 using doubly logarithmic plots. A straight-line plot with slope \u22121/2 corresponds to an empirical path complexity that follows the function 1/\u221a\u03b5.\n\n[Figure 5.1: (a) Path complexity for a linear SVM. (b) Path complexity for an SVM with Gaussian kernel exp(\u2212\u03b3||u \u2212 v||_2^2) for \u03b3 = 0.5. Datasets: a1a, duke, fourclass (scaled), mushrooms, w1a; reference line 1/\u221a\u03b5.]\n\n5.2 Matrix completion\n\nMatrix completion asks for a completion X of an (n \u00d7 m)-matrix Y that has been observed only at the indices in \u2126 \u2282 {1, ..., n} \u00d7 {1, ..., m}. The problem can be solved by the following convex semidefinite optimization approach, see [17, 11, 13],\n\nmin_{X \u2208 R^{n\u00d7m}, A \u2208 R^{n\u00d7n}, B \u2208 R^{m\u00d7m}} ((1/2) \u03a3_{(i,j)\u2208\u2126} (X_ij \u2212 Y_ij)^2 + t \u00b7 (1/2)(tr(A) + tr(B))) s.t. (A X; X^T B) \u2ab0 0.\n\nThe Lagrangian dual of this convex semidefinite program is given as\n\nmax_{\u039b \u2208 R^{n\u00d7m}} \u2212\u03a3_{(i,j)\u2208\u2126} ((1/2) \u039b_ij^2 + \u039b_ij Y_ij) s.t. (tI \u039b; \u039b^T tI) \u2ab0 0, and \u039b_ij = 0 if (i, j) \u2209 \u2126.\n\nLet f(t, \u02c6X) for \u02c6X = (X, A, B) be the primal objective function at parameter value t, and d(\u039b) be the dual objective function. Analogously to the SVM case we have the following:\n\nAlgorithm 4 PATHPOLYNOMIALMATRIXCOMPLETION\nInput: t \u2208 (0, \u221e) and \u03b5 > 0\nOutput: \u02c6X and p \u2208 P_t(\u02c6X, \u03b5)\n  compute a primal solution \u02c6X and a dual solution \u039b \u2208 R^{n\u00d7m} such that f(t, \u02c6X) \u2212 d(\u039b) < \u03b5\n  define p : I \u2192 R, t' \u21a6 d(t'/t \u039b)\n  return (\u02c6X, p)\n\nLemma 16. 
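For concreteness, the certificate polynomial that each oracle call of Algorithm 3 produces can be written out directly: p(t') = \u2212(t'^2/t^2)(1/2)\u03b1^T K \u03b1 + (t'/t) 1^T \u03b1, with |p''| = \u03b1^T K \u03b1 / t^2. The plain-Python sketch below is ours (function name and return convention are assumptions; plain lists stand in for vectors and matrices).

```python
def svm_certificate(alpha, K, t):
    """Certificate polynomial from Algorithm 3 (PATHPOLYNOMIALSVM), given a feasible
    dual solution alpha at parameter value t and kernel matrix K:
    p(t') = -(t'^2 / t^2) * (1/2) * alpha^T K alpha + (t'/t) * 1^T alpha.
    Returns the polynomial as a callable together with |p''| = alpha^T K alpha / t^2."""
    n = len(alpha)
    aKa = sum(alpha[i] * K[i][j] * alpha[j] for i in range(n) for j in range(n))
    ones_a = sum(alpha)
    p = lambda tp: -(tp * tp) / (t * t) * 0.5 * aKa + (tp / t) * ones_a
    return p, aKa / (t * t)  # |p''| >= 0 since K is positive semidefinite
```

By construction p(t) equals the dual objective d(\u03b1), and p(t') = d(\u03b1 t'/t) \u2264 h(t') for every t' > 0 (Lemma 15), so (p, |p''|) can feed the step-size rule of Definition 7.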
Let (\u02c6X, p) be the output of PATHPOLYNOMIALMATRIXCOMPLETION on input t > 0 and \u03b5 > 0, then p \u2208 P_t(\u02c6X, \u03b5) and |p''| \u2264 max_{\u02c6\u039b \u2208 F_1} ||\u02c6\u039b||_F^2, where\n\nF_t = {\u039b \u2208 R^{n\u00d7m} | (tI \u039b; \u039b^T tI) \u2ab0 0, \u039b_ij = 0 \u2200(i, j) \u2209 \u2126}.\n\nThe proof for Lemma 16 is similar to the proof of Lemma 15, and Lemma 16 shows that Theorem 14 can be applied here.\n\nAcknowledgments. This work has been supported by a grant of the Deutsche Forschungsgemeinschaft (GI-711/3-2).\n\nReferences\n\n[1] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1\u201327:27, 2011.\n[2] Olivier Chapelle. Training a support vector machine in the primal. Neural Computation, 19(5):1155\u20131178, 2007.\n[3] Alexandre d\u2019Aspremont, Francis R. Bach, and Laurent El Ghaoui. Full regularization path for sparse principal component analysis. In Proceedings of the International Conference on Machine Learning (ICML), pages 177\u2013184, 2007.\n[4] Jerome Friedman, Trevor Hastie, Holger H\u00f6fling, and Robert Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302\u2013332, 2007.\n[5] Bernd G\u00e4rtner, Martin Jaggi, and Clement Maria. An exponential lower bound on the complexity of regularization paths. arXiv:0903.4817, 2010.\n[6] Joachim Giesen, Martin Jaggi, and S\u00f6ren Laue. Approximating parameterized convex optimization problems. In Proceedings of the Annual European Symposium on Algorithms (ESA), pages 524\u2013535, 2010.\n[7] Joachim Giesen, Martin Jaggi, and S\u00f6ren Laue. Optimizing over the growing spectrahedron. In Proceedings of the Annual European Symposium on Algorithms (ESA), pages 503\u2013514, 2012.\n[8] Joachim Giesen, Martin Jaggi, and S\u00f6ren Laue. Regularization paths with guarantees for convex semidefinite optimization. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 432\u2013439, 2012.\n[9] Bin Gu, Jian-Dong Wang, Guan-Sheng Zheng, and Yue-cheng Yu. Regularization path for \u03bd-support vector classification. IEEE Transactions on Neural Networks and Learning Systems, 23(5):800\u2013811, 2012.\n[10] Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391\u20131415, 2004.\n[11] Martin Jaggi and Marek Sulovsk\u00fd. A simple algorithm for nuclear norm regularized problems. In Proceedings of the International Conference on Machine Learning (ICML), pages 471\u2013478, 2010.\n[12] Masayuki Karasuyama and Ichiro Takeuchi. Suboptimal solution path algorithm for support vector machine. In Proceedings of the International Conference on Machine Learning (ICML), pages 473\u2013480, 2011.\n[13] S\u00f6ren Laue. A hybrid algorithm for convex semidefinite optimization. In Proceedings of the International Conference on Machine Learning (ICML), 2012.\n[14] Julien Mairal and Bin Yu. Complexity analysis of the Lasso regularization path. In Proceedings of the International Conference on Machine Learning (ICML), 2012.\n[15] A. Wayne Roberts and Dale Varberg. Convex Functions. Academic Press, New York, 1973.\n[16] Saharon Rosset and Ji Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35(3):1012\u20131030, 2007.\n[17] Nathan Srebro, Jason D. M. Rennie, and Tommi Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17 (NIPS), 2004.\n[18] Zhi-li Wu, Aijun Zhang, Chun-hung Li, and Agus Sudjianto. Trace solution paths for SVMs via parametric quadratic programming. In KDD Workshop: Data Mining Using Matrices and Tensors, 2008.\n", "award": [], "sourceid": 1034, "authors": [{"given_name": "Joachim", "family_name": "Giesen", "institution": null}, {"given_name": "Jens", "family_name": "Mueller", "institution": null}, {"given_name": "Soeren", "family_name": "Laue", "institution": null}, {"given_name": "Sascha", "family_name": "Swiercy", "institution": null}]}