{"title": "Support Vector Machines on a Budget", "book": "Advances in Neural Information Processing Systems", "page_first": 345, "page_last": 352, "abstract": null, "full_text": "Support Vector Machines on a Budget\n\nOfer Dekel and Yoram Singer\n\nSchool of Computer Science and Engineering\n\nThe Hebrew University\nJerusalem 91904, Israel\n\n{oferd,singer}@cs.huji.ac.il\n\nAbstract\n\nThe standard Support Vector Machine formulation does not provide its user with\nthe ability to explicitly control the number of support vectors used to de\ufb01ne the\ngenerated classi\ufb01er. We present a modi\ufb01ed version of SVM that allows the user\nto set a budget parameter B and focuses on minimizing the loss attained by the B\nworst-classi\ufb01ed examples while ignoring the remaining examples. This idea can\nbe used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we\nobtain these new SVM variants by replacing the 1-norm in the standard SVM for-\nmulation with various interpolation-norms. We also adapt the SMO optimization\nalgorithm to our setting and report on some preliminary experimental results.\n\n1\n\nIntroduction\n\n(cid:1)(cid:2)\n\nThe L1 Support Vector Machine (L1-SVM or SVM for short) [1, 2, 3] is a powerful tech-\nnique for learning binary classi\ufb01ers from examples. Given a training set {(xi, yi)}m\ni=1 and\nthe SVM solution is a hypothesis of the form h(x) =\na positive semi-de\ufb01nite kernel K,\n, where S is a subset of {1, . . . , m}, {\u03b1i}i\u2208S are real valued weights,\nsign\nand b is a bias term. The set S de\ufb01nes the support of the classi\ufb01er, namely, the set of examples that\nactively participate in the classi\ufb01er\u2019s de\ufb01nition. 
The examples in this set are called support vectors,\nand we say that the SVM solution is sparse if the fraction of support vectors (|S|/m) is reasonably\nsmall.\n\ni\u2208S \u03b1iyiK(xi, x) + b\n\n(cid:3)\n\nOur \ufb01rst concern is usually with the accuracy of the classi\ufb01er. However, in some applications, the\nsize of the support is equally important. Assuming that the kernel operator K can be evaluated in\nconstant time, the time-complexity of evaluating the classi\ufb01er on a new instance is linear in the size\nof S. Therefore, a large support de\ufb01nes a slow classi\ufb01er. Classi\ufb01cation speed is often important\nand plays an especially critical role in real-time systems. For example, a classi\ufb01er that drives a\nphoneme detector in a speech recognition system is evaluated hundreds of times a second. If this\nclassi\ufb01er does not manage to keep up with the rate at which the speech signal is acquired then\nits classi\ufb01cations are useless, regardless of their accuracy. The size of the support also naturally\ndetermines the amount of memory required to store the classi\ufb01er. If a classi\ufb01er is intended to run\nin a device with a limited memory, such as a mobile telephone, there may be a physical limit on\nthe amount of memory available to store support vectors. The size of S may also effect the time\nrequired to train an SVM classi\ufb01er. Most modern SVM learning algorithms are active set methods,\nnamely, on every step of the training process, only a small set of active training examples are taken\ninto account. Knowing the size of S ahead of time would enable us to optimize the size of the active\nset and possibly gain a signi\ufb01cant speed-up in the training process.\n\nThe SVM mechanism does not give us explicit control over the size of the support. 
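The linear dependence of classification time on |S| is easy to see from the kernel expansion above. The following minimal sketch (Python with numpy; a Gaussian kernel and the function names are our own illustrative choices, not part of the paper) evaluates h(x) with one kernel call per support vector:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)); an assumed example kernel."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def classify(x, support_x, support_alpha_y, b, kernel=gaussian_kernel):
    """h(x) = sign( sum_{i in S} alpha_i y_i K(x_i, x) + b ).

    Cost: one kernel evaluation per support vector, i.e. O(|S|).
    """
    score = sum(ay * kernel(xi, x) for xi, ay in zip(support_x, support_alpha_y))
    return 1 if score + b >= 0 else -1
```

Halving the support size halves both the evaluation time and the memory needed to store the classifier, which is precisely why an explicit budget on |S| is attractive.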
The user-defined parameters of SVM have some influence on the size of S, but we often require more than this. Specifically, we would like the ability to specify a budget parameter, B, which directly controls the number of support vectors used to define the SVM solution. In this paper, we address this issue and present budget-SVM, a minor modification to the standard L1-SVM formulation that allows the user to set a budget parameter. The budget-SVM optimization problem focuses only on the B worst-classified examples in the training set, ignoring all other examples.

The problem of sparsity becomes even more critical when it comes to L2-SVM [3], a variant of the SVM problem that tends to have dense solutions. L2-SVM is sometimes preferred over L1-SVM because it exhibits good generalization properties, as well as other desirable statistical characteristics [4]. We derive the budget-L2-SVM formulation by following the same technique used to derive budget-L1-SVM.

The technique used to derive these SVM variants is as follows. We begin by generalizing the L1-SVM formulation by replacing the 1-norm with an arbitrary norm. We obtain a general framework for SVM-type problems, which we nickname Any-Norm-SVM. Next, we turn to the K-method of norm interpolation to obtain the 1−∞ interpolation-norm and the 2−∞ interpolation-norm, and use these norms in the Any-Norm-SVM framework. These norms have the property that they depend only on the absolutely-largest elements of the vector. We rely on this property and show that our SVM variants construct sparse solutions. For each of these norms, we present a simple modification of the SMO algorithm [5], which efficiently solves the respective optimization problem.

Related Work  The problem of approximating the SVM solution using a reduced set of examples has received much previous attention [6, 7, 8, 9].
This technique takes a two-step approach: begin by training a standard SVM classifier, perhaps obtaining a dense solution. Then, try to find a sparse classifier which minimizes the L2 distance to the SVM solution. A potential drawback of this approach is that once the SVM solution has been found, the distribution from which the training set was sampled no longer plays a role in the learning process. This ignores the fact that shifting the SVM classifier by a fixed amount in different directions may have dramatically different consequences on classification accuracy. We overcome this problem by taking the approach of [10] and reformulating the SVM optimization problem itself in a way that promotes sparsity. Another technique used to obtain a sparse kernel-machine takes advantage of the inherent sparsity of linear programming solutions, and formalizes the kernel-machine learning problem as a linear program [11]. This approach, often called LP-SVM or Sparse-SVM, has been shown to generally construct sparse solutions, but still lacks the ability to introduce an explicit budget parameter. Yet another approach involves randomly selecting a subset of the training set to serve as support vectors [12]. The problem of learning a kernel-machine on a budget also appears in the online-learning mistake-bound framework, and it is there that the term "learning on a budget" was coined [13]. Two recent papers [14, 15] propose online kernel-methods on a budget with an accompanying theoretical mistake-bound.

This paper is organized as follows. We present the generalized Any-Norm-SVM framework in Sec. 2. We discuss the K-method of norm interpolation in Sec. 3 and put various interpolation norms to use within the Any-Norm-SVM framework in Sec. 4. Then, in Sec. 5, we present some preliminary experiments that demonstrate how the theoretical properties of our approach translate into practice.
We conclude with a discussion in Sec. 6. Due to the lack of space, some of the proofs are omitted from this paper.

2 Any-Norm SVM

Let {(x_i, y_i)}_{i=1}^m be a training set, where every x_i belongs to an instance space X and every y_i ∈ {−1, +1}. Let K : X × X → R be a positive semi-definite kernel, and let H be its corresponding Reproducing Kernel Hilbert Space (RKHS) [16], with inner product ⟨·,·⟩_H. The L1 Support Vector Machine is defined as the solution to the following convex optimization problem:

min_{f∈H, b∈R, ξ≥0}  (1/2)⟨f, f⟩_H + C‖ξ‖_1   s.t.  ∀ 1 ≤ i ≤ m :  y_i( f(x_i) + b ) ≥ 1 − ξ_i ,    (1)

where ξ is a vector of m slack variables, and C is a positive constant that controls the tradeoff between the complexity of the learned classifier and how well it fits the training data. The value of ξ_i is sometimes referred to as the hinge-loss attained by the SVM classifier on example i. The 1-norm, defined by ‖ξ‖_1 = Σ_{i=1}^m |ξ_i|, is used to combine the individual hinge-loss values into a single number.

L2-SVM is a variant of the optimization problem defined above, defined as follows:

min_{f∈H, b∈R, ξ≥0}  (1/2)⟨f, f⟩_H + C‖ξ‖_2²   s.t.  ∀ 1 ≤ i ≤ m :  y_i( f(x_i) + b ) ≥ 1 − ξ_i .

This formulation differs from the L1 formulation in that the 1-norm is replaced by the squared 2-norm, defined by ‖ξ‖_2² = Σ_{i=1}^m ξ_i². In this section, we take this idea even farther, and allow the 1-norm of L1-SVM to be replaced by any norm. Formally, let ‖·‖ be an arbitrary norm defined on R^m. Recall that a norm is a real valued operator such that for every v ∈ R^m and λ ∈ R it holds that ‖λv‖ = |λ|‖v‖ (positive homogeneity), ‖v‖ ≥ 0 and ‖v‖ = 0 if and only if v = 0 (positive definiteness), and that satisfies the triangle inequality. Now consider the following optimization problem:

min_{f∈H, b∈R, ξ≥0}  (1/2)⟨f, f⟩_H + C‖ξ‖   s.t.  ∀ 1 ≤ i ≤ m :  y_i( f(x_i) + b ) ≥ 1 − ξ_i .    (2)

L1-SVM is recovered by setting ‖·‖ to be the 1-norm. Setting ‖·‖ to be the 2-norm induces an optimization problem which is close in nature to L2-SVM, but not identical to it since the 2-norm is not squared. Combining the positive homogeneity property of ‖·‖ with the fact that it satisfies the triangle inequality ensures that the objective function of Eq. (2) is convex.

An important class of norms used extensively in our derivation is the family of p-norms, defined for every p ≥ 1 by ‖v‖_p = ( Σ_{j=1}^m |v_j|^p )^{1/p}. A special member of this family is the ∞-norm, which is defined by ‖v‖_∞ = lim_{p→∞} ‖v‖_p and can be shown to be equivalent to max_j |v_j|. We also use the notion of norm duality. Every norm on R^m has a dual norm which is also defined on R^m. The dual norm of ‖·‖ is denoted by ‖·‖* and given by

‖u‖* = max_{v∈R^m : ‖v‖=1}  u · v .    (3)

As its name implies, ‖·‖* also satisfies the requirements of a norm. For example, Hölder's inequality [17] states that the dual of ‖·‖_p is the norm ‖·‖_q, where q = p/(p−1).
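This duality can be checked numerically: u·v never exceeds ‖u‖_q over unit-p-norm vectors v, and the bound is attained at v_j ∝ sign(u_j)|u_j|^{q−1}. A small sketch (Python with numpy; the helper names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_norm(v, p):
    return np.sum(np.abs(v) ** p) ** (1.0 / p)

def dual_norm_p(u, p):
    """By Holder's inequality, the dual of the p-norm is the q-norm, q = p/(p-1)."""
    q = p / (p - 1.0)
    return p_norm(u, q)

u = rng.standard_normal(6)
p = 3.0
dual = dual_norm_p(u, p)

# u.v <= ||u||_q for every v with ||v||_p = 1 ...
for _ in range(1000):
    v = rng.standard_normal(6)
    v /= p_norm(v, p)
    assert u @ v <= dual + 1e-9

# ... and the bound is attained at v_j proportional to sign(u_j)|u_j|^{q-1}.
q = p / (p - 1.0)
v_star = np.sign(u) * np.abs(u) ** (q - 1.0)
v_star /= p_norm(v_star, p)
assert abs(u @ v_star - dual) < 1e-9
```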
The dual of the 1-norm is the ∞-norm and vice versa.

Using the definition of the dual norm, we now state the dual optimization problem of Eq. (2):

max_{α≥0}  Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y_i y_j K(x_i, x_j)   s.t.  Σ_{i=1}^m y_i α_i = 0  and  ‖α‖* ≤ C .    (4)

As a first sanity check, note that if ‖·‖ in Eq. (2) is chosen to be the 1-norm, then ‖·‖* is the ∞-norm, and the constraint ‖α‖* ≤ C reduces to the familiar box-constraint of L1-SVM [3]. The proof that Eq. (2) and Eq. (4) are indeed dual optimization problems relies on basic techniques in convex analysis [18], and is omitted due to the lack of space. Moreover, it can be shown that the solution to Eq. (2) takes the form f(·) = Σ_{i=1}^m α_i y_i K(x_i, ·), and that strong duality holds regardless of the norm used. This allows us to forget about the primal problem in Eq. (2) and to focus on solving the dual problem in Eq. (4). As with L1-SVM, the bias term, b, cannot be directly extracted from the solution of the dual. The standard techniques used to find b in L1-SVM apply here as well [3].

We note that the Any-Norm-SVM formulation is not fundamentally different from the original L1-SVM formulation. Both optimization problems have convex objective functions and linear constraints. More importantly, the only difference between their respective duals is in the dual-norm constraint. Specifically, the objective function in Eq. (4) is a concave quadratic function for any choice of ‖·‖. These facts enable us to efficiently solve the problem in Eq. (4) for any kernel K and any norm using techniques similar to those used to solve the standard L1-SVM problem.

3 Interpolation Norms

In the previous section, we acquired the ability to replace the 1-norm in the definition of L1-SVM with an arbitrary norm. We now use Peetre's K-method of norm interpolation [19] to obtain norms that promote the sparsity of the generated classifier. The K-method is a technique for smoothly interpolating between a pair of norms. Let ‖·‖_{p1} : R^m → R+ and ‖·‖_{p2} : R^m → R+ be two p-norms, and let ‖·‖_{q1} and ‖·‖_{q2} be their respective duals. Peetre's K-functional with respect to p1 and p2, and with respect to the constant t > 0, is defined to be

‖v‖_{K(p1,p2,t)} = min_{w,z : w+z=v}  ( ‖w‖_{p1} + t‖z‖_{p2} ) .    (5)

Peetre's J-functional with respect to q1, q2, and with respect to the constant s > 0, is given by

‖u‖_{J(q1,q2,s)} = max{ ‖u‖_{q1} , s‖u‖_{q2} } .    (6)

The J-functional is obviously a norm: the properties of a norm all follow immediately from the fact that ‖·‖_{q1} and ‖·‖_{q2} possess these properties. ‖·‖_{K(p1,p2,t)} is also a norm, and moreover, ‖·‖_{K(p1,p2,t)} and ‖·‖_{J(q1,q2,s)} are dual to each other when t = 1/s. This fact can be proven using elementary calculus, and this proof is omitted due to the lack of space.

We use the K-method to interpolate between the 1-norm and the ∞-norm, and to interpolate between the 2-norm and the ∞-norm.
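The 1−∞ K-functional of Eq. (5) can be evaluated by a one-dimensional search: for a fixed threshold φ = ‖z‖_∞, the best split is the soft-threshold w_i = sign(v_i)·max(|v_i| − φ, 0), and the resulting objective is piecewise linear in φ. The sketch below (Python with numpy; function names are our own) evaluates the norm this way and compares it with the sum of the B absolutely-largest entries, anticipating the closed form established in Thm. 1:

```python
import numpy as np

def k_norm_1_inf(v, B):
    """||v||_{K(1,inf,B)} = min over splits w + z = v of ||w||_1 + B * ||z||_inf.

    Fixing phi = ||z||_inf, the optimal split is the soft-threshold
    w_i = sign(v_i) * max(|v_i| - phi, 0), so g(phi) = sum_i max(|v_i| - phi, 0)
    + B * phi is piecewise linear and minimized at a breakpoint phi in {0} U {|v_i|}.
    """
    a = np.abs(v)
    breakpoints = np.concatenate(([0.0], a))
    return min(np.maximum(a - phi, 0.0).sum() + B * phi for phi in breakpoints)

def top_B_sum(v, B):
    """Sum of the B absolutely-largest entries of v (the closed form of Thm. 1)."""
    return np.sort(np.abs(v))[::-1][:B].sum()

v = np.array([0.5, -3.0, 1.5, 0.25, -2.0])
for B in range(1, 6):
    assert abs(k_norm_1_inf(v, B) - top_B_sum(v, B)) < 1e-12
```

With B = 1 the norm collapses to ‖v‖_∞, and with B = m it is exactly ‖v‖_1, matching the interpolation range discussed next.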
To gain some intuition on the behavior of these interpolation-norms, first note that for any p ≥ 1 and any v ∈ R^m it holds that max_i |v_i|^p ≤ Σ_{i=1}^m |v_i|^p ≤ m·max_i |v_i|^p, and therefore ‖v‖_∞ ≤ ‖v‖_p ≤ m^{1/p}‖v‖_∞. An immediate consequence of this is that ‖·‖_{K(p,∞,t)} ≡ ‖·‖_∞ when 0 < t ≤ 1 and that ‖·‖_{K(p,∞,t)} ≡ ‖·‖_p when m^{1/p} ≤ t. In other words, the interesting range of t for the 1−∞ interpolation-norm is [1, m], and for the 2−∞ interpolation-norm is [1, √m].

Next, we prove a theorem which states that interpolating a p-norm with the ∞-norm is approximately equivalent to restricting that p-norm to the absolutely-largest components of the vector. Specifically, the 1−∞ interpolation norm with parameter t (with t chosen to be an integer in [1, m]) is precisely equivalent to taking the sum of the absolute values of the t absolutely-greatest elements of the vector.

Theorem 1. Let v be an arbitrary vector in R^m and let π be a permutation on {1, ..., m} such that |v_π(1)| ≥ ... ≥ |v_π(m)|. Then for any integer B in {1, ..., m} it holds that ‖v‖_{K(1,∞,B)} = Σ_{i=1}^B |v_π(i)|, and for any 1 ≤ p < ∞, if t = B^{1/p} then it holds that

( Σ_{i=1}^B |v_π(i)|^p )^{1/p} ≤ ‖v‖_{K(p,∞,t)} ≤ ( Σ_{i=1}^B |v_π(i)|^p )^{1/p} + B^{1/p}|v_π(B)| .

Proof. Beginning with the lower bound, let w and z be such that w + z = v. Then

( Σ_{i=1}^B |v_π(i)|^p )^{1/p} = ( Σ_{i=1}^B |w_π(i) + z_π(i)|^p )^{1/p}
  ≤ ( Σ_{i=1}^B |w_π(i)|^p )^{1/p} + ( Σ_{i=1}^B |z_π(i)|^p )^{1/p}
  ≤ ( Σ_{i=1}^B |w_π(i)|^p )^{1/p} + ( B max_i |z_i|^p )^{1/p}
  ≤ ( Σ_{i=1}^m |w_i|^p )^{1/p} + t‖z‖_∞ ,

where the first inequality is the triangle inequality for the p-norm. Since the above holds for any w and z such that w + z = v, it also holds for the pair which minimizes ( Σ_{i=1}^m |w_i|^p )^{1/p} + t‖z‖_∞, and which defines ‖v‖_{K(p,∞,t)}. Therefore, we have that

( Σ_{i=1}^B |v_π(i)|^p )^{1/p} ≤ ‖v‖_{K(p,∞,t)} .    (7)

Turning to the upper bound, let φ = |v_π(B)|, and define for all 1 ≤ i ≤ m, w̄_i = sign(v_i) max{0, |v_i| − φ} and z̄_i = sign(v_i) min{|v_i|, φ}. Note that w̄ + z̄ = v, and that

Σ_{i=1}^B |v_π(i)| = ‖w̄‖_1 + B‖z̄‖_∞ .

This proves that ‖v‖_{K(1,∞,B)} ≤ Σ_{i=1}^B |v_π(i)|, and together with Eq. (7) we have proven our claim for p = 1. Moving on to the case of an arbitrary p, we have that

‖v‖_{K(p,∞,t)} = min_{w+z=v} ( ‖w‖_p + t‖z‖_∞ ) ≤ ‖w̄‖_p + t‖z̄‖_∞ .

Since the absolute value of each element in w̄ is at most as large as the absolute value of the corresponding element of v, and since w̄_π(B+1) = ... = w̄_π(m) = 0, we have that ‖w̄‖_p ≤ ( Σ_{i=1}^B |v_π(i)|^p )^{1/p}. By definition, ‖z̄‖_∞ = φ = |v_π(B)|. This proves that ‖v‖_{K(p,∞,t)} ≤ ( Σ_{i=1}^B |v_π(i)|^p )^{1/p} + t|v_π(B)|, and together with Eq. (7) this concludes our proof for arbitrary p.

4 Deriving Concrete Algorithms from the General Framework

Our first concrete algorithm is budget-L1-SVM, obtained by plugging the 1−∞ interpolation-norm with parameter B into the general Any-Norm-SVM framework. Relying on Thm. 1, we know that this norm takes into account only the B largest values in ξ. Since ξ measures how badly each example is misclassified, the budget-L1-SVM problem essentially optimizes the soft-margin with respect to the B worst-classified examples. We now show that this property promotes the sparsity of the budget-L1-SVM solution.

If there are fewer than B examples for which y_i(f(x_i) + b) < 1, then the KKT conditions of optimality immediately imply that the number of support vectors is less than B. This holds true for every instance of the Any-Norm-SVM framework, and is proven for L1-SVM in [3]. Therefore, we focus on the more interesting case, where y_i(f(x_i) + b) < 1 for at least B examples.

Theorem 2. Let B be an integer in {1, ..., m}. Let (f, b, ξ, α) be an optimal primal-dual solution of the primal problem in Eq. (2) and the dual problem in Eq. (4), where ‖·‖ is chosen to be the 1−∞ interpolation-norm with parameter B. Define μ_i = y_i(f(x_i) + b) and let π be a permutation of {1, ..., m} such that μ_π(1) ≤ ... ≤ μ_π(m). Assume that μ_π(B) < 1. Then, α_k = 0 if μ_π(B) < μ_k.

Proof. We begin the proof by redefining ξ_i = max{1 − μ_i, 0} for all 1 ≤ i ≤ m and noting that (f, b, ξ, α) remains a primal-dual solution to our problem. The benefit of starting with this specific solution is that ξ_π(1) ≥ ... ≥ ξ_π(m). Let k be an index such that μ_π(B) < μ_k, and define ξ'_k = (1/2)(ξ_k + ξ_π(B)). Moreover, let ξ' be the vector obtained by replacing the k'th coordinate in ξ with ξ'_k, or in other words, ξ' = (ξ_1, ..., ξ'_k, ..., ξ_m). Using the assumption that μ_π(B) < 1, we know that ξ_π(B) > 0, and since μ_k > μ_π(B) we get that ξ_k < ξ_π(B). We can now draw two conclusions. First, ξ_π(1) ≥ ... ≥ ξ_π(B) > ξ'_k, and therefore ‖ξ'‖_{K(1,∞,B)} = ‖ξ‖_{K(1,∞,B)}. Second, ξ_k < ξ'_k, and therefore ξ' satisfies the constraints of Eq. (2). Overall, we obtain that (f, b, ξ', α) is also a primal-dual solution to our problem. Moreover, we know that 1 − μ_k < ξ'_k. Using the KKT complementary slackness condition, it follows that α_k, the Lagrange multiplier corresponding to this constraint, must equal 0.

Defining μ_i and π as above, a simple corollary of Thm. 2 is that the number of support vectors is upper bounded by B in the case that μ_π(B) ≠ μ_π(B+1).

From our discussion in Sec. 3, we know that the dual of the 1−∞ interpolation-norm is the function max{‖u‖_∞, (1/B)‖u‖_1}. Plugging this definition into Eq. (4) gives us the dual optimization problem of budget-L1-SVM. The constraint ‖α‖* ≤ C simplifies to α_i ≤ C for all i and Σ_{i=1}^m α_i ≤ BC. To numerically solve this optimization problem, we turn to the Sequential Minimal Optimization (SMO) [5] technique. We briefly describe the SMO technique, and then discuss its adaptation to our setting. SMO is an iterative process, which on every iteration selects a pair of dual variables, α_k and α_l, and optimizes the dual problem with respect to them, leaving all other variables fixed.
The choice of the two variables is determined by a heuristic [5], and their optimal values are calculated analytically. Assume that we start with a vector α which is a feasible point of the optimization problem in Eq. (4). When restricted to the two active variables, α_k and α_l, the constraint Σ_{i=1}^m α_i y_i = 0 simplifies to α_k^new y_k + α_l^new y_l = α_k^old y_k + α_l^old y_l. Put another way, we can slightly overload our notation and define the linear functions

α_k(λ) = α_k + λy_k   and   α_l(λ) = α_l − λy_l ,    (8)

and find the single variable λ which maximizes our constrained optimization problem. Since the constraints in Eq. (4) define a convex and bounded feasible set, the intersection of the linear equalities in Eq. (8) with this feasible set restricts λ to an interval. The objective function, as a function of the single variable λ, takes the form O(λ) = Pλ² + Qλ + c, where c is a constant,

P = K(x_k, x_l) − (1/2)K(x_k, x_k) − (1/2)K(x_l, x_l) ,   Q = ( y_k − f(x_k) ) − ( y_l − f(x_l) ) ,

and f is the current function in the RKHS (f ≡ Σ_{i=1}^m α_i y_i K(x_i, ·)). Maximizing the objective function in Eq. (4) with respect to α_k and α_l is equivalent to maximizing O(λ) with respect to λ over an interval. P equals minus one half of the squared Euclidean distance between the functions K(x_k, ·) and K(x_l, ·) in the RKHS, and is therefore a negative number. Therefore, O(λ) is a concave function which attains a single (unconstrained) maximum. This maximum can be found analytically by

0 = ∂O(λ)/∂λ = 2Pλ + Q  ⇒  λ = −Q/(2P) .    (9)

If this unconstrained optimum falls inside the feasible interval, then it is equivalent to the constrained optimum.
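The analytic step just described reduces to one line of arithmetic plus a clip. A minimal sketch (Python; the interval end-points lo and hi are assumed to have already been derived from the constraints of Eq. (4), as detailed next):

```python
def smo_step_lambda(P, Q, lo, hi):
    """One pairwise update for the pair (alpha_k, alpha_l).

    O(lam) = P*lam**2 + Q*lam + c is concave since P < 0, so we take the
    unconstrained maximizer lam = -Q / (2*P) and clip it to the feasible
    interval [lo, hi] implied by the box and budget constraints.
    """
    assert P < 0 and lo <= hi
    lam = -Q / (2.0 * P)
    return min(max(lam, lo), hi)
```

For example, with P = −2 and Q = 4 the unconstrained maximizer is λ = 1, which is returned unchanged whenever the interval contains it.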
Otherwise, the constrained optimum falls on one of the two end-points of the interval. Thus, we are left with the task of finding these end-points. To do so, we consider the remaining constraints:

(I)  α_k(λ) ≥ 0 ,  α_l(λ) ≥ 0      (II)  α_k(λ) ≤ C ,  α_l(λ) ≤ C      (III)  α_k(λ) + α_l(λ) ≤ BC − Σ_{i≠k,l} α_i .

The constraints in (I) translate to

y_k = +1 ⇒ λ ≥ −α_k      y_k = −1 ⇒ λ ≤ α_k
y_l = +1 ⇒ λ ≤ α_l       y_l = −1 ⇒ λ ≥ −α_l .    (10)

The constraints in (II) translate to

y_k = +1 ⇒ λ ≤ C − α_k   y_k = −1 ⇒ λ ≥ α_k − C
y_l = +1 ⇒ λ ≥ α_l − C   y_l = −1 ⇒ λ ≤ C − α_l .    (11)

Constraint (III) translates to

y_k = +1 ∧ y_l = −1 ⇒ λ ≤ (1/2)( BC − Σ_{i=1}^m α_i )
y_k = −1 ∧ y_l = +1 ⇒ λ ≥ (1/2)( Σ_{i=1}^m α_i − BC ) .    (12)

Finding the end-points of the interval that confines λ amounts to finding the smallest upper bound and the greatest lower bound in Eqs. (10,11,12). This concludes the analytic derivation of the SMO update for budget-L1-SVM.

L2-SVM on a budget  Next, we use the 2−∞ interpolation-norm with parameter t = √B in the Any-Norm-SVM framework, and obtain the budget-L2-SVM problem. Thm. 1 hints that setting t = √B makes the 2−∞ interpolation-norm almost equivalent to restricting the 2-norm to the top B elements in the vector ξ. The support size of the budget-L2-SVM solution is strongly correlated with the parameter B, although the exact relation between the two is not as clear as before. Again we begin with the dual formulation defined in Eq. (4), where the constraint ‖α‖* ≤ C becomes max{‖α‖_2, (1/√B)‖α‖_1} ≤ C. The intersection of this constraint with the other constraints defines a convex and bounded feasible set, and its intersection with the linear equalities in Eq. (8) defines an interval. The objective function in Eq. (4) is the same as before, so the unconstrained maximum is once again given by Eq. (9). To obtain the constrained maximum, we must find the end-points of the interval that confines λ. The dual-norm constraint can be written more explicitly as

(I)  α_k(λ) + α_l(λ) ≤ √B·C − Σ_{i≠k,l} α_i      (II)  α_k²(λ) + α_l²(λ) ≤ C² − Σ_{i≠k,l} α_i² .

Constraint (I) is similar to the constraint we had in the budget-L1-SVM case, and is given in terms of λ by replacing B with √B in Eq. (12). Constraint (II) is new, and can be written in terms of λ as λ² + λβ + γ ≤ 0, where β = α_k y_k − α_l y_l and γ = (1/2)( Σ_{i=1}^m α_i² − C² ). It can be written even more explicitly as

λ ≤ (1/2)( −β + √(β² − 4γ) )   and   λ ≥ (1/2)( −β − √(β² − 4γ) ) .    (13)

In addition, we still have the constraint α ≥ 0, which is common to every instance of the Any-Norm-SVM framework. This constraint is given in terms of λ in Eq. (10). Overall, the end-points of the interval we are searching for are found by taking the smallest upper bound and the greatest lower bound in Eqs. (10,13) and Eq. (12) with B replaced by √B.

Figure 1: Average test error of budget-L1-SVM (left) and budget-L2-SVM (right) for different values of the budget parameter B and the pruning parameter s (all but s weights in α are set to zero). The test error in the darkest region is roughly 50%, and in the lightest region is roughly 5%.

5 Experiments

Many existing solvers for the standard L1-SVM problem define a positive threshold value close to zero and replace every weight that falls below this threshold with zero. This heuristic significantly reduces the time required for the algorithm to converge. In our setting, a more natural way to speed up the learning process is to run the iterative SMO optimization algorithm for a fixed number of iterations and then to keep only the B largest weights, setting the m − B remaining weights to zero. This pruning heuristic enforces the budget constraint in a brute-force way, and can be equally applied to any kernel-machine. However, the natural question is how much the pruning heuristic affects the classification accuracy of the kernel-machine it is applied to. If our technique indeed lives up to its theoretical promise, we expect the pruning heuristic to have little impact on classification accuracy. On the other hand, if we train an L1-SVM and it so happens that the number of large weights exceeds B, then applying the pruning heuristic should have a dramatic negative effect on classification accuracy. The goal of our experiments is to demonstrate that this behavior indeed occurs in practice.

We conducted our experiments using the MNIST dataset, which contains handwritten digits from the 10 digit classes.
We randomly generated 50 binary classification problems by first randomly partitioning the 10 classes into two equally sized sets, and then randomly choosing a training set of 1000 examples and a test set of 4000 examples. The results reported below are averaged over these 50 problems. Although MNIST is generally thought to induce easy learning problems, the method described above generates moderately difficult learning tasks.

For each binary problem, we trained both the L1 and the L2 budget SVMs with B = 20, 40, ..., 1000. Note that ‖ξ‖_{K(1,∞,B)} grows roughly linearly with B, and that ‖ξ‖_{K(2,∞,√B)} grows roughly like the square root of B. To compensate for this, we set C = 10/B in the L1 case and C = 10/√B in the L2 case. This heuristic choice of C attempts to preserve the relative weight of the regularization term with respect to the norm term in Eq. (2), across the various values of B. In all of our experiments, we used a Gaussian kernel with σ = 1 (after scaling the data to have an average unit norm). For each classifier trained, we pruned away all but the s largest weights, with s = 20, 40, ..., 1000, and calculated the test error. The average test error for every choice of B (the budget parameter in the optimization problem) and s (the number of non-zero weights kept) is summarized in Fig. 1. In practice, s and B should be equal; however, we let s take different values in our experiment to illustrate the characteristics of our approach. Note that the test error attained by L1-SVM (without a budget parameter) and by L2-SVM is represented by the top-right corner of the respective plot.

As expected, classification accuracy for any value of B deteriorates as s becomes small.
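The pruning step itself is simple: zero out all but the s absolutely-largest weights. A sketch (Python with numpy; the function name is our own):

```python
import numpy as np

def prune_weights(alpha, s):
    """Keep the s absolutely-largest entries of alpha; zero out the rest."""
    alpha = np.asarray(alpha, dtype=float)
    if s >= alpha.size:
        return alpha.copy()
    keep = np.argsort(np.abs(alpha))[-s:]   # indices of the s largest |alpha_i|
    pruned = np.zeros_like(alpha)
    pruned[keep] = alpha[keep]
    return pruned
```

Applied to a budget-SVM solution with s = B, this discards only weights that the optimization already drove toward zero, which is why its effect on accuracy is expected to be small.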
However, the accuracy attained by L1-SVM and L2-SVM can be equally attained using significantly fewer support vectors.

6 Discussion

Using the Any-Norm-SVM framework with interesting norms enabled us to introduce a budget parameter to the SVM formulation. However, the Any-Norm framework can be used for other tasks as well. For example, we can interpolate between L1-SVM and L2-SVM by using the 1-2 interpolation-norm. This gives the user the explicit ability to balance the trade-off between the pros and cons of these two SVM variants. In [20] it is shown that there exists a constant c such that

c ‖v‖_{K(1,2,√r)} ≤ Σ_{j=1}^{r} |v_j| + √r ( Σ_{j=r+1}^{m} v_j² )^{1/2} ≤ ‖v‖_{K(1,2,√r)} .

These bounds give some insight into how such an interpolation would behave. Another possible norm that can be used in our framework is the Mahalanobis norm (‖v‖ = (vᵀMv)^{1/2}, where M is a positive definite matrix), which would define a loss function that takes into account pair-wise relationships between examples.

Regarding our experiments, the rule-of-thumb we used to choose the parameter C is not always optimal. It seems preferable to tune C individually for each B using cross-validation.

We are currently exploring extensions to our SMO variant that would quickly converge to the sparse solution without the help of the pruning heuristic. We are also considering multiplicative-update optimization algorithms as an alternative to SMO.

References

[1] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proc. of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144–152, 1992.

[2] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.

[3] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines.
Cambridge University Press, 2000.

[4] P. Bartlett and A. Tewari. Sparseness vs estimating conditional probabilities: Some asymptotic results. In Proc. of the Seventeenth Annual Conference on Computational Learning Theory, pages 564–578, 2004.

[5] J. C. Platt. Fast training of Support Vector Machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

[6] C. J. C. Burges. Simplified support vector decision rules. In Proc. of the Thirteenth International Conference on Machine Learning, pages 71–77, 1996.

[7] E. Osuna and F. Girosi. Reducing the run-time complexity of support vector machines. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 271–284. MIT Press, 1999.

[8] B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. J. Smola. Input space versus feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000–1017, September 1999.

[9] J.-H. Chen and C.-S. Chen. Reducing SVM classification time using multiple mirror classifiers. IEEE Transactions on Systems, Man and Cybernetics - Part B: Cybernetics, 34(2):1173–1183, April 2004.

[10] M. Wu, B. Schölkopf, and G. Bakir. A direct method for building sparse kernel learning algorithms. Journal of Machine Learning Research, 7:603–624, 2006.

[11] K. P. Bennett. Combining support vector and mathematical programming methods for classification. In Advances in Kernel Methods: Support Vector Learning, pages 307–326. MIT Press, 1999.

[12] Y. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. In Proc. of the First SIAM International Conference on Data Mining, 2001.

[13] K. Crammer, J. Kandola, and Y. Singer. Online classification on a budget. In Advances in Neural Information Processing Systems 16, 2003.

[14] O. Dekel, S. Shalev-Shwartz, and Y. Singer. The Forgetron: A kernel-based perceptron on a fixed budget. In Advances in Neural Information Processing Systems 18, 2005.

[15] N. Cesa-Bianchi and C. Gentile. Tracking the best hyperplane with a simple budget perceptron. In Proc. of the Nineteenth Annual Conference on Computational Learning Theory, 2006.

[16] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, May 1950.

[17] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

[18] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[19] C. Bennett and R. Sharpley. Interpolation of Operators. Academic Press, 1998.

[20] T. Holmstedt. Interpolation of quasi-normed spaces. Mathematica Scandinavica, 26:177–190, 1970.