{"title": "Relative Margin Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 1481, "page_last": 1488, "abstract": "In classification problems, Support Vector Machines maximize the margin of separation between two classes. While the paradigm has been successful, the solution obtained by SVMs is dominated by the directions with large data spread and biased to separate the classes by cutting along large spread directions. This article proposes a novel formulation to overcome such sensitivity and maximizes the margin relative to the spread of the data. The proposed formulation can be efficiently solved and experiments on digit datasets show drastic performance improvements over SVMs.", "full_text": "Relative Margin Machines\n\nPannagadatta K Shivaswamy and Tony Jebara\n\nDepartment of Computer Science, Columbia University, New York, NY\n\npks2103,jebara@cs.columbia.edu\n\nAbstract\n\nIn classi\ufb01cation problems, Support Vector Machines maximize the margin\nof separation between two classes. While the paradigm has been success-\nful, the solution obtained by SVMs is dominated by the directions with\nlarge data spread and biased to separate the classes by cutting along large\nspread directions. This article proposes a novel formulation to overcome\nsuch sensitivity and maximizes the margin relative to the spread of the\ndata. The proposed formulation can be e\ufb03ciently solved and experiments\non digit datasets show drastic performance improvements over SVMs.\n\n1 Introduction\n\nThe goal of most machine learning problems is to generalize from a limited number of\ntraining examples. For example, in support vector machines [10] (SVMs) a hyperplane 1\nof the form w\u22a4x + b = 0, w \u2208 Rm, x \u2208 Rm, b \u2208 R is recovered as a decision boundary\nafter observing a limited number of training examples. 
The parameters of the hyperplane (w, b) are estimated by maximizing the margin (the distance between w\u22a4x + b = 1 and w\u22a4x + b = \u22121) while minimizing a weighted upper bound on the misclassification rate on the training data (the so-called slack variables). In practice, the margin is maximized by minimizing (1/2) w\u22a4w.\n\nWhile this works well in practice, we point out that merely changing the scale of the data can give a different solution. On one hand, an adversary can exploit this shortcoming to transform the data so as to give bad performance. More distressingly, this shortcoming can naturally lead to bad performance, especially in high dimensional settings. The key problem is that SVMs simply find a large margin solution giving no attention to the spread of the data. An excellent discriminator lying in a dimension with relatively small data spread may be easily overlooked by the SVM solution. In this paper, we propose novel formulations to overcome such a limitation. The crux here is to find the maximum margin solution with respect to the spread of the data in a relative sense rather than finding the absolute large margin solution.\n\nLinear discriminant analysis finds a projection of the data so that the inter-class separation is large while the within-class scatter is small. However, it only makes use of the first and the second order statistics of the data. Feature selection with SVMs [12] removes features that have low discriminative value. Ellipsoidal kernel machines [9] normalize data in feature space by estimating bounding ellipsoids. While these previous methods showed performance improvements, both relied on multiple-step locally optimal algorithms for interleaving spread information with margin estimation. Recently, additional examples were used to improve the generalization of the SVMs with so-called \u201cUniversum\u201d samples [11]. 
Instead of leveraging additional data or additional model assumptions such as axis-aligned feature selection, the proposed method overcomes what seems to be a fundamental limitation of the SVMs and subsequently yields improvements in the same supervised setting. In addition, the formulations derived in this paper are convex, can be efficiently solved and admit some useful generalization bounds.\n\n1 In this paper we use the dot product w\u22a4x with the understanding that it can be replaced with an inner product.\n\nNotation Boldface letters indicate vectors/matrices. For two vectors u \u2208 Rm and v \u2208 Rm, u \u2264 v indicates that ui \u2264 vi for all i from 1 to m. 1, 0 and I denote the vectors of all ones, all zeros and the identity matrix respectively. Their dimensions are clear from the context.\n\nFigure 1: Top: As the data is scaled along the x-axis, the SVM solution (red or dark shade) deviates from the maximum relative margin solution (green or light shade). Bottom: The projections of the examples in the top row on the real line for the SVM solution (red or dark shade) and the proposed classifier (green or light shade) in each case.\n\n2 Motivation with a two dimensional example\n\nLet us start with a simple two dimensional toy dataset to illustrate a problem with the SVM solution. Consider the binary classification example shown in the top row of Figure 1 where squares denote examples from one class and triangles denote examples from the other class. Consider the leftmost plot in the top row of Figure 1. 
One possible decision\nboundary separating the two classes is shown in green (or light shade). The solution shown\nin red (or dark shade) is the SVM estimate; it achieves the largest margin possible while\nstill separating both the classes. Is this necessarily \u201cthe best\u201d solution?\n\nLet us now consider the same set of points after scaling the x-axis in the second and the\nthird plots. With progressive scaling, the SVM increasingly deviates from the green solution,\nclearly indicating that the SVM decision boundary is sensitive to a\ufb03ne transformations of\nthe data and produces a family of di\ufb00erent solutions as a result. This sensitivity to scaling\nand a\ufb03ne transformations is worrisome. If there is a best and a worst solution in the family\nof SVM estimates, there is always the possibility that an adversary exploits this scaling such\nthat the SVM solution we recover is poor. Meanwhile, an algorithm producing the green\ndecision boundary remains resilient to such adversarial scalings.\n\nIn the previous example, a direction with a small spread in the data produced a good\ndiscriminator. Merely \ufb01nding a large margin solution, on the other hand, does not recover\nthe best possible discriminator. This particular weakness in large margin estimation has\nonly received limited attention in previous work. In the above example, suppose each class is\ngenerated from a one dimensional distribution on a line with the two classes on two parallel\nlines. In this case, the green decision boundary should obtain zero test error even if it is\nestimated from a \ufb01nite number of samples. However, for \ufb01nite training data, the SVM\nsolution will make errors and will do so increasingly as the data is scaled along the x-axis.\nUsing kernels and nonlinear mappings may help in some cases but might also exacerbate\nsuch problems. 
Similarly, simple preprocessing of the data (affine \u201cwhitening\u201d to make the dataset zero mean and unit covariance or scaling to place the data into a zero-one box) may fail to resolve such problems.\n\nFor more insight, consider the uni-dimensional projections of the data given by the green and red solutions in the bottom row of Figure 1. In the green solution, all points in the first class are mapped to a single coordinate and all points in the other class are mapped to another (distinct) coordinate. Meanwhile, the red solution produces more dispersed projections of the two classes. As the adversarial scaling is increased, the spread of the projection in the SVM solution increases correspondingly. Large margins are not sufficient on their own and what is needed is a way to also control the spread of the data after projection. Therefore, rather than just maximizing the margin, a trade-off regularizer should also be used to minimize the spread of the projected data. In other words, we will couple large margin estimation with regularization which seeks to bound the spread |w\u22a4x + b| of the data. This will allow the linear classifier to recover large margin solutions not in the absolute sense but rather relative to the spread of the data in that projection direction.\n\n3 Formulations\n\nGiven (xi, yi), i = 1, . . . , n, where xi \u2208 Rm and yi \u2208 {\u00b11} are drawn independently and identically distributed from a distribution Pr(x, y), the Support Vector Machine primal formulation 2 is as follows:\n\nmin_{w,b,\u03be\u22650} (1/2)||w||^2 + C\u03be\u22a41 s.t. yi(w\u22a4xi + b) \u2265 1 \u2212 \u03bei, \u22001 \u2264 i \u2264 n. (1)\n\nThe above formulation minimizes an upper bound on the misclassification while maximizing the margin (the two quantities are traded off by C). 
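To make the primal (1) concrete, here is a minimal numpy sketch (our own illustration, not the solver used in the paper) that minimizes the equivalent unconstrained objective (1/2)||w||^2 + C \u03a3_i max(0, 1 \u2212 yi(w\u22a4xi + b)) by subgradient descent; the function name svm_subgradient, the toy dataset, the step size and the iteration count are all assumptions of this sketch.

```python
import numpy as np

# Minimal subgradient sketch of the SVM primal (1): the constrained form with
# slack variables is equivalent to minimizing
#   0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b)).
# Toy data, learning rate and iteration count are illustrative assumptions.
def svm_subgradient(X, y, C=1.0, lr=0.01, iters=2000):
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(iters):
        viol = 1 - y * (X @ w + b) > 0            # examples with active hinge loss
        gw = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        gb = -C * y[viol].sum()
        w, b = w - lr * gw, b - lr * gb
    return w, b

# Linearly separable toy problem: two parallel point clouds.
X = np.array([[t, 1.0] for t in np.linspace(-1, 1, 20)]
             + [[t, -1.0] for t in np.linspace(-1, 1, 20)])
y = np.array([1.0] * 20 + [-1.0] * 20)
w, b = svm_subgradient(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```

On separable data such as this, the hinge terms (the slack variables of (1)) are driven to zero and the returned hyperplane separates the two classes.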
In practice, the following dual of the formulation (1) is solved:\n\nmax_{0\u2264\u03b1\u2264C1, \u03b1\u22a4y=0} \u03a3_{i=1}^n \u03b1i \u2212 (1/2) \u03a3_{i=1}^n \u03a3_{j=1}^n \u03b1i\u03b1j yiyj xi\u22a4xj. (2)\n\nIt is easy to see that the above formulation (2) is rotation invariant; if all the xi are replaced by Axi where A \u2208 Rm\u00d7m, A\u22a4A = I, then the solution remains the same. However, the solution is not guaranteed to be the same when A is not a rotation matrix. In addition, the solution is sensitive to translations as well.\n\nTypically, the dot product between the examples is replaced by a kernel function k : Rm \u00d7 Rm \u2192 R such that k(xi, xj) = \u03c6(xi)\u22a4\u03c6(xj), where \u03c6 : Rm \u2192 H is a mapping to a Hilbert space, to obtain non-linear decision boundaries in the input space. Thus, in (2), xi\u22a4xj is replaced by k(xi, xj) to obtain non-linear solutions. In the rest of this paper, we denote by K \u2208 Rn\u00d7n the Gram matrix, whose individual entries are given by Kij = k(xi, xj).\n\nNext, we consider the formulation which corresponds to whitening the data with the covariance matrix. Denote by \u03a3 = (1/n) \u03a3_{i=1}^n xixi\u22a4 \u2212 (1/n^2) \u03a3_{i=1}^n xi \u03a3_{j=1}^n xj\u22a4 and \u00b5 = (1/n) \u03a3_{i=1}^n xi the sample covariance and mean respectively. Consider the following formulation, which we call \u03a3-SVM:\n\nmin_{w,b,\u03be\u22650} ((1\u2212D)/2)||w||^2 + (D/2)||\u03a3^{1/2}w||^2 + C\u03be\u22a41 s.t. yi(w\u22a4(xi \u2212 \u00b5) + b) \u2265 1 \u2212 \u03bei, (3)\n\nwhere 0 \u2264 D \u2264 1 is an additional parameter that trades off between the two regularization terms.\n\nThe dual of (3) can be shown to be:\n\nmax_{0\u2264\u03b1\u2264C1, y\u22a4\u03b1=0} \u03a3_{i=1}^n \u03b1i \u2212 (1/2) \u03a3_{i=1}^n \u03b1iyi(xi \u2212 \u00b5)\u22a4 ((1\u2212D)I + D\u03a3)^{\u22121} \u03a3_{j=1}^n \u03b1jyj(xj \u2212 \u00b5). (4)\n\n2 After this formulation, we stop explicitly writing \u22001 \u2264 i \u2264 n since it will be obvious from the context.\n\nIt is easy to see that the above formulation (4) is translation invariant and tends to an affine invariant solution when D tends to one. When 0 < D < 1, it can be shown, by using the Woodbury matrix inversion formula, that the above formulation can be \u201ckernelized\u201d simply by replacing the dot products xi\u22a4xj in (2) by:\n\n(1/(1\u2212D)) ( k(xi, xj) \u2212 Ki\u22a41/n \u2212 Kj\u22a41/n + 1\u22a4K1/n^2 ) \u2212 (1/(1\u2212D)) (Ki \u2212 K1/n)\u22a4 (I/n \u2212 11\u22a4/n^2) [ ((1\u2212D)/D) I + K(I/n \u2212 11\u22a4/n^2) ]^{\u22121} (Kj \u2212 K1/n),\n\nwhere Ki is the ith column of K. For D = 0 and D = 1, it is much easier to obtain the kernelized formulations. Note that the above formula involves a matrix inversion of size n, making the kernel computation alone O(n^3).\n\n3.1 RMM and its geometrical interpretation\n\nFrom Section 2, it is clear that a large margin in the absolute sense might be deceptive and could merely be a by-product of bad scaling of the data. 
To overcome this limitation, as we pointed out earlier, we need to bound the projections of the training examples as well. As in the two dimensional example, it is necessary to trade off between the margin and the spread of the data. We propose a slightly modified formulation in the next section that can be solved efficiently. For now, we write the following formulation, mainly to show how it compares with the \u03a3-SVM. In addition, writing the dual of the following formulation gives some geometric intuition. Since we trade off between the projections and the margin, implicitly, we find a large relative margin. Thus we call the following formulation the Relative Margin Machine (RMM):\n\nmin_{w,b,\u03be\u22650} (1/2)||w||^2 + C\u03be\u22a41 s.t. yi(w\u22a4xi + b) \u2265 1 \u2212 \u03bei, (1/2)(w\u22a4xi + b)^2 \u2264 B^2/2. (5)\n\nThis is a quadratically constrained quadratic problem (QCQP). This formulation has one extra parameter B in addition to the SVM parameter. Note that B \u2265 1 since having a B less than one would mean none of the examples would satisfy yi(w\u22a4xi + b) \u2265 1. Let wC and bC be the solutions obtained by solving the SVM (1) for a particular value of C; then B > max_i |wC\u22a4xi + bC| makes the constraint on the second line in the formulation (5) inactive for each i and the solution obtained is the same as the SVM estimate.\n\nFor smaller B values, we start getting different solutions. Specifically, with a smaller B, we still find a large margin solution such that all the projections of the training examples are bounded by B. Thus by trying out different B values, we explore different large margin solutions with respect to the projection and spread of the data.\n\nIn the following, we assume that the value of B is smaller than the threshold mentioned above. 
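For intuition about what (5) asks for, the following toy sketch (our own illustration; the paper's actual solver is the SVMlight-style dual method of Section 3.2) approximately solves it by combining the hinge loss with a quadratic penalty on projections exceeding B and running subgradient descent. The function name rmm_penalty, the penalty weight P, the step size and the toy data are all assumptions of this sketch.

```python
import numpy as np

# Illustrative solver for the RMM formulation (5): hinge loss for the margin
# constraints plus a quadratic penalty pushing |w.x + b| below B.
# P, lr, iters and the toy data are assumptions, not values from the paper.
def rmm_penalty(X, y, C=1.0, B=1.5, P=5.0, lr=0.01, iters=3000):
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(iters):
        p = X @ w + b
        gw, gb = w.copy(), 0.0                     # gradient of 0.5*||w||^2
        viol = 1 - y * p > 0                       # hinge-active examples
        gw -= C * (y[viol, None] * X[viol]).sum(axis=0)
        gb -= C * y[viol].sum()
        out = np.abs(p) > B                        # projections beyond B
        coef = 2 * P * (np.abs(p[out]) - B) * np.sign(p[out])
        gw += (coef[:, None] * X[out]).sum(axis=0)
        gb += coef.sum()
        w, b = w - lr * gw, b - lr * gb
    return w, b

# Data spread widely along x but separable along y (cf. Figure 1).
X = np.array([[10 * t, 1.0] for t in np.linspace(-1, 1, 20)]
             + [[10 * t, -1.0] for t in np.linspace(-1, 1, 20)])
y = np.array([1.0] * 20 + [-1.0] * 20)
w, b = rmm_penalty(X, y)
acc = np.mean(np.sign(X @ w + b) == y)             # separates along the small-spread axis
```

On this easy toy problem the spread bound B is loose, so the penalty rarely activates; the point is that the discriminating direction is the small-spread axis, exactly the kind of solution (5) is designed to keep reachable.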
The Lagrangian of (5) is given by:\n\nL(w, b, \u03be, \u03b1, \u03b2, \u03bb) = (1/2)||w||^2 + C\u03be\u22a41 \u2212 \u03a3_{i=1}^n \u03b1i ( yi(w\u22a4xi + b) \u2212 1 + \u03bei ) \u2212 \u03b2\u22a4\u03be + \u03a3_{i=1}^n \u03bbi ( (1/2)(w\u22a4xi + b)^2 \u2212 B^2/2 ),\n\nwhere \u03b1, \u03b2, \u03bb \u2265 0 are the Lagrange multipliers corresponding to the constraints. Differentiating with respect to the primal variables and equating them to zero, it can be shown that:\n\n( I + \u03a3_{i=1}^n \u03bbixixi\u22a4 ) w = \u03a3_{i=1}^n \u03b1iyixi \u2212 b \u03a3_{i=1}^n \u03bbixi, b = (1/\u03bb\u22a41) ( \u03a3_{i=1}^n \u03b1iyi \u2212 \u03a3_{i=1}^n \u03bbiw\u22a4xi ), C1 = \u03b1 + \u03b2.\n\nDenoting by \u03a3\u03bb = \u03a3_{i=1}^n \u03bbixixi\u22a4 \u2212 (1/\u03bb\u22a41) \u03a3_{i=1}^n \u03bbixi \u03a3_{j=1}^n \u03bbjxj\u22a4, and by \u00b5\u03bb = (1/\u03bb\u22a41) \u03a3_{j=1}^n \u03bbjxj, the dual of (5) can be shown to be:\n\nmax_{0\u2264\u03b1\u2264C1, \u03bb\u22650} \u03a3_{i=1}^n \u03b1i \u2212 (1/2) \u03a3_{i=1}^n \u03b1iyi(xi \u2212 \u00b5\u03bb)\u22a4 (I + \u03a3\u03bb)^{\u22121} \u03a3_{j=1}^n \u03b1jyj(xj \u2212 \u00b5\u03bb) \u2212 (1/2) B^2 \u03bb\u22a41. (6)\n\nNote that the above formulation is translation invariant since \u00b5\u03bb is subtracted from each xi. \u03a3\u03bb corresponds to a \u201cshape matrix\u201d (potentially low rank) determined by the xi\u2019s that have non-zero \u03bbi. From the KKT conditions of (5), \u03bbi ( (1/2)(w\u22a4xi + b)^2 \u2212 B^2/2 ) = 0. Consequently, \u03bbi > 0 implies (1/2)(w\u22a4xi + b)^2 = B^2/2.\n\nGeometrically, in the above formulation (6), the data is whitened with the matrix (I + \u03a3\u03bb) while solving the SVM. While this is similar to what is done by the \u03a3-SVM, the matrix (I + \u03a3\u03bb) is determined jointly considering both the margin of the data and the spread. 
In contrast, in \u03a3-SVM, whitening is simply a preprocessing step which can be done independently of the margin. Note that the constraint (1/2)(w\u22a4xi + b)^2 \u2264 B^2/2 can be relaxed with slack variables at the expense of one additional parameter; however, this will not be investigated in this paper.\n\nThe proposed formulation is of limited use unless it can be solved efficiently. Solving (6) amounts to solving a semi-definite program; it cannot scale beyond a few hundred data points. Thus, for an efficient solution, we consider a different but equivalent formulation.\n\nNote that the constraint (1/2)(w\u22a4xi + b)^2 \u2264 B^2/2 can be equivalently posed as two linear constraints: (w\u22a4xi + b) \u2264 B and \u2212(w\u22a4xi + b) \u2264 B. With these constraints replacing the quadratic constraint, we have a quadratic program to solve. In the primal, we have 4n constraints (including \u03be \u2265 0) instead of the 2n constraints in the SVM. Thus, solving RMM as a standard QP has the same order of complexity as the SVM. In the next section, we briefly explain how the RMM can be solved efficiently from the dual.\n\n3.2 Fast algorithm\n\nThe main idea for the fast algorithm is to have linear constraints bounding the projections rather than quadratic constraints. The fast algorithm that we developed is based on SVMlight [5]. We first write the equivalent of (5) with linear constraints:\n\nmin_{w,b,\u03be\u22650} (1/2)||w||^2 + C\u03be\u22a41 s.t. yi(w\u22a4xi + b) \u2265 1 \u2212 \u03bei, w\u22a4xi + b \u2264 B, \u2212w\u22a4xi \u2212 b \u2264 B. (7)\n\nThe dual of (7) can be shown to be the following:\n\nmax_{\u03b1,\u03bb,\u03bb\u2217} \u2212(1/2) (\u03b1 \u2297 y \u2212 \u03bb + \u03bb\u2217)\u22a4 K (\u03b1 \u2297 y \u2212 \u03bb + \u03bb\u2217) + \u03b1\u22a41 \u2212 B\u03bb\u22a41 \u2212 B\u03bb\u2217\u22a41 s.t. \u03b1\u22a4y \u2212 \u03bb\u22a41 + \u03bb\u2217\u22a41 = 0, 0 \u2264 \u03b1 \u2264 C1, \u03bb, \u03bb\u2217 \u2265 0, (8)\n\nwhere the operator \u2297 denotes the element-wise product of two vectors.\n\nThe above QP (8) is solved in an iterative way. In each step, only a subset of the dual variables is optimized. Let us say q, r and s (\u02dcq, \u02dcr and \u02dcs) are the indices to the free (fixed) variables in \u03b1, \u03bb and \u03bb\u2217 respectively (such that q \u222a \u02dcq = {1, 2, \u00b7\u00b7\u00b7, n} and q \u2229 \u02dcq = \u2205, and similarly for the other two indices) in a particular iteration. Then the optimization over the free variables in that step can be expressed as:\n\nmax_{\u03b1q,\u03bbr,\u03bb\u2217s} \u2212(1/2) [\u03b1q \u2297 yq; \u03bbr; \u03bb\u2217s]\u22a4 [Kqq, \u2212Kqr, Kqs; \u2212Krq, Krr, \u2212Krs; Ksq, \u2212Ksr, Kss] [\u03b1q \u2297 yq; \u03bbr; \u03bb\u2217s] \u2212 [\u03b1q \u2297 yq; \u03bbr; \u03bb\u2217s]\u22a4 [Kq\u02dcq, \u2212Kq\u02dcr, Kq\u02dcs; \u2212Kr\u02dcq, Kr\u02dcr, \u2212Kr\u02dcs; Ks\u02dcq, \u2212Ks\u02dcr, Ks\u02dcs] [\u03b1\u02dcq \u2297 y\u02dcq; \u03bb\u02dcr; \u03bb\u2217\u02dcs] + \u03b1q\u22a41 \u2212 B\u03bbr\u22a41 \u2212 B\u03bb\u2217s\u22a41 s.t. \u03b1q\u22a4yq \u2212 \u03bbr\u22a41 + \u03bb\u2217s\u22a41 = \u2212\u03b1\u02dcq\u22a4y\u02dcq + \u03bb\u02dcr\u22a41 \u2212 \u03bb\u2217\u02dcs\u22a41, 0 \u2264 \u03b1q \u2264 C1, \u03bbr, \u03bb\u2217s \u2265 0. (9)\n\nNote that while the first term in the objective above is quadratic in the free variables (over which it is optimized), the second term is only linear.\n\nThe algorithm solves a small sub-problem like (9) in each step until the KKT conditions of the formulation (8) are satisfied to a given tolerance. 
In each step, the free variables are selected using heuristics similar to those in SVMlight but slightly adapted to our formulation. We omit the details due to lack of space. Since only a small subset of the variables is optimized, book-keeping can be done efficiently in each step. Moreover, the algorithm can be warm-started with a previous solution just like SVMlight.\n\n4 Experiments\n\nExperiments were carried out on three sets of digits: optical digits from the UCI machine learning repository [1], USPS digits [6] and MNIST digits [7]. These datasets have different numbers of features (64 in optical digits, 256 in USPS and 784 in MNIST) and training examples (3823 in optical digits, 7291 in USPS and 60000 in MNIST). In all these multi-class experiments, a one-versus-one classification strategy was used. We start by noting that, on the MNIST test set, an improvement of 0.1% is statistically significant [3, 4]. This corresponds to 10 or fewer errors by one method over another on the MNIST test set.\n\nAll the parameters were tuned by splitting the training data in each case in the ratio 80:20 and using the smaller split for validation and the larger split for training. The process was repeated five times over random splits to pick the best parameters (C for SVM, C and D for \u03a3-SVM and C and B for RMM). A final classifier was trained for each of the 45 classification problems with the best parameters found from cross validation using all the training examples in those classes.\n\nIn the case of MNIST digits, training \u03a3-SVM and KLDA is prohibitive since both involve inverting a large matrix. So, to compare all the methods, we conducted an experiment with 1000 training examples. 
For the larger experiments we simply excluded \u03a3-SVM and KLDA. The larger experiment on MNIST consisted of training with two thirds of the digits (note that this amounts to training with 8000 examples on average for each pair of digits) for each binary classification task. In both experiments, the remaining training data was used as a validation set. The classifier that performed the best on the validation set was used for testing.\n\nOnce we had 45 classifiers for each pair of digits, testing was done on the separate test set available in each of these three datasets (1797 examples in the case of optical digits, 2007 examples in USPS and 10000 examples in MNIST). The final prediction given for each test example was based on the majority of predictions made by the 45 classifiers on the test example, with ties broken uniformly at random.\n\nTable 1 shows the results on all three datasets for polynomial kernels with various degrees and the RBF kernel. For each dataset, we report the number of misclassified examples using the majority voting scheme mentioned above. It can be seen that while \u03a3-SVM usually performs much better compared to SVM, RMM performs even better than \u03a3-SVM in most cases. Interestingly, with higher degree kernels, \u03a3-SVM seems to match the performance of the RMM, but with most of the lower degree kernels, RMM outperforms both SVM and \u03a3-SVM convincingly. Since \u03a3-SVM is prohibitive to run on large scale datasets, the RMM was clearly the most competitive method in these experiments.\n\nTraining with entire MNIST We used the best parameters found by cross-validation in the previous experiments on MNIST and trained 45 classifiers for both SVM and RMM with all the training examples for each class in MNIST for various kernels. 
The test results are reported in Table 1; the advantage still carries over to the full MNIST dataset.\n\nFigure 2: Log run time versus log number of examples from 1000 to 10000 in steps of 1000. (Curves shown: SVM, RMM B1, RMM B2, RMM B3.)\n\nDataset | Method | 1 | 2 | 3 | 4 | 5 | 6 | 7 | RBF\nOPT | SVM | 71 | 57 | 54 | 47 | 40 | 46 | 46 | 51\nOPT | \u03a3-SVM | 61 | 48 | 41 | 36 | 35 | 31 | 29 | 47\nOPT | KLDA | 71 | 57 | 54 | 47 | 40 | 46 | 46 | 45\nOPT | RMM | 71 | 36 | 32 | 31 | 33 | 30 | 29 | 51\nUSPS | SVM | 145 | 109 | 109 | 103 | 100 | 95 | 93 | 104\nUSPS | \u03a3-SVM | 132 | 108 | 99 | 94 | 89 | 87 | 90 | 97\nUSPS | KLDA | 132 | 119 | 121 | 117 | 114 | 118 | 117 | 101\nUSPS | RMM | 153 | 109 | 94 | 91 | 91 | 90 | 90 | 98\n1000-MNIST | SVM | 696 | 511 | 422 | 380 | 362 | 338 | 332 | 670\n1000-MNIST | \u03a3-SVM | 671 | 470 | 373 | 341 | 322 | 309 | 303 | 673\n1000-MNIST | KLDA | 1663 | 848 | 591 | 481 | 430 | 419 | 405 | 1597\n1000-MNIST | RMM | 689 | 342 | 319 | 301 | 298 | 290 | 296 | 613\n2/3-MNIST | SVM | 552 | 237 | 200 | 183 | 178 | 177 | 164 | 166\n2/3-MNIST | RMM | 534 | 164 | 148 | 140 | 123 | 129 | 129 | 144\nFull MNIST | SVM | 536 | 198 | 170 | 156 | 157 | 141 | 136 | 146\nFull MNIST | RMM | 521 | 146 | 140 | 130 | 119 | 116 | 115 | 129\n\nTable 1: Number of digits misclassified with various kernels by SVM, \u03a3-SVM and RMM for three different datasets.\n\nRun time comparison We studied the empirical run times using the MNIST digits 3 vs 8 and a polynomial kernel with degree two. The tolerance was set to 0.001 in both cases. The size of the sub-problem (9) solved was 500 in all cases. The number of training examples was increased in steps of 1000 and the training time was noted. The C value was set at 1000. SVM was first run on the training examples. The value of the maximum absolute prediction \u03b8 was noted. We then tried three different values of B for RMM: B1 = 1 + (\u03b8\u22121)/2, B2 = 1 + (\u03b8\u22121)/4, B3 = 1 + (\u03b8\u22121)/10. In all cases, the run time was noted. 
We show\na log-log plot comparing the number of examples to the run time in Figure 2. Both SVM\nand RMM have similar asymptotic behavior. However, in many cases, warm starting RMM\nwith previous solution signi\ufb01cantly helped in reducing the run times.\n\n5 Conclusions\n\nWe identi\ufb01ed a sensitivity of Support Vector Machines and maximum absolute margin cri-\nteria to a\ufb03ne scalings. These classi\ufb01ers are biased towards producing decision boundaries\nthat separate data along directions with large data spread. The Relative Margin Machine\nwas proposed to overcome such a problem and optimizes the projection direction such that\nthe margin is large only relative to the spread of the data. By deriving the dual with\nquadratic constraints, a geometrical interpretation was also formulated for RMMs. An im-\nplementation for RMMs requiring only additional linear constraints in the SVM quadratic\nprogram leads to a competitively fast implementation. Experiments showed that while a\ufb03ne\ntransformations can improve over the SVMs, RMM performs even better in practice.\n\nThe maximization of relative margin is fairly promising as it is compatible with other popular\nproblems handled by the SVM framework such as ordinal regression, structured prediction\netc. These are valuable future extensions for the RMM. Furthermore, the constraints that\nbound the projection are unsupervised; thus RMMs can readily work in semi-supervised\nand transduction problems. We will study these extensions in detail in an extended version\nof this paper.\n\nReferences\n\n[1] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.\n\n[2] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and\n\nstructural results. Journal of Machine Learning Research, 3:463\u2013482, 2002.\n\n[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. 
Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, pages 153\u2013160. MIT Press, Cambridge, MA, 2007.\n\n[4] D. Decoste and B. Sch\u00f6lkopf. Training invariant support vector machines. Machine Learning, pages 161\u2013190, 2002.\n\n[5] T. Joachims. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA, 1998.\n\n[6] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L. Jackel. Back-propagation applied to handwritten zip code recognition. Neural Computation, 1:541\u2013551, 1989.\n\n[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[8] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.\n\n[9] P. K. Shivaswamy and T. Jebara. Ellipsoidal kernel machines. In Proceedings of the Artificial Intelligence and Statistics, 2007.\n\n[10] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.\n\n[11] J. Weston, R. Collobert, F. H. Sinz, L. Bottou, and V. Vapnik. Inference with the universum. In Proceedings of the International Conference on Machine Learning, pages 1009\u20131016, 2006.\n\n[12] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik. Feature selection for SVMs. In Neural Information Processing Systems, pages 668\u2013674, 2000.\n\nA Generalization Bound\n\nIn this section, we give the empirical Rademacher complexity [2, 8] for function classes used by the SVM, and modified versions of RMM and \u03a3-SVM, which can be plugged into a generalization bound.\n\nMaximizing the margin can be seen as choosing a function f(x) = w\u22a4x from a bounded class of functions FE := {x \u2192 w\u22a4x | (1/2)||w||^2 \u2264 E}. 
For a technical reason, instead of bounding the projections on the training examples as in (5), we consider bounding the projections on an independent set of examples drawn from Pr(x), that is, a set U = {u1, u2, . . . , u_{nu}}. Note that if we have an iid training set, it can be split into two parts and one part can be used exclusively to bound the projections while the other part can be used exclusively for classification constraints. Since the labels of the examples used to bound the projections do not matter, we denote this set by U and the other part of the set by (xi, yi), i = 1, . . . , n. We now consider the following function class which is closely related to RMM: H_{E,D} := {x \u2192 w\u22a4x | (1/2)w\u22a4w + (D/2)(w\u22a4ui)^2 \u2264 E, \u22001 \u2264 i \u2264 nu}, where D > 0 trades off between a large margin and a small bound on the projections. Similarly, consider G_{E,D} := {x \u2192 w\u22a4x | (1/2)w\u22a4w + (D/2nu) \u03a3_{i=1}^{nu} (w\u22a4ui)^2 \u2264 E}, which is closely related to the class of functions considered by \u03a3-SVM. The empirical Rademacher complexities of the three classes of functions are bounded as below:\n\n\u02c6R(FE) \u2264 U_{FE} := (2\u221a(2E)/n) \u221a( \u03a3_{i=1}^n xi\u22a4xi ),\n\n\u02c6R(G_{E,D}) \u2264 U_{G_{E,D}} := (2\u221a(2E)/n) \u221a( \u03a3_{i=1}^n xi\u22a4 \u03a3_D^{\u22121} xi ),\n\n\u02c6R(H_{E,D}) \u2264 U_{H_{E,D}} := min_{\u03bb\u22650} (1/n) \u03a3_{i=1}^n xi\u22a4 \u03a3_{\u03bb,D}^{\u22121} xi + (2/n) E \u03a3_{i=1}^{nu} \u03bbi,\n\nwhere \u03a3_D = I + (D/nu) \u03a3_{i=1}^{nu} uiui\u22a4 and \u03a3_{\u03bb,D} = \u03a3_{i=1}^{nu} \u03bbi I + D \u03a3_{i=1}^{nu} \u03bbi uiui\u22a4. Note that the last upper bound is not a closed form expression, but a semi-definite optimization. Now, the upper bounds U_{FE}, U_{G_{E,D}} and U_{H_{E,D}} can be plugged in the following theorem in place of \u02c6R(F) to obtain Rademacher type generalization bounds.\n\nTheorem 1 Fix \u03b3 > 0, and let F be the class of functions from Rm \u00d7 {\u00b11} \u2192 R given by f(x, y) = \u2212yg(x). 
Let {(x1, y1), . . . , (xn, yn)} be drawn iid from a probability distribution D. Then, with probability at least 1 \u2212 \u03b4 over the samples of size n, the following bound holds: Pr_D[y \u2260 sign(g(x))] \u2264 \u03be\u22a41/n + 2\u02c6R(F)/\u03b3 + 3\u221a(ln(2/\u03b4)/(2n)), where \u03bei = max(0, 1 \u2212 yig(xi)) are the so-called slack variables.", "award": [], "sourceid": 839, "authors": [{"given_name": "Tony", "family_name": "Jebara", "institution": null}, {"given_name": "Pannagadatta", "family_name": "Shivaswamy", "institution": null}]}