{"title": "Ranking with Large Margin Principle: Two Approaches", "book": "Advances in Neural Information Processing Systems", "page_first": 961, "page_last": 968, "abstract": "", "full_text": "Ranking with Large Margin Principle: Two Approaches*\n\nAmnon Shashua\nSchool of CS&E\nHebrew University of Jerusalem\nJerusalem 91904, Israel\nemail: shashua@cs.huji.ac.il\n\nAnat Levin\nSchool of CS&E\nHebrew University of Jerusalem\nJerusalem 91904, Israel\nemail: alevin@cs.huji.ac.il\n\nAbstract\n\nWe discuss the problem of ranking k instances with the use of a \"large margin\" principle. We introduce two main approaches: the first is the \"fixed margin\" policy in which the margin of the closest neighboring classes is being maximized, which turns out to be a direct generalization of SVM to ranking learning. The second approach allows for k - 1 different margins where the sum of margins is maximized. This approach is shown to reduce to $\\nu$-SVM when the number of classes k = 2. Both approaches are optimal in size of $2l$ where $l$ is the total number of training examples. Experiments performed on visual classification and \"collaborative filtering\" show that both approaches outperform existing ordinal regression algorithms applied for ranking and multi-class SVM applied to general multi-class classification.\n\n1 Introduction\n\nIn this paper we investigate the problem of inductive learning from the point of view of predicting variables of ordinal scale [3, 7, 5], a setting referred to as ranking learning or ordinal regression. We consider the problem of applying the large margin principle used in Support Vector methods [12, 1] to the ordinal regression problem while maintaining an (optimal) problem size linear in the number of training examples.\n\nLet $x_i^j$ be the set of training examples where $j = 1, \\ldots, k$ denotes the class number, and $i = 1, \\ldots, i_j$ is the index within each class. 
Let $l = \\sum_j i_j$ be the total number of training examples. A straightforward generalization of the 2-class separating hyperplane problem, where a single hyperplane determines the classification rule, is to define k - 1 separating hyperplanes which would separate the training data into k ordered classes by modeling the ranks as intervals on the real line, an idea whose origins are with the classical cumulative model [9], see also [7, 5]. The geometric interpretation of this approach is to look for k - 1 parallel hyperplanes represented by vector $w \\in R^n$ (n being the dimension of the input vectors) and scalars $b_1 \\le \\ldots \\le b_{k-1}$ defining the hyperplanes $(w, b_1), \\ldots, (w, b_{k-1})$, such that the data are separated by dividing the space into equally ranked regions by the decision rule\n\n$$f(x) = \\min_{r \\in \\{1, \\ldots, k\\}} \\{ r : w \\cdot x - b_r < 0 \\}. \\qquad (1)$$\n\nIn other words, all input vectors x satisfying $b_{r-1} < w \\cdot x < b_r$ are assigned the rank r (using the convention that $b_k = \\infty$). For instance, recently [5] proposed an \"on-line\" algorithm (with similar principles to the classic \"perceptron\" used for 2-class separation) for finding the set of parallel hyperplanes which would comply with the separation rule above.\n\n*This work was done while A.S. was spending his sabbatical at the computer science department of Stanford University.\n\nFigure 1: Lefthand display (\"Fixed-margin\"): fixed-margin policy for ranking learning. The margin to be maximized is associated with the two closest neighboring classes. As in conventional SVM, the margin is pre-scaled to be equal to $2/|w|$ thus maximizing the margin is achieved by minimizing $w \\cdot w$. The support vectors lie on the boundaries between the two closest classes. Righthand display (\"Sum-of-margins\"): sum-of-margins policy for ranking learning. The objective is to maximize the sum of k - 1 margins. Each class is sandwiched between two hyperplanes, the norm of w is set to unity as a constraint in the optimization problem and as a result the objective is to maximize $\\sum_j (b_j - a_j)$. In this case, the support vectors lie on the boundaries among all neighboring classes (unlike the fixed-margin policy). When the number of classes k = 2, the dual functional is equivalent to $\\nu$-SVM.\n\nTo continue the analogy to 2-class learning, in addition to the separability constraints on the variables $\\alpha = \\{w, b_1 \\le \\ldots \\le b_{k-1}\\}$ one would like to control the tradeoff between lowering the \"empirical risk\" $R_{emp}(\\alpha)$ (error measure on the training set) and lowering the \"confidence interval\" $\\Phi(\\alpha, h)$ controlled by the VC-dimension h of the set of loss functions. The \"structural risk minimization\" (SRM) principle [12] minimizes a bound on the risk over a structure on the set of functions. The geometric interpretation for 2-class learning is to maximize the margin between the boundaries of the two sets [12, 1].\n\nIn our setting of ranking learning, there are k - 1 margins to consider, thus there are two possible approaches to take on the \"large margin\" principle for ranking learning:\n\n\"fixed margin\" strategy: the margin to be maximized is the one defined by the closest (neighboring) pair of classes. Formally, let $(w, b_q)$ be the hyperplane separating the two classes which are the closest among all the neighboring pairs of classes. Let $w, b_q$ be scaled such that the distance of the boundary points from the hyperplane is 1, i.e., the margin between the classes q, q + 1 is $2/|w|$ (see Fig. 1, lefthand display). Thus, the fixed margin policy for ranking learning is to find the direction w and the scalars $b_1, \\ldots, b_{k-1}$ such that $w \\cdot w$ is minimized (i.e., the margin between classes q, q + 1 is maximized) subject to the separability constraints (modulo margin errors in the non-separable case). 
\n\n\"sum of margins\" strategy: the sum of all k - 1 margins is to be maximized. In this case, the margins are not necessarily equal (see Fig. 1, righthand display). Formally, the ranking rule employs a vector $w$, $|w| = 1$, and a set of 2(k - 1) thresholds $a_1 \\le b_1 \\le a_2 \\le b_2 \\le \\ldots \\le a_{k-1} \\le b_{k-1}$ such that $w \\cdot x_i^j \\le a_j$ and $w \\cdot x_i^{j+1} \\ge b_j$ for $j = 1, \\ldots, k - 1$. In other words, all the examples of class $1 \\le j \\le k$ are \"sandwiched\" between two parallel hyperplanes $(w, a_j)$ and $(w, b_{j-1})$, where $b_0 = -\\infty$ and $a_k = \\infty$. The k - 1 margins are therefore $(b_j - a_j)$ and the large margin principle is to maximize $\\sum_j (b_j - a_j)$ subject to the separability constraints above.\n\nIt is also fairly straightforward to apply the SRM principle and derive the bounds on the actual risk functional; see [11] for details.\n\nIn the remainder of this paper we will introduce the algorithmic implications of these two strategies for implementing the large margin principle for ranking learning. The fixed-margin principle will turn out to be a direct generalization of the Support Vector Machine (SVM) algorithm, in the sense that substituting k = 2 in our proposed algorithm would produce the dual functional underlying conventional SVM. It is interesting to note that the sum-of-margins principle reduces to $\\nu$-SVM (introduced by [10] and later [2]) when k = 2.\n\n2 Fixed Margin Strategy\n\nRecall that in the fixed margin policy $(w, b_q)$ is a \"canonical\" hyperplane normalized such that the margin between the closest classes q, q + 1 is $2/\\|w\\|$. The index q is of course unknown. The unknown variables $w, b_1 \\le \\ldots \\le b_{k-1}$ (and the index q) could be solved in a two-stage optimization problem: a Quadratic Linear Programming (QLP) formulation followed by a Linear Programming (LP) formulation. 
\n\nThe (primal) QLP formulation of the (\"soft margin\") fixed-margin policy for ranking learning takes the form:\n\n$$\\min \\; \\frac{1}{2} w \\cdot w + C \\sum_j \\sum_i \\left( \\epsilon_i^j + \\epsilon_i^{*j+1} \\right) \\qquad (2)$$\n\nsubject to\n\n$$w \\cdot x_i^j - b_j \\le -1 + \\epsilon_i^j, \\qquad (3)$$\n$$w \\cdot x_i^{j+1} - b_j \\ge 1 - \\epsilon_i^{*j+1}, \\qquad (4)$$\n$$\\epsilon_i^j \\ge 0, \\quad \\epsilon_i^{*j+1} \\ge 0, \\qquad (5)$$\n\nwhere $j = 1, \\ldots, k - 1$ and $i = 1, \\ldots, i_j$, and C is some predefined constant. The scalars $\\epsilon_i^j$ and $\\epsilon_i^{*j+1}$ are the slack variables of the soft-margin formulation [...]\n\nMinimizing the Lagrangian with respect to $w, b_j, \\epsilon_i^j, \\epsilon_i^{*j+1}$ we obtain the dual optimization function which then must be maximized with respect to the Lagrange multipliers ($\\lambda_i^j$ for the constraints (3) and $\\delta_i^j$ for the constraints (4)). From the minimization of the Lagrangian with respect to w we obtain:\n\n$$w = -\\sum_{i,j} \\lambda_i^j x_i^j + \\sum_{i,j} \\delta_i^j x_i^{j+1}. \\qquad (6)$$\n\nThat is, the direction w of the parallel hyperplanes is described by a linear combination of the support vectors x associated with the non-vanishing Lagrange multipliers. From the Kuhn-Tucker theorem the support vectors are those vectors for which equality is achieved in the inequalities (3, 4). These vectors lie on the two boundaries between the adjacent classes q, q + 1 (and other adjacent classes which have the same margin). From the minimization of the Lagrangian with respect to $b_j$ we obtain the constraint:\n\n$$\\sum_i \\lambda_i^j = \\sum_i \\delta_i^j, \\quad j = 1, \\ldots, k - 1, \\qquad (7)$$\n\nand the minimization with respect to $\\epsilon_i^j$ and $\\epsilon_i^{*j+1}$ yields the constraints\n\n$$0 \\le \\lambda_i^j \\le C, \\quad 0 \\le \\delta_i^j \\le C, \\qquad (8)$$\n\nwhere $\\lambda_i^j = C$ when the corresponding data point is a margin error ($\\epsilon_i^j \\neq 0$), and likewise for $\\delta_i^j$. Note that a data point can count twice as a margin error: once with respect to the class on its \"left\" and once with respect to the class on its \"right\".\n\nFor the sake of presenting the dual functional in a compact form, we will introduce some new notations. Let $X^j$ be the $n \\times i_j$ matrix whose columns are the data points $x_i^j$, $i = 1, \\ldots, i_j$. Let $\\lambda^j = (\\lambda_1^j, \\ldots, \\lambda_{i_j}^j)^T$ be the vector whose components are the Lagrange multipliers $\\lambda_i^j$ corresponding to class j. Likewise, let $\\delta^j = (\\delta_1^j, \\ldots, \\delta_{i_{j+1}}^j)^T$ be the Lagrange multipliers $\\delta_i^j$ 
corresponding to class j + 1. Let $\\mu = (\\lambda^1, \\ldots, \\lambda^{k-1}, \\delta^1, \\ldots, \\delta^{k-1})^T$ be the vector holding all the $\\lambda_i^j$ and $\\delta_i^j$ Lagrange multipliers, and let $\\mu^1 = (\\mu_1^1, \\ldots, \\mu_{k-1}^1)^T = (\\lambda^1, \\ldots, \\lambda^{k-1})^T$ and $\\mu^2 = (\\mu_1^2, \\ldots, \\mu_{k-1}^2)^T = (\\delta^1, \\ldots, \\delta^{k-1})^T$ be the first and second halves of $\\mu$. Note that $\\mu_j^1 = \\lambda^j$ is a vector, and likewise so is $\\mu_j^2 = \\delta^j$. Let $\\mathbf{1}$ be the vector of 1's, and finally, let Q be the matrix holding two copies of the training data:\n\n$$Q = [-X^1, \\ldots, -X^{k-1}, X^2, \\ldots, X^k]_{n \\times N}, \\qquad (9)$$\n\nwhere $N = 2l - i_1 - i_k$. For example, (6) becomes in the new notations $w = Q\\mu$.\n\nBy substituting the expression for $w = Q\\mu$ back into the Lagrangian and taking into account the constraints (7, 8) one obtains the dual functional which should be maximized with respect to the Lagrange multipliers $\\mu_i$:\n\n$$\\max_{\\mu} \\; \\sum_{i=1}^{N} \\mu_i - \\frac{1}{2} \\mu^T Q^T Q \\mu \\qquad (10)$$\n\nsubject to\n\n$$0 \\le \\mu_i \\le C, \\quad i = 1, \\ldots, N, \\qquad (11)$$\n$$\\mathbf{1} \\cdot \\mu_j^1 = \\mathbf{1} \\cdot \\mu_j^2, \\quad j = 1, \\ldots, k - 1. \\qquad (12)$$\n\nNote that when k = 2, i.e., we have only two classes and thus the ranking learning problem is equivalent to the 2-class classification problem, the dual functional reduces and becomes equivalent to the dual form of conventional SVM. In that case $(Q^T Q)_{ij} = y_i y_j x_i \\cdot x_j$ where $y_i, y_j = \\pm 1$ denote the class membership.\n\nAlso worth noting is that since the dual functional is a function of the Lagrange multipliers $\\lambda_i^j$ and $\\delta_i^j$ alone, the problem size (the number of unknown variables) is equal to twice the number of training examples, precisely $N = 2l - i_1 - i_k$ where l is the number of training examples. This favorably compares to the $O(l^2)$ required by the recent SVM approach to ordinal regression introduced in [7] or the $kl$ required by the general multi-class approach to SVM [4, 8].\n\nFurther note that since the entries of $Q^T Q$ are the inner-products of the training examples, they can be represented by the kernel inner-product in the input space dimension rather than by inner-products in the feature space dimension. 
The decision rule, in this case, given a new instance vector x would be the rank r corresponding to the smallest threshold $b_r$ for which\n\n$$\\sum_{\\text{support vectors}} \\delta_i^j K(x_i^{j+1}, x) - \\sum_{\\text{support vectors}} \\lambda_i^j K(x_i^j, x) < b_r,$$\n\nwhere $K(x, y) = \\phi(x) \\cdot \\phi(y)$ replaces the inner-products in the higher-dimensional \"feature\" space $\\phi(x)$.\n\nFinally, from the dual form one can solve for the Lagrange multipliers $\\mu_i$ and in turn obtain $w = Q\\mu$, the direction of the parallel hyperplanes. The scalar $b_q$ (separating the adjacent classes q, q + 1 which are the closest apart) can be obtained from the support vectors, but the remaining scalars $b_j$ cannot. Therefore an additional stage is required which amounts to a Linear Programming problem on the original primal functional (2) but this time w is already known (thus making this a linear problem instead of a quadratic one).\n\n3 Sum-of-Margins Strategy\n\nIn this section we propose an alternative large-margin policy which allows for k - 1 margins where the criterion function maximizes the sum of them. The challenge in formulating the appropriate optimization functional is that one cannot adopt the \"pre-scaling\" of w approach which is at the center of the conventional SVM formulation and of the fixed-margin policy for ranking learning described in the previous section.\n\nThe approach we take is to represent the primal functional using 2(k - 1) parallel hyperplanes instead of k - 1. Each class would be \"sandwiched\" between two hyperplanes (except the first and last classes). Formally, we seek a ranking rule which employs a vector w and a set of 2(k - 1) thresholds $a_1 \\le b_1 \\le a_2 \\le b_2 \\le \\ldots \\le a_{k-1} \\le b_{k-1}$ such that $w \\cdot x_i^j \\le a_j$ and $w \\cdot x_i^{j+1} \\ge b_j$ for $j = 1, \\ldots, k - 1$. In other words, all the examples of class $1 \\le j \\le k$ are \"sandwiched\" between two parallel hyperplanes $(w, a_j)$ and $(w, b_{j-1})$, where $b_0 = -\\infty$ and $a_k = \\infty$. 
\nThe margin between two hyperplanes separating class j and j + 1 is $(b_j - a_j)/\\|w\\|$. Thus, by setting the magnitude of w to be of unit length (as a constraint in the optimization problem), the margin which we would like to maximize is $\\sum_j (b_j - a_j)$ for $j = 1, \\ldots, k - 1$ which we can formulate in the following primal QLP (see also Fig. 1, righthand display):\n\n$$\\min \\; \\sum_{j=1}^{k-1} (a_j - b_j) + C \\sum_j \\sum_i \\left( \\epsilon_i^j + \\epsilon_i^{*j+1} \\right) \\qquad (13)$$\n\nsubject to\n\n$$a_j \\le b_j, \\qquad (14)$$\n$$b_j \\le a_{j+1}, \\quad j = 1, \\ldots, k - 2, \\qquad (15)$$\n$$w \\cdot x_i^j \\le a_j + \\epsilon_i^j, \\quad b_j - \\epsilon_i^{*j+1} \\le w \\cdot x_i^{j+1}, \\qquad (16)$$\n$$w \\cdot w \\le 1, \\quad \\epsilon_i^j \\ge 0, \\quad \\epsilon_i^{*j+1} \\ge 0, \\qquad (17)$$\n\nwhere $j = 1, \\ldots, k - 1$ (unless otherwise specified) and $i = 1, \\ldots, i_j$, and C is some predefined constant (whose physical role will be explained later). Note that the (non-convex) constraint $w \\cdot w = 1$ is replaced by the convex constraint $w \\cdot w \\le 1$ since it can be shown that the optimal solution $w^*$ would have unit magnitude in order to optimize the objective function (see [11] for details). We will proceed to derive the dual functional below.\n\nThe Lagrangian takes the following form:\n\n$L(\\cdot) = \\sum_j (a_j - b_j) + C \\sum_{i,j} (\\epsilon_i^j + \\epsilon_i^{*j+1}) + \\ldots$ [...]\n\n[...] $\\mu_i > 0$ for all vectors on the boundaries between the adjacent pairs of classes and margin errors. In other words, the vectors x associated with non-vanishing $\\mu_i$ are those which lie on the hyperplanes or vectors tagged as margin errors. Therefore, all the thresholds $a_j, b_j$ can be recovered from the support vectors, unlike the fixed-margin scheme which required another LP pass.\n\nThe dual functional (18) is similar to the dual functional (10) but with some crucial differences: (i) the quadratic criterion functional is homogeneous, and (ii) constraints (20) lead to the constraint $\\sum_i \\mu_i \\ge 2$. 
These two differences are also what distinguish between conventional SVM and $\\nu$-SVM for 2-class learning proposed recently by [10]. Indeed, if we set k = 2 in the dual functional (18) we would be able to conclude that the two dual functionals are identical (by a suitable change of variables). Therefore, the role of the constant C complies with the findings of [10] by controlling the tradeoff between the number of margin errors and support vectors and the size of the margins: $2/N \\le C \\le 2$ such that when C = 2 a single margin error is allowed (otherwise a duality gap would occur) and when C = 2/N all vectors are allowed to become margin errors and support vectors (see [11] for a detailed discussion on this point).\n\nIn the general case of k > 2 classes (in the context of ranking learning) the role of the constant C carries the same meaning: $C \\le 2(k - 1)/\\#\\text{m.e.}$ where #m.e. stands for \"total number of margin errors\", thus\n\n$$\\frac{2(k - 1)}{N} \\le C \\le 2(k - 1).$$\n\nSince a data point can count twice for a margin error, the total number of margin errors in the worst case is $N = 2l - i_1 - i_k$ where l is the total number of data points.\n\nFigure 2: The results of the fixed-margin principle plotted against the results of PRank of [5] which does not use a large-margin principle. The average error of PRank is about 1.25 compared to 0.7 with the fixed-margin algorithm.\n\n4 Experiments\n\nDue to lack of space we describe only two sets of experiments we conducted on a \"collaborative filtering\" problem and visual data ranking. More details and further experiments are reported in [11].\n\nIn general, the goal in collaborative filtering is to predict a person's rating on new items such as movies given the person's past ratings on similar items and the ratings of other people of all the items (including the new item). 
The ratings are ordered, such as \"highly recommended\", \"good\", ..., \"very bad\", thus collaborative filtering falls naturally under the domain of ordinal regression (rather than general multi-class learning).\n\nThe \"EachMovie\" dataset [6] contains 1628 movies rated by 72,916 people arranged as a 2D array whose columns represent the movies and the rows represent the users; about 5% of the entries of this array are filled in with ratings between 0, ..., 6 totaling 2,811,983 ratings. Given a new user, the ratings of the user on the 1628 movies (not all movies would be rated) form the $y_i$ and the i'th column of the array forms the $x_i$ which together form the training data (for that particular user). Given a new movie represented by the vector x of ratings of all the other 72,916 users (not all the users rated the new movie), the learning task is to predict the rating f(x) of the new user. Since the array contains empty entries, the ratings were shifted by -3.5 to have the possible ratings {-2.5, -1.5, -0.5, 0.5, 1.5, 2.5} which allows assigning the value of zero to the empty entries of the array (movies which were not rated).\n\nFor the training phase we chose users who ranked about 450 movies and selected a subset {50, 100, ..., 300} of those movies for training and tested the prediction on the remaining movies. We compared our results (collected over 100 runs), measured as the average distance between the correct rating and the predicted rating, to the best \"on-line\" algorithm of [5] called \"PRank\" (there is no use of large margin principle). In their work, PRank was compared to other known on-line approaches and was found to be superior, thus we limited our comparison to PRank alone. 
Attempts to compare our algorithms to other known ranking algorithms which use a large-margin principle ([7], for example) were not successful since those square the training set size which made the experiment with the EachMovie dataset computationally intractable.\n\nThe graph in Fig. 2 shows that the large margin principle makes a significant difference on the results compared to PRank. The results we obtained with PRank are consistent with the reported results of [5] (best average error of about 1.25), whereas our fixed-margin algorithm provided an average error of about 0.7.\n\nWe have applied our algorithms to classification of \"vehicle type\" to one of three classes: \"small\" (passenger cars), \"medium\" (SUVs, minivans) and \"large\" (buses, trucks). There is a natural order Small, Medium, Large since making a mistake between Small and Large is worse than confusing Small and Medium, for example. We compared the classification error (counting the number of misclassifications) to general multi-class learning using pair-wise SVM. The error over a test set of about 14,000 pictures was 20% compared to 25% when using general multi-class SVM. We also compared the error (averaging the difference between the true rank {1, 2, 3} and the predicted rank using 2nd-order kernel) to PRank. The average error was 0.216 compared to 1.408 with PRank. Fig. 3 shows a typical collection of correctly classified and incorrectly classified pictures from the test set.\n\nFigure 3: Classification of vehicle type: Small, Medium and Large (see text for details).\n\nReferences\n\n[1] B.E. Boser, I.M. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In Proc. of the 5th ACM Workshop on Computational Learning Theory, pages 144-152. ACM Press, 1992.\n\n[2] C.C. Chang and C.J. Lin. Training $\\nu$-Support Vector classifiers: Theory and Algorithms. Neural Computation, 14(8), 2002.\n\n[3] W.W. Cohen, R.E. 
Schapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research (JAIR), 10:243-270, 1999.\n\n[4] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.\n\n[5] K. Crammer and Y. Singer. Pranking with ranking. In Proceedings of the conference on Neural Information Processing Systems (NIPS), 2001.\n\n[6] http://www.research.compaq.com/SRC/eachmovie/.\n\n[7] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, 2000. pp. 115-132.\n\n[8] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines. Technical Report 1043, Univ. of Wisconsin, Dept. of Statistics, Sep. 2001.\n\n[9] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman and Hall, London, 2nd edition, 1989.\n\n[10] B. Scholkopf, A. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms. Neural Computation, 12:1207-1245, 2000.\n\n[11] A. Shashua and A. Levin. Taxonomy of Large Margin Principle Algorithms for Ordinal Regression Problems. Technical Report 2002-39, Leibniz Center for Research, School of Computer Science and Eng., the Hebrew University of Jerusalem.\n\n[12] V.N. Vapnik. The nature of statistical learning. Springer, 2nd edition, 1998.\n\n[13] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proc. of the 7th European Symposium on Artificial Neural Networks, April 1999.\n", "award": [], "sourceid": 2269, "authors": [{"given_name": "Amnon", "family_name": "Shashua", "institution": null}, {"given_name": "Anat", "family_name": "Levin", "institution": null}]}