{"title": "Pranking with Ranking", "book": "Advances in Neural Information Processing Systems", "page_first": 641, "page_last": 647, "abstract": null, "full_text": "Pranking with Ranking \n\nKoby Crammer and Yoram Singer \nSchool of Computer Science & Engineering \n\nThe Hebrew University, Jerusalem 91904, Israel \n\n{kobics,singer}@cs.huji.ac.il \n\nAbstract \n\nWe discuss the problem of ranking instances. In our framework \neach instance is associated with a rank or a rating, which is an \ninteger from 1 to k. Our goal is to find a rank-prediction rule that \nassigns each instance a rank which is as close as possible to the \ninstance's true rank. We describe a simple and efficient online al(cid:173)\ngorithm, analyze its performance in the mistake bound model, and \nprove its correctness. We describe two sets of experiments, with \nsynthetic data and with the EachMovie dataset for collaborative \nfiltering. In the experiments we performed, our algorithm outper(cid:173)\nforms online algorithms for regression and classification applied to \nranking. \n\n1 \n\nIntroduction \n\nThe ranking problem we discuss in this paper shares common properties with both \nclassification and regression problems. As in classification problems the goal is to \nassign one of k possible labels to a new instance. Similar to regression problems, \nthe set of k labels is structured as there is a total order relation between the labels. \nWe refer to the labels as ranks and without loss of generality assume that the ranks \nconstitute the set {I, 2, .. . , k} . Settings in which it is natural to rank or rate in(cid:173)\nstances rather than classify are common in tasks such as information retrieval and \ncollaborative filtering. We use the latter as our running example. In collaborative \nfiltering the goal is to predict a user's rating on new items such as books or movies \ngiven the user's past ratings of the similar items. 
The goal is to determine whether a movie fan will like a new movie and to what degree, which is expressed as a rank. An example for possible ratings might be run-to-see, very-good, good, only-if-you-must, and do-not-bother. While the different ratings carry meaningful semantics, from a learning-theoretic point of view we model the ratings as a totally ordered set (whose size is 5 in the example above). \n\nThe interest in ordering or ranking of objects is by no means new and is still the source of ongoing research in many fields such as mathematical economics, social science, and computer science. Due to lack of space we clearly cannot cover previous work related to ranking thoroughly. For a short overview from a learning-theoretic point of view see [1] and the references therein. One of the main results of [1] underscores a complexity gap between classification learning and ranking learning. To sidestep the inherent intractability problems of ranking learning, several approaches have been suggested. One possible approach is to cast a ranking problem as a regression problem. Another approach is to reduce a total order into a set of preferences over pairs [3, 5]. The first approach imposes a metric on the set of ranking rules which might not be realistic, while the second is time consuming since it requires increasing the sample size from n to O(n^2). \n\nFigure 1: An illustration of the update rule. \n\nIn this paper we consider an alternative approach that directly maintains a totally ordered set via projections. Our starting point is similar to that of Herbrich et al. [5] in the sense that we project each instance into the reals. However, our work then deviates and operates directly on rankings by associating each rank with a distinct sub-interval of the reals and adapting the support of each sub-interval while learning. 
In the next section we describe a simple and efficient online algorithm that concurrently manipulates the direction onto which we project the instances and the division into sub-intervals. In Sec. 3 we prove the correctness of the algorithm and analyze its performance in the mistake bound model. In Sec. 4 we describe experiments that compare the algorithm to online algorithms for classification and regression applied to ranking, which demonstrate the merits of our approach. \n\n2 The PRank Algorithm \n\nThis paper focuses on online algorithms for ranking instances. We are given a sequence (x^1, y^1), ..., (x^t, y^t), ... of instance-rank pairs. Each instance x^t is in ℝ^n and its corresponding rank y^t is an element of a finite set Y with a total order relation. We assume without loss of generality that Y = {1, 2, ..., k} with \">\" as the order relation. The total order over the set Y induces a partial order over the instances in the following natural sense. We say that x^t is preferred over x^s if y^t > y^s. We also say that x^t and x^s are not comparable if neither y^t > y^s nor y^t < y^s. We denote this case simply as y^t = y^s. Note that the induced partial order is of a unique form in which the instances form k equivalence classes which are totally ordered^1. A ranking rule H is a mapping from instances to ranks, H : ℝ^n → Y. The family of ranking rules we discuss in this paper employs a vector w ∈ ℝ^n and a set of k thresholds b_1 ≤ ... ≤ b_{k-1} ≤ b_k = ∞. For convenience we denote by b = (b_1, ..., b_{k-1}) the vector of thresholds excluding b_k, which is fixed to ∞. Given a new instance x the ranking rule first computes the inner-product between w and x. The predicted rank is then defined to be the index of the first (smallest) threshold b_r for which w · x < b_r. 
This type of ranking rule divides the space into parallel equally-ranked regions: all the instances that satisfy b_{r-1} < w · x < b_r are assigned the same rank r. Formally, given a ranking rule defined by w and b, the predicted rank of an instance x is H(x) = min_{r ∈ {1,...,k}} {r : w · x - b_r < 0}. Note that the above minimum is always well defined since we set b_k = ∞. \n\nThe analysis that we use in this paper is based on the mistake bound model for online learning. The algorithm we describe works in rounds. On round t the learning algorithm gets an instance x^t. Given x^t, the algorithm outputs a rank, ŷ^t = min_r {r : w · x^t - b_r < 0}. It then receives the correct rank y^t and updates its ranking rule by modifying w and b. We say that our algorithm made a ranking mistake if ŷ^t ≠ y^t. \n\n^1 For a discussion of this type of partial orders see [6]. \n\nInitialize: Set w^1 = 0; b^1_1 = ... = b^1_{k-1} = 0; b_k = ∞. \nLoop: For t = 1, 2, ..., T \n• Get a new instance x^t ∈ ℝ^n. \n• Predict ŷ^t = min_{r ∈ {1,...,k}} {r : w^t · x^t - b^t_r < 0}. \n• Get a new label y^t. \n• If ŷ^t ≠ y^t update w^t (otherwise set w^{t+1} = w^t and b^{t+1}_r = b^t_r for all r): \n  1. For r = 1, ..., k-1: If y^t ≤ r Then y^t_r = -1 Else y^t_r = +1. \n  2. For r = 1, ..., k-1: If (w^t · x^t - b^t_r) y^t_r ≤ 0 Then τ^t_r = y^t_r Else τ^t_r = 0. \n  3. Update w^{t+1} ← w^t + (Σ_r τ^t_r) x^t. For r = 1, ..., k-1 update: b^{t+1}_r ← b^t_r - τ^t_r. \nOutput: H(x) = min_{r ∈ {1,...,k}} {r : w^{T+1} · x - b^{T+1}_r < 0}. \n\nFigure 2: The PRank algorithm. \n\nWe wish to make the predicted rank as close as possible to the true rank. Formally, the goal of the learning algorithm is to minimize the ranking-loss, which is defined to be the number of thresholds between the true rank and the predicted rank. Using the representation of ranks as integers in {1, ..., k}, the ranking-loss after T rounds is equal to the accumulated difference between the predicted and true rank-values, Σ_{t=1}^T |ŷ^t - y^t|. 
The algorithm we describe updates its ranking rule only on rounds on which it made ranking mistakes. Such algorithms are called conservative. \n\nWe now describe the update rule of the algorithm, which is motivated by the perceptron algorithm for classification; hence we call it the PRank algorithm (for Perceptron Ranking). For simplicity, we omit the index of the round when referring to an input instance-rank pair (x, y) and the ranking rule w and b. Since b_1 ≤ b_2 ≤ ... ≤ b_{k-1} ≤ b_k, the predicted rank is correct if w · x > b_r for r = 1, ..., y-1 and w · x < b_r for r = y, ..., k-1. We represent the above inequalities by expanding the rank y into k-1 virtual variables y_1, ..., y_{k-1}. We set y_r = +1 for the case w · x > b_r and y_r = -1 for w · x < b_r. Put another way, a rank value y induces the vector (y_1, ..., y_{k-1}) = (+1, ..., +1, -1, ..., -1), where the maximal index r for which y_r = +1 is y-1. Thus, the prediction of a ranking rule is correct if y_r(w · x - b_r) > 0 for all r. If the algorithm makes a mistake by ranking x as ŷ instead of y then there is at least one threshold, indexed r, for which the value of w · x is on the wrong side of b_r, i.e. y_r(w · x - b_r) ≤ 0. To correct the mistake, we need to \"move\" the values of w · x and b_r toward each other. We do so by modifying only the values of the b_r's for which y_r(w · x - b_r) ≤ 0, replacing them with b_r - y_r. We also replace the value of w with w + (Σ y_r)x, where the sum is taken over the indices r for which there was a prediction error, i.e., y_r(w · x - b_r) ≤ 0. \n\nAn illustration of the update rule is given in Fig. 1. In the example, we used the set Y = {1, ..., 5}. (Note that b_5 = ∞ is omitted from all the plots in Fig. 1.) The correct rank of the instance is y = 4, and thus the value of w · x should fall in the fourth interval, between b_3 and b_4. 
However, in the illustration the value of w · x fell below b_1 and the predicted rank is ŷ = 1. The threshold values b_1, b_2 and b_3 are a source of the error, since each of b_1, b_2, b_3 is higher than w · x. To mend the mistake the algorithm decreases b_1, b_2 and b_3 by a unit value, replacing them with b_1 - 1, b_2 - 1 and b_3 - 1. It also modifies w to be w + 3x since Σ_{r : y_r(w · x - b_r) ≤ 0} y_r = 3. Thus, the inner-product w · x increases by 3‖x‖^2. This update is illustrated in the middle plot of Fig. 1. The updated prediction rule is sketched on the right hand side of Fig. 1. Note that after the update, the predicted rank of x is ŷ = 3, which is closer to the true rank y = 4. The pseudocode of the algorithm is given in Fig. 2. \n\nTo conclude this section, we would like to note that PRank can be straightforwardly combined with Mercer kernels [8] and voting techniques [4] often used for improving the performance of margin classifiers in batch and online settings. \n\n3 Analysis \n\nBefore we prove the mistake bound of the algorithm we first show that it maintains a consistent hypothesis in the sense that it preserves the correct order of the thresholds. Specifically, we show by induction that for any ranking rule that can be derived by the algorithm along its run, (w^1, b^1), ..., (w^{T+1}, b^{T+1}), we have that b^t_1 ≤ ... ≤ b^t_{k-1} for all t. Since the initialization of the thresholds is such that b^1_1 ≤ b^1_2 ≤ ... ≤ b^1_{k-1}, it suffices to show that the claim holds inductively. For simplicity, we write the update rule of PRank in an alternative form. Let [π] be 1 if the predicate π holds and 0 otherwise. We now rewrite the value of τ^t_r (from Fig. 2) as τ^t_r = y^t_r [(w^t · x^t - b^t_r) y^t_r ≤ 0]. Note that the values of b^t_r are integers for all r and t, since for all r we initialize b^1_r = 0 and b^{t+1}_r - b^t_r ∈ {-1, 0, +1}. \n\nLemma 1 (Order Preservation) Let w^t and b^t be the current ranking rule, where b^t_1 ≤ ... 
≤ b^t_{k-1}, and let (x^t, y^t) be an instance-rank pair fed to PRank on round t. Denote by w^{t+1} and b^{t+1} the resulting ranking rule after the update of PRank; then b^{t+1}_1 ≤ ... ≤ b^{t+1}_{k-1}. \n\nProof: In order to show that PRank maintains the order of the thresholds we use the definition of the algorithm for y^t_r, namely y^t_r = +1 for r < y^t and y^t_r = -1 for r ≥ y^t. We prove that b^{t+1}_{r+1} ≥ b^{t+1}_r for all r by showing that \n\nb^t_{r+1} - b^t_r ≥ y^t_{r+1} [(w^t · x^t - b^t_{r+1}) y^t_{r+1} ≤ 0] - y^t_r [(w^t · x^t - b^t_r) y^t_r ≤ 0],   (1) \n\nwhich we obtain by substituting the values of b^{t+1}. Since b^t_r ≤ b^t_{r+1} and b^t_r, b^t_{r+1} ∈ ℤ, the value of b^t_{r+1} - b^t_r on the left hand side of Eq. (1) is a non-negative integer. Recall that y^t_r = +1 if y^t > r and y^t_r = -1 otherwise, and therefore y^t_{r+1} ≤ y^t_r. We now analyze two cases. We first consider the case y^t_{r+1} ≠ y^t_r, which implies that y^t_{r+1} = -1 and y^t_r = +1. In this case, the right hand side of Eq. (1) is at most zero, and the claim trivially holds. The other case is y^t_{r+1} = y^t_r. Here the value of the right hand side of Eq. (1) cannot exceed 1, so we have to consider only the case where b^t_r = b^t_{r+1}. But given these two conditions, the terms y^t_{r+1} [(w^t · x^t - b^t_{r+1}) y^t_{r+1} ≤ 0] and y^t_r [(w^t · x^t - b^t_r) y^t_r ≤ 0] are equal. The right hand side of Eq. (1) is then zero and the inequality holds with equality. • \n\nIn order to simplify the analysis of the algorithm we introduce the following notation. Given a hyperplane w and a set of k-1 thresholds b, we denote by v ∈ ℝ^{n+k-1} the vector which is a concatenation of w and b, that is, v = (w, b). For brevity we refer to the vector v as a ranking rule. Given two vectors v' = (w', b') and v = (w, b) we have v' · v = w' · w + b' · b and ‖v‖^2 = ‖w‖^2 + ‖b‖^2. \n\nTheorem 2 (Mistake bound) Let (x^1, y^1), ..., (x^T, y^T) be an input sequence for PRank, where x^t ∈ ℝ^n and y^t ∈ {1, ..., k}. Denote by R^2 = max_t ‖x^t‖^2. 
Assume that there is a ranking rule v* = (w*, b*) with b*_1 ≤ ... ≤ b*_{k-1} of unit norm that classifies the entire sequence correctly with margin γ = min_{r,t} {(w* · x^t - b*_r) y^t_r} > 0. Then the ranking-loss of the algorithm, Σ_{t=1}^T |ŷ^t - y^t|, is at most (k-1)(R^2 + 1)/γ^2. \n\nProof: Let us fix an example (x^t, y^t) which the algorithm received on round t. By definition the algorithm ranked the example using the ranking rule v^t, which is composed of w^t and the thresholds b^t. Similarly, we denote by v^{t+1} the updated rule (w^{t+1}, b^{t+1}) after round t. That is, w^{t+1} = w^t + (Σ_r τ^t_r) x^t and b^{t+1}_r = b^t_r - τ^t_r for r = 1, 2, ..., k-1. Let us denote by n^t = |ŷ^t - y^t| the difference between the true rank and the predicted rank. It is straightforward to verify that n^t = Σ_r |τ^t_r|. Note that if there was no ranking mistake on round t then τ^t_r = 0 for r = 1, ..., k-1, and thus also n^t = 0. To prove the theorem we bound Σ_t n^t from above by bounding ‖v^{T+1}‖^2 from above and below. First, we derive a lower bound on ‖v^{T+1}‖^2 by bounding v* · v^{t+1}. Substituting the values of w^{t+1} and b^{t+1} we get, \n\nv* · v^{t+1} = v* · v^t + Σ_{r=1}^{k-1} τ^t_r (w* · x^t - b*_r).   (2) \n\nWe further bound the right term by considering two cases, using the definition of τ^t_r from the pseudocode in Fig. 2. If (w^t · x^t - b^t_r) y^t_r ≤ 0 then τ^t_r = y^t_r; using the assumption that v* ranks the data correctly with a margin of at least γ, we get that τ^t_r (w* · x^t - b*_r) ≥ γ. In the other case, (w^t · x^t - b^t_r) y^t_r > 0, we have τ^t_r = 0 and thus τ^t_r (w* · x^t - b*_r) = 0. Summing now over r we get, \n\nΣ_{r=1}^{k-1} τ^t_r (w* · x^t - b*_r) ≥ n^t γ.   (3) \n\nCombining Eq. (2) and Eq. (3) we get v* · v^{t+1} ≥ v* · v^t + n^t γ. Unfolding the sum, we get that after T rounds the algorithm satisfies v* · v^{T+1} ≥ Σ_t n^t γ = γ Σ_t n^t. 
Plugging this result into the Cauchy-Schwarz inequality, ‖v^{T+1}‖^2 ‖v*‖^2 ≥ (v^{T+1} · v*)^2, and using the assumption that v* is of unit norm, we get the lower bound ‖v^{T+1}‖^2 ≥ (Σ_t n^t)^2 γ^2. \n\nNext, we bound the norm of v from above. As before, assume that an example (x^t, y^t) was ranked using the ranking rule v^t and denote by v^{t+1} the ranking rule after the round. We now expand the values of w^{t+1} and b^{t+1} in the norm of v^{t+1} and get ‖v^{t+1}‖^2 = ‖w^t‖^2 + ‖b^t‖^2 + 2 Σ_r τ^t_r (w^t · x^t - b^t_r) + (Σ_r τ^t_r)^2 ‖x^t‖^2 + Σ_r (τ^t_r)^2. Since τ^t_r ∈ {-1, 0, +1} we have that (Σ_r τ^t_r)^2 ≤ (n^t)^2 and Σ_r (τ^t_r)^2 = n^t, and we therefore get, \n\n‖v^{t+1}‖^2 ≤ ‖v^t‖^2 + 2 Σ_r τ^t_r (w^t · x^t - b^t_r) + (n^t)^2 ‖x^t‖^2 + n^t.   (4) \n\nWe further develop the second term using the update rule of the algorithm and get, \n\nΣ_r τ^t_r (w^t · x^t - b^t_r) = Σ_r [(w^t · x^t - b^t_r) y^t_r ≤ 0] ((w^t · x^t - b^t_r) y^t_r) ≤ 0.   (5) \n\nPlugging Eq. (5) into Eq. (4) and using the bound ‖x^t‖^2 ≤ R^2 we get that ‖v^{t+1}‖^2 ≤ ‖v^t‖^2 + (n^t)^2 R^2 + n^t. Thus, the ranking rule we obtain after T rounds of the algorithm satisfies the upper bound ‖v^{T+1}‖^2 ≤ R^2 Σ_t (n^t)^2 + Σ_t n^t. Combining the lower bound ‖v^{T+1}‖^2 ≥ (Σ_t n^t)^2 γ^2 with the upper bound we have that (Σ_t n^t)^2 γ^2 ≤ ‖v^{T+1}‖^2 ≤ R^2 Σ_t (n^t)^2 + Σ_t n^t. Dividing both sides by γ^2 Σ_t n^t we finally get, \n\nΣ_t n^t ≤ (R^2 [Σ_t (n^t)^2] / [Σ_t n^t] + 1) / γ^2.   (6) \n\nBy definition, n^t is at most k-1, which implies that Σ_t (n^t)^2 ≤ Σ_t n^t (k-1) = (k-1) Σ_t n^t. Using this inequality in Eq. (6) we get the desired bound, Σ_{t=1}^T |ŷ^t - y^t| = Σ_{t=1}^T n^t ≤ [(k-1) R^2 + 1] / γ^2 ≤ [(k-1)(R^2 + 1)] / γ^2. • \n\nFigure 3: Comparison of the time-averaged ranking-loss of PRank, WH, and MCP on synthetic data (left). 
Comparison of the time-averaged ranking-loss of PRank, WH, and MCP on the EachMovie dataset using viewers who rated at least 200 movies (middle) and at least 100 movies (right). \n\n4 Experiments \n\nIn this section we describe experiments we performed that compared PRank with two other online learning algorithms applied to ranking: a multiclass generalization of the perceptron algorithm [2], denoted MCP, and the Widrow-Hoff [9] algorithm for online regression learning, which we denote by WH. For WH we fixed its learning rate to a constant value. The hypotheses the three algorithms maintain share similarities but differ in their complexity: PRank maintains a vector w of dimension n and a vector of k-1 modifiable thresholds b, totaling n + k - 1 parameters; MCP maintains k prototypes which are vectors of size n, yielding kn parameters; WH maintains a single vector w of size n. Therefore, MCP builds the most complex hypothesis of the three while WH builds the simplest. \n\nDue to the lack of space, we only describe two sets of experiments with two different datasets. The dataset used in the first experiment is synthetic and was generated in a similar way to the dataset used by Herbrich et al. [5]. We first generated points x = (x_1, x_2) uniformly at random from the unit square [0, 1]^2. Each point was assigned a rank y from the set {1, ..., 5} according to the following ranking rule: y = max_r {r : 10((x_1 - 0.5)(x_2 - 0.5)) + ξ > b_r}, where b = (-∞, -1, -0.1, 0.25, 1) and ξ is normally distributed noise of zero mean and a standard deviation of 0.125. We generated 100 sequences of instance-rank pairs, each of length 7000. We fed the sequences to the three algorithms and obtained a prediction for each instance. We converted the real-valued predictions of WH into ranks by rounding each prediction to its closest rank value. 
As in [5] we used a non-homogeneous polynomial kernel of degree 2, K(x_1, x_2) = ((x_1 · x_2) + 1)^2, as the inner-product operation between each input instance and the hyperplanes the three algorithms maintain. At each time step, we computed for each algorithm the accumulated ranking-loss normalized by the instantaneous sequence length. Formally, the time-averaged loss after T rounds is (1/T) Σ_{t=1}^T |ŷ^t - y^t|. We computed these losses for T = 1, ..., 7000. To increase the statistical significance of the results we repeated the process 100 times, picking a new random instance-rank sequence of length 7,000 each time, and averaging the instantaneous losses across the 100 runs. The results are depicted on the left hand side of Fig. 3. The 95% confidence intervals are smaller than the symbols used in the plot. In this experiment the performance of MCP is constantly worse than the performance of WH and PRank. WH initially suffers the smallest instantaneous loss, but after about 500 rounds PRank achieves the best performance, and eventually the number of ranking mistakes that PRank suffers is significantly lower than that of both WH and MCP. \n\nIn the second set of experiments we used the EachMovie dataset [7]. This dataset is used for collaborative filtering tasks and contains ratings of movies provided by 61,265 people. Each person in the dataset viewed a subset of movies from a collection of 1623 titles. Each viewer rated each movie that she saw using one of 6 possible ratings: 0, 0.2, 0.4, 0.6, 0.8, 1. We chose subsets of people who viewed a significant number of movies, extracting for evaluation people who had rated at least 100 movies. There were 7,542 such viewers. We chose at random one person among these viewers and set the person's ratings to be the target rank. We used the ratings of all the rest of the people who viewed enough movies as features. 
Thus, the goal is to learn to predict the \"taste\" of a random user using the user's past ratings as feedback and the ratings of fellow viewers as features. The prediction rule associates a weight with each fellow viewer and can therefore be seen as learning correlations between the tastes of different viewers. Next, we subtracted 0.5 from each rating, so the possible ratings are -0.5, -0.3, -0.1, 0.1, 0.3, 0.5. This linear transformation enabled us to assign a value of zero to movies which have not been rated. We fed these feature-rank pairs one at a time, in an online fashion. Since we picked viewers who rated at least 100 movies, we were able to perform at least 100 rounds of online predictions and updates. We repeated this experiment 500 times, choosing each time a random viewer for the target rank. The results are shown on the right hand side of Fig. 3. The error bars in the plot indicate 95% confidence levels. We repeated the experiment using viewers who had seen at least 200 movies. (There were 1802 such viewers.) The results of this experiment are shown in the middle plot of Fig. 3. Along the entire run of the algorithms, PRank is significantly better than WH, and consistently better than the multiclass perceptron algorithm, although the latter employs a bigger hypothesis. \n\nFinally, we have also evaluated the performance of PRank in a batch setting, using the experimental setup of [5]. In this experiment, we ran PRank over the training data as an online algorithm and used its last hypothesis to rank unseen test data. Here as well PRank came out first, outperforming all the algorithms described in [5]. \n\nAcknowledgments Thanks to Sanjoy Dasgupta and Rob Schapire for numerous discussions on ranking problems and algorithms. Thanks also to Eleazar Eskin and Uri Maoz for carefully reading the manuscript. \n\nReferences \n\n[1] William W. Cohen, Robert E. Schapire, and Yoram Singer. 
Learning to order things. Journal of Artificial Intelligence Research, 10:243-270, 1999. \n[2] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Proc. of the Fourteenth Annual Conf. on Computational Learning Theory, 2001. \n[3] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Machine Learning: Proc. of the Fifteenth Intl. Conf., 1998. \n[4] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, 1999. \n[5] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers. MIT Press, 2000. \n[6] J. Kemeny and J. Snell. Mathematical Models in the Social Sciences. MIT Press, 1962. \n[7] Paul McJones. EachMovie collaborative filtering data set. DEC Systems Research Center, 1997. http://www.research.digital.com/SRC/eachmovie/. \n[8] Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998. \n[9] Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. 1960 IRE WESCON Convention Record, 1960. Reprinted in Neurocomputing (MIT Press, 1988). \n", "award": [], "sourceid": 2023, "authors": [{"given_name": "Koby", "family_name": "Crammer", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}]}