{"title": "Pranking with Ranking", "book": "Advances in Neural Information Processing Systems", "page_first": 641, "page_last": 647, "abstract": null, "full_text": "Pranking with Ranking \n\nKoby  Crammer  and  Yoram  Singer \nSchool of Computer Science  &  Engineering \n\nThe Hebrew University,  Jerusalem 91904,  Israel \n\n{kobics,singer}@cs.huji.ac.il \n\nAbstract \n\nWe  discuss  the  problem  of ranking  instances.  In  our  framework \neach  instance  is  associated  with  a  rank  or  a  rating,  which  is  an \ninteger from  1 to k.  Our goal is  to find  a  rank-prediction rule that \nassigns  each  instance  a  rank  which  is  as  close  as  possible  to  the \ninstance's true rank.  We  describe  a  simple  and efficient  online  al(cid:173)\ngorithm, analyze its performance in the mistake bound model, and \nprove  its  correctness.  We  describe  two  sets  of experiments,  with \nsynthetic  data and  with  the  EachMovie  dataset  for  collaborative \nfiltering.  In  the experiments we  performed,  our algorithm outper(cid:173)\nforms  online algorithms for  regression and classification applied to \nranking. \n\n1 \n\nIntroduction \n\nThe ranking problem we  discuss in this paper shares common properties with both \nclassification  and  regression  problems.  As  in  classification  problems  the  goal  is  to \nassign  one  of k  possible  labels  to  a  new  instance.  Similar  to  regression  problems, \nthe set of k  labels is  structured as there is  a  total order relation between the labels. \nWe  refer to the labels as ranks and without loss of generality assume that the ranks \nconstitute the  set  {I, 2, .. . , k} .  Settings  in  which  it  is  natural  to  rank or  rate  in(cid:173)\nstances rather than classify are common in tasks such  as  information retrieval and \ncollaborative filtering.  We  use  the latter as  our running example.  
In collaborative filtering the goal is to predict a user's rating on new items such as books or movies given the user's past ratings of similar items. The goal is to determine whether a movie fan will like a new movie and to what degree, which is expressed as a rank. An example of possible ratings might be run-to-see, very-good, good, only-if-you-must, and do-not-bother. While the different ratings carry meaningful semantics, from a learning-theoretic point of view we model the ratings as a totally ordered set (whose size is 5 in the example above). \n\nThe interest in ordering or ranking of objects is by no means new and is still the source of ongoing research in many fields such as mathematical economics, social science, and computer science. Due to lack of space we clearly cannot cover thoroughly previous work related to ranking. For a short overview from a learning-theoretic point of view see [1] and the references therein. One of the main results of [1] underscores a complexity gap between classification learning and ranking learning. To sidestep the inherent intractability problems of ranking learning several approaches have been suggested. One possible approach is to cast a ranking problem as a regression problem. Another approach is to reduce a total order into a set of preferences over pairs [3, 5]. The first case imposes a metric on the set of ranking rules which might not be realistic, while the second approach is time consuming since it requires increasing the sample size from n to O(n^2). \n\nFigure 1: An illustration of the update rule. \n\nIn this paper we consider an alternative approach that directly maintains a totally ordered set via projections. Our starting point is similar to that of Herbrich et al. [5] in the sense that we project each instance into the reals. However, our work then deviates and operates directly on rankings by associating each rank with a distinct sub-interval of the reals and adapting the support of each sub-interval while learning. In the next section we describe a simple and efficient online algorithm that manipulates concurrently the direction onto which we project the instances and the division into sub-intervals. In Sec. 3 we prove the correctness of the algorithm and analyze its performance in the mistake bound model. We describe in Sec. 4 experiments that compare the algorithm to online algorithms for classification and regression applied to ranking, which demonstrate the merits of our approach. \n\n2 The PRank Algorithm \n\nThis paper focuses on online algorithms for ranking instances. We are given a sequence (x^1, y^1), ..., (x^t, y^t), ... of instance-rank pairs. Each instance x^t is in R^n and its corresponding rank y^t is an element from a finite set Y with a total order relation. We assume without loss of generality that Y = {1, 2, ..., k} with \">\" as the order relation. The total order over the set Y induces a partial order over the instances in the following natural sense. We say that x^t is preferred over x^s if y^t > y^s. We also say that x^t and x^s are not comparable if neither y^t > y^s nor y^t < y^s. We denote this case simply as y^t = y^s. Note that the induced partial order is of a unique form in which the instances form k equivalence classes which are totally ordered. A ranking rule H is a mapping from instances to ranks, H : R^n -> Y. The family of ranking rules we discuss in this paper employs a vector w in R^n and a set of k thresholds b_1 <= ... <= b_{k-1} <= b_k = infinity. 
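This family of threshold-based ranking rules can be sketched in a few lines. The snippet below is only an illustrative sketch (NumPy and the name predict_rank are our own choices, not part of the paper): it returns the index of the first threshold that w . x falls below, with b_k = infinity kept implicit so the minimum is always well defined.

```python
import numpy as np

def predict_rank(w, b, x):
    """Predicted rank: index of the first (smallest) threshold b_r with w.x < b_r.
    b holds b_1 <= ... <= b_{k-1}; b_k = +infinity is implicit."""
    score = np.dot(w, x)
    thresholds = np.append(b, np.inf)              # append the implicit b_k = infinity
    return int(np.argmax(score < thresholds)) + 1  # ranks are 1-based
```

For example, with w = (1, 0) and b = (0.5, 1.5), the three instances (0, 0), (1, 0), and (2, 0) project to 0, 1, and 2 and receive ranks 1, 2, and 3.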
For convenience we denote by b = (b_1, ..., b_{k-1}) the vector of thresholds excluding b_k, which is fixed to infinity. Given a new instance x the ranking rule first computes the inner-product between w and x. The predicted rank is then defined to be the index of the first (smallest) threshold b_r for which w . x < b_r. This type of ranking rule divides the space into parallel equally-ranked regions: all the instances that satisfy b_{r-1} < w . x < b_r are assigned the same rank r. Formally, given a ranking rule defined by w and b the predicted rank of an instance x is, H(x) = min_{r in {1,...,k}} {r : w . x - b_r < 0}. Note that the above minimum is always well defined since we set b_k = infinity. \n\nThe analysis that we use in this paper is based on the mistake bound model for online learning. The algorithm we describe works in rounds. On round t the learning algorithm gets an instance x^t. Given x^t, the algorithm outputs a rank, yhat^t = min_r {r : w . x^t - b_r < 0}. It then receives the correct rank y^t and updates its ranking rule by modifying w and b. We say that our algorithm made a ranking mistake if yhat^t != y^t. (For a discussion of the type of partial order induced here see [6].) \n\nInitialize: Set w^1 = 0, b^1_1 = ... = b^1_{k-1} = 0, b^1_k = infinity. \nLoop: For t = 1, 2, ..., T \n- Get a new instance x^t in R^n. \n- Predict yhat^t = min_{r in {1,...,k}} {r : w^t . x^t - b^t_r < 0}. \n- Get a new label y^t. \n- If yhat^t != y^t update w^t (otherwise set w^{t+1} = w^t and b^{t+1}_r = b^t_r for all r): \n  1. For r = 1, ..., k - 1: If y^t <= r Then y^t_r = -1 Else y^t_r = +1. \n  2. For r = 1, ..., k - 1: If (w^t . x^t - b^t_r) y^t_r <= 0 Then tau^t_r = y^t_r Else tau^t_r = 0. \n  3. Update w^{t+1} <- w^t + (sum_r tau^t_r) x^t. For r = 1, ..., k - 1 update: b^{t+1}_r <- b^t_r - tau^t_r. \nOutput: H(x) = min_{r in {1,...,k}} {r : w^{T+1} . x - b^{T+1}_r < 0}. \n\nFigure 2: The PRank algorithm. \n\nWe wish to make the predicted rank as close as possible to the true rank. Formally, the goal of the learning algorithm is to minimize the ranking-loss, which is defined to be the number of thresholds between the true rank and the predicted rank. Using the representation of ranks as integers in {1, ..., k}, the ranking-loss after T rounds is equal to the accumulated difference between the predicted and true rank-values, sum_{t=1}^T |yhat^t - y^t|. The algorithm we describe updates its ranking rule only on rounds on which it made ranking mistakes. Such algorithms are called conservative. \n\nWe now describe the update rule of the algorithm, which is motivated by the perceptron algorithm for classification and hence we call it the PRank algorithm (for Perceptron Ranking). For simplicity, we omit the index of the round when referring to an input instance-rank pair (x, y) and the ranking rule w and b. Since b_1 <= b_2 <= ... <= b_{k-1} <= b_k, the predicted rank is correct if w . x > b_r for r = 1, ..., y - 1 and w . x < b_r for r = y, ..., k - 1. We represent the above inequalities by expanding the rank y into k - 1 virtual variables y_1, ..., y_{k-1}. We set y_r = +1 for the case w . x > b_r and y_r = -1 for w . x < b_r. Put another way, a rank value y induces the vector (y_1, ..., y_{k-1}) = (+1, ..., +1, -1, ..., -1) where the maximal index r for which y_r = +1 is y - 1. Thus, the prediction of a ranking rule is correct if y_r (w . x - b_r) > 0 for all r. If the algorithm makes a mistake by ranking x as yhat instead of y then there is at least one threshold, indexed r, for which the value of w . x is on the wrong side of b_r, i.e. y_r (w . x - b_r) <= 0. 
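The conservative scheme just described (expand y into the virtual labels y_r, correct only the violated thresholds, and move w by their sum) can be sketched as one complete online pass. This is a sketch under our own naming conventions, not the authors' code; the paper itself only gives pseudocode.

```python
import numpy as np

def prank_predict(w, b, x):
    """H(x) = min_r { r : w.x - b_r < 0 }, with the implicit b_k = +inf appended."""
    return int(np.argmax(w @ x < np.append(b, np.inf))) + 1

def prank_fit(examples, n, k):
    """One online pass of PRank. examples: iterable of (x, y) with x an n-vector
    and y in {1, ..., k}. Returns the learned (w, b) with b = (b_1, ..., b_{k-1})."""
    w = np.zeros(n)
    b = np.zeros(k - 1)                       # b_k = +inf is kept implicit
    for x, y in examples:
        y_hat = prank_predict(w, b, x)
        if y_hat != y:                        # conservative: update only on mistakes
            r = np.arange(1, k)               # r = 1, ..., k-1
            y_r = np.where(y <= r, -1.0, 1.0) # virtual labels
            tau = np.where((w @ x - b) * y_r <= 0, y_r, 0.0)
            w = w + tau.sum() * x
            b = b - tau
    return w, b
```

On a toy separable stream (scalar instances 0, 1, 2 with ranks 1, 2, 3), repeated presentation drives the thresholds apart while keeping them ordered.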
To correct the mistake, we need to \"move\" the values of w . x and b_r toward each other. We do so by modifying only the values of the b_r's for which y_r (w . x - b_r) <= 0, replacing each such b_r with b_r - y_r. We also replace the value of w with w + (sum y_r) x, where the sum is taken over the indices r for which there was a prediction error, i.e., y_r (w . x - b_r) <= 0. \n\nAn illustration of the update rule is given in Fig. 1. In the example, we used the set Y = {1, ..., 5}. (Note that b_5 = infinity is omitted from all the plots in Fig. 1.) The correct rank of the instance is y = 4, and thus the value of w . x should fall in the fourth interval, between b_3 and b_4. However, in the illustration the value of w . x fell below b_1 and the predicted rank is yhat = 1. The threshold values b_1, b_2 and b_3 are a source of the error since each of b_1, b_2, b_3 is higher than w . x. To mend the mistake the algorithm decreases b_1, b_2 and b_3 by a unit value, replacing them with b_1 - 1, b_2 - 1 and b_3 - 1. It also modifies w to be w + 3x since sum_{r : y_r (w . x - b_r) <= 0} y_r = 3. Thus, the inner-product w . x increases by 3 ||x||^2. This update is illustrated in the middle plot of Fig. 1. The updated prediction rule is sketched on the right hand side of Fig. 1. Note that after the update, the predicted rank of x is yhat = 3, which is closer to the true rank y = 4. The pseudocode of the algorithm is given in Fig. 2. \n\nTo conclude this section we note that PRank can be straightforwardly combined with Mercer kernels [8] and voting techniques [4] often used for improving the performance of margin classifiers in batch and online settings. 
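The single update illustrated in Fig. 1 can be reproduced numerically. The concrete numbers below are our own (the figure's exact values are not given in the text); they recreate the scenario where the true rank is y = 4 but w . x falls below b_1, so three thresholds drop by one unit, w gains 3x, and w . x grows by exactly 3 ||x||^2.

```python
import numpy as np

# Fig. 1 scenario with made-up numbers: k = 5, true rank y = 4, and w.x below b_1,
# so the predicted rank is y_hat = 1 and b_1, b_2, b_3 all sit on the wrong side.
k = 5
w = np.array([1.0, 0.0])
b = np.array([2.0, 3.0, 4.0, 5.0])      # b_1..b_4; b_5 = +inf is implicit
x = np.array([1.0, 1.0])                # w.x = 1.0 < b_1 = 2.0
y = 4

r = np.arange(1, k)
y_r = np.where(y <= r, -1.0, 1.0)       # virtual labels (+1, +1, +1, -1)
tau = np.where((w @ x - b) * y_r <= 0, y_r, 0.0)  # (+1, +1, +1, 0): b_4 is fine
w_new = w + tau.sum() * x               # w + 3x
b_new = b - tau                         # b_1-1, b_2-1, b_3-1; b_4 unchanged
```

After the update, b becomes (1, 2, 3, 5) and the projection w . x jumps from 1 to 7, an increase of 3 ||x||^2 = 6, exactly as in the text.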
\n\n3 Analysis \n\nBefore we prove the mistake bound of the algorithm we first show that it maintains a consistent hypothesis in the sense that it preserves the correct order of the thresholds. Specifically, we show by induction that for any ranking rule that can be derived by the algorithm along its run, (w^1, b^1), ..., (w^{T+1}, b^{T+1}), we have that b^t_1 <= ... <= b^t_{k-1} for all t. Since the initialization of the thresholds is such that b^1_1 <= b^1_2 <= ... <= b^1_{k-1}, it suffices to show that the claim holds inductively. For simplicity, we write the update rule of PRank in an alternative form. Let [pi] be 1 if the predicate pi holds and 0 otherwise. We now rewrite the value of tau^t_r (from Fig. 2) as tau^t_r = y^t_r [(w^t . x^t - b^t_r) y^t_r <= 0]. Note that the values of b^t_r are integers for all r and t since for all r we initialize b^1_r = 0, and b^{t+1}_r - b^t_r is in {-1, 0, +1}. \n\nLemma 1 (Order Preservation) Let w^t and b^t be the current ranking rule, where b^t_1 <= ... <= b^t_{k-1}, and let (x^t, y^t) be an instance-rank pair fed to PRank on round t. Denote by w^{t+1} and b^{t+1} the resulting ranking rule after the update of PRank. Then b^{t+1}_1 <= ... <= b^{t+1}_{k-1}. \n\nProof: In order to show that PRank maintains the order of the thresholds we use the definition of the algorithm for y^t_r, namely y^t_r = +1 for r < y^t and y^t_r = -1 for r >= y^t. We now prove that b^{t+1}_{r+1} >= b^{t+1}_r for all r by showing that \n\nb^t_{r+1} - b^t_r >= y^t_{r+1} [(w^t . x^t - b^t_{r+1}) y^t_{r+1} <= 0] - y^t_r [(w^t . x^t - b^t_r) y^t_r <= 0], (1) \n\nwhich we obtain by substituting the values of b^{t+1}. Since b^t_r <= b^t_{r+1} and b^t_r, b^t_{r+1} are in Z, the value of b^t_{r+1} - b^t_r on the left hand side of Eq. (1) is a non-negative integer. Recall that y^t_r = +1 if y^t > r and y^t_r = -1 otherwise, and therefore y^t_{r+1} <= y^t_r. We now analyze two cases. 
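Lemma 1 can also be checked empirically. The sketch below (synthetic random data; all names are ours) applies a single PRank update to randomly drawn sorted integer thresholds and counts order violations. Integrality matters here, just as in the proof: with non-integer thresholds a gap smaller than 1 between adjacent b_r's could be overtaken by the unit-sized update.

```python
import numpy as np

def prank_update(w, b, x, y, k):
    """One PRank update (Fig. 2, steps 1-3) for an example (x, y)."""
    r = np.arange(1, k)
    y_r = np.where(y <= r, -1.0, 1.0)
    tau = np.where((w @ x - b) * y_r <= 0, y_r, 0.0)
    return w + tau.sum() * x, b - tau

rng = np.random.default_rng(0)
k, n = 6, 4
violations = 0
for _ in range(1000):
    # sorted *integer* thresholds, as produced by PRank's own updates
    b = np.sort(rng.integers(-3, 4, size=k - 1)).astype(float)
    w, x = rng.normal(size=n), rng.normal(size=n)
    y = int(rng.integers(1, k + 1))
    _, b_new = prank_update(w, b, x, y, k)
    violations += int(np.any(np.diff(b_new) < 0))   # order broken?
```

Over the 1000 random trials the violation count stays at zero, matching the lemma.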
We first consider the case y^t_{r+1} != y^t_r, which implies that y^t_{r+1} = -1 and y^t_r = +1. In this case, the right hand side of Eq. (1) is at most zero, and the claim trivially holds. The other case is when y^t_{r+1} = y^t_r. Here we get that the value of the right hand side of Eq. (1) cannot exceed 1. We therefore have to consider only the case where b^t_r = b^t_{r+1} and y^t_{r+1} = y^t_r. But given these two conditions we have that y^t_{r+1} [(w^t . x^t - b^t_{r+1}) y^t_{r+1} <= 0] and y^t_r [(w^t . x^t - b^t_r) y^t_r <= 0] are equal. The right hand side of Eq. (1) is now zero and the inequality holds with equality. \n\nIn order to simplify the analysis of the algorithm we introduce the following notation. Given a hyperplane w and a set of k - 1 thresholds b we denote by v in R^{n+k-1} the vector which is a concatenation of w and b, that is v = (w, b). For brevity we refer to the vector v as a ranking rule. Given two vectors v' = (w', b') and v = (w, b) we have v' . v = w' . w + b' . b and ||v||^2 = ||w||^2 + ||b||^2. \n\nTheorem 2 (Mistake bound) Let (x^1, y^1), ..., (x^T, y^T) be an input sequence for PRank where x^t in R^n and y^t in {1, ..., k}. Denote by R^2 = max_t ||x^t||^2. Assume that there is a ranking rule v* = (w*, b*) with b*_1 <= ... <= b*_{k-1} of a unit norm that classifies the entire sequence correctly with margin gamma = min_{r,t} {(w* . x^t - b*_r) y^t_r} > 0. Then, the ranking loss of the algorithm, sum_{t=1}^T |yhat^t - y^t|, is at most (k - 1)(R^2 + 1) / gamma^2. \n\nProof: Let us fix an example (x^t, y^t) which the algorithm received on round t. By definition the algorithm ranked the example using the ranking rule v^t which is composed of w^t and the thresholds b^t. Similarly, we denote by v^{t+1} the updated rule (w^{t+1}, b^{t+1}) after round t. That is, w^{t+1} = w^t + (sum_r tau^t_r) x^t and b^{t+1}_r = b^t_r - tau^t_r for r = 1, 2, ..., k - 1. Let us denote by n^t = |yhat^t - y^t| the difference between the true rank and the predicted rank. It is straightforward to verify that n^t = sum_r |tau^t_r|. Note that if there wasn't a ranking mistake on round t then tau^t_r = 0 for r = 1, ..., k - 1, and thus also n^t = 0. To prove the theorem we bound sum_t n^t from above by bounding ||v^{T+1}||^2 from above and below. First, we derive a lower bound on ||v^{T+1}||^2 by bounding v* . v^{t+1}. Substituting the values of w^{t+1} and b^{t+1} we get, \n\nv* . v^{t+1} = v* . v^t + sum_{r=1}^{k-1} tau^t_r (w* . x^t - b*_r). (2) \n\nWe further bound the right term by considering two cases, using the definition of tau^t_r from the pseudocode in Fig. 2. If (w^t . x^t - b^t_r) y^t_r <= 0 then tau^t_r = y^t_r; using the assumption that v* ranks the data correctly with a margin of at least gamma we get that tau^t_r (w* . x^t - b*_r) >= gamma. In the other case, for which (w^t . x^t - b^t_r) y^t_r > 0, we have tau^t_r = 0 and thus tau^t_r (w* . x^t - b*_r) = 0. Summing now over r we get, \n\nsum_{r=1}^{k-1} tau^t_r (w* . x^t - b*_r) >= n^t gamma. (3) \n\nCombining Eq. (2) and Eq. (3) we get v* . v^{t+1} >= v* . v^t + n^t gamma. Unfolding the sum, we get that after T rounds the algorithm satisfies v* . v^{T+1} >= gamma sum_t n^t. Plugging this result into the Cauchy-Schwartz inequality (||v^{T+1}||^2 ||v*||^2 >= (v^{T+1} . v*)^2) and using the assumption that v* is of a unit norm we get the lower bound ||v^{T+1}||^2 >= (sum_t n^t)^2 gamma^2. \n\nNext, we bound the norm of v from above. As before, assume that an example (x^t, y^t) was ranked using the ranking rule v^t and denote by v^{t+1} the ranking rule after the round. 
We now expand the values of w^{t+1} and b^{t+1} in the norm of v^{t+1} and get, \n\n||v^{t+1}||^2 = ||w^t||^2 + ||b^t||^2 + 2 sum_r tau^t_r (w^t . x^t - b^t_r) + (sum_r tau^t_r)^2 ||x^t||^2 + sum_r (tau^t_r)^2. \n\nSince tau^t_r is in {-1, 0, +1} we have that (sum_r tau^t_r)^2 <= (n^t)^2 and sum_r (tau^t_r)^2 = n^t, and we therefore get, \n\n||v^{t+1}||^2 <= ||v^t||^2 + 2 sum_r tau^t_r (w^t . x^t - b^t_r) + (n^t)^2 ||x^t||^2 + n^t. (4) \n\nWe further develop the second term using the update rule of the algorithm and get, \n\nsum_r tau^t_r (w^t . x^t - b^t_r) = sum_r [(w^t . x^t - b^t_r) y^t_r <= 0] ((w^t . x^t - b^t_r) y^t_r) <= 0. (5) \n\nPlugging Eq. (5) into Eq. (4) and using the bound ||x^t||^2 <= R^2 we get that ||v^{t+1}||^2 <= ||v^t||^2 + (n^t)^2 R^2 + n^t. Thus, the ranking rule we obtain after T rounds of the algorithm satisfies the upper bound ||v^{T+1}||^2 <= R^2 sum_t (n^t)^2 + sum_t n^t. Combining the lower bound ||v^{T+1}||^2 >= (sum_t n^t)^2 gamma^2 with the upper bound we have that (sum_t n^t)^2 gamma^2 <= ||v^{T+1}||^2 <= R^2 sum_t (n^t)^2 + sum_t n^t. Dividing both sides by gamma^2 sum_t n^t we finally get, \n\nsum_t n^t <= (R^2 [sum_t (n^t)^2] / [sum_t n^t] + 1) / gamma^2. (6) \n\nBy definition, n^t is at most k - 1, which implies that sum_t (n^t)^2 <= (k - 1) sum_t n^t. Using this inequality in Eq. (6) we get the desired bound, sum_{t=1}^T |yhat^t - y^t| = sum_{t=1}^T n^t <= [(k - 1) R^2 + 1] / gamma^2 <= [(k - 1)(R^2 + 1)] / gamma^2. \n\nFigure 3: Comparison of the time-averaged ranking-loss of PRank, WH, and MCP on synthetic data (left). Comparison of the time-averaged ranking-loss of PRank, WH, and MCP on the EachMovie dataset using viewers who rated at least 200 movies (middle) and at least 100 movies (right). 
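Theorem 2 lends itself to a quick numerical illustration. The sketch below (synthetic data; the fixed rule (w*, b*) and all constants are our own choices) labels random points with a known threshold rule, runs one PRank pass over the sequence, and compares the accumulated ranking loss against the bound (k - 1)(R^2 + 1) / gamma^2, where gamma is the empirical margin of the normalized labeling rule.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n, T = 4, 3, 2000
w_star = np.array([1.0, -1.0, 0.5])
b_star = np.array([-0.5, 0.0, 0.7])             # sorted thresholds of the labeler
v_norm = np.sqrt(w_star @ w_star + b_star @ b_star)

X = rng.uniform(-1, 1, size=(T, n))
scores = X @ w_star
Y = 1 + np.sum(scores[:, None] >= b_star[None, :], axis=1)  # rank = #thresholds below

# margin gamma of the unit-normalized rule v* = (w*, b*)/||(w*, b*)|| on this sample
r = np.arange(1, k)
y_r = np.where(Y[:, None] <= r[None, :], -1.0, 1.0)
gamma = np.min((scores[:, None] - b_star[None, :]) * y_r) / v_norm
R2 = np.max(np.sum(X ** 2, axis=1))
bound = (k - 1) * (R2 + 1) / gamma ** 2

# one PRank pass, accumulating the ranking loss sum_t |y_hat - y|
w, b = np.zeros(n), np.zeros(k - 1)
loss = 0
for x, y in zip(X, Y):
    y_hat = int(np.argmax(w @ x < np.append(b, np.inf))) + 1
    loss += abs(y_hat - y)
    if y_hat != y:
        yr = np.where(y <= r, -1.0, 1.0)
        tau = np.where((w @ x - b) * yr <= 0, yr, 0.0)
        w, b = w + tau.sum() * x, b - tau
```

The accumulated loss stays below the theorem's bound, and the learned thresholds remain ordered as Lemma 1 guarantees.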
\n\n4 Experiments \n\nIn this section we describe experiments we performed that compared PRank with two other online learning algorithms applied to ranking: a multiclass generalization of the perceptron algorithm [2], denoted MCP, and the Widrow-Hoff [9] algorithm for online regression learning, which we denote by WH. For WH we fixed its learning rate to a constant value. The hypotheses the three algorithms maintain share similarities but differ in their complexity: PRank maintains a vector w of dimension n and a vector of k - 1 modifiable thresholds b, totaling n + k - 1 parameters; MCP maintains k prototypes which are vectors of size n, yielding kn parameters; WH maintains a single vector w of size n. Therefore, MCP builds the most complex hypothesis of the three while WH builds the simplest. \n\nDue to the lack of space, we only describe two sets of experiments with two different datasets. The dataset used in the first experiment is synthetic and was generated in a similar way to the dataset used by Herbrich et al. [5]. We first generated random points x = (x_1, x_2) uniformly at random from the unit square [0, 1]^2. Each point was assigned a rank y from the set {1, ..., 5} according to the following ranking rule, y = max_r {r : 10((x_1 - 0.5)(x_2 - 0.5)) + xi > b_r} where b = (-infinity, -1, -0.1, 0.25, 1) and xi is normally distributed noise of zero mean and a standard deviation of 0.125. We generated 100 sequences of instance-rank pairs, each of length 7000. We fed the sequences to the three algorithms and obtained a prediction for each instance. We converted the real-valued predictions of WH into ranks by rounding each prediction to its closest rank value. 
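The synthetic generator just described can be sketched as follows. This is only a sketch of the stated labeling rule; any sampling details the paper does not spell out (e.g. the random seed) are our assumptions.

```python
import numpy as np

def make_synthetic(T, rng):
    """T instance-rank pairs: x uniform on [0,1]^2, rank y in {1,...,5} given by
    y = max_r { r : 10 (x_1 - 0.5)(x_2 - 0.5) + xi > b_r }, xi ~ N(0, 0.125)."""
    b = np.array([-np.inf, -1.0, -0.1, 0.25, 1.0])   # b_1 = -inf guarantees y >= 1
    X = rng.uniform(0.0, 1.0, size=(T, 2))
    noise = rng.normal(0.0, 0.125, size=T)
    score = 10.0 * (X[:, 0] - 0.5) * (X[:, 1] - 0.5) + noise
    Y = np.sum(score[:, None] > b[None, :], axis=1)  # count thresholds exceeded
    return X, Y
```

Counting how many thresholds the noisy score exceeds implements the max_r rule directly, since b is sorted.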
As in [5] we used a non-homogeneous polynomial kernel of degree 2, K(x_1, x_2) = ((x_1 . x_2) + 1)^2, as the inner-product operation between each input instance and the hyperplanes the three algorithms maintain. At each time step, we computed for each algorithm the accumulated ranking-loss normalized by the instantaneous sequence length. Formally, the time-averaged loss after T rounds is (1/T) sum_{t=1}^T |yhat^t - y^t|. We computed these losses for T = 1, ..., 7000. To increase the statistical significance of the results we repeated the process 100 times, picking a new random instance-rank sequence of length 7,000 each time, and averaging the instantaneous losses across the 100 runs. The results are depicted on the left hand side of Fig. 3. The 95% confidence intervals are smaller than the symbols used in the plot. In this experiment the performance of MCP is constantly worse than the performance of WH and PRank. WH initially suffers the smallest instantaneous loss, but after about 500 rounds PRank achieves the best performance, and eventually the number of ranking mistakes that PRank suffers is significantly lower than both WH and MCP. \n\nIn the second set of experiments we used the EachMovie dataset [7]. This dataset is used for collaborative filtering tasks and contains ratings of movies provided by 61,265 people. Each person in the dataset viewed a subset of movies from a collection of 1,623 titles. Each viewer rated each movie that she saw using one of 6 possible ratings: 0, 0.2, 0.4, 0.6, 0.8, 1. We chose subsets of people who viewed a significant number of movies, extracting for evaluation people who have rated at least 100 movies. There were 7,542 such viewers. We chose at random one person among these viewers and set the person's ratings to be the target rank. 
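The degree-2 kernel and the time-averaged ranking loss used in the first experiment can be written down directly. This is a sketch: the function names and the prefix-vector form of the loss are ours.

```python
import numpy as np

def poly_kernel(x1, x2):
    """Non-homogeneous polynomial kernel of degree 2: ((x1 . x2) + 1)^2."""
    return (np.dot(x1, x2) + 1.0) ** 2

def time_averaged_loss(y_true, y_pred):
    """(1/T) * sum_{t<=T} |y_hat^t - y^t| for every prefix length T = 1, ..., len."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    cum = np.cumsum(np.abs(y_pred - y_true))
    return cum / np.arange(1, len(cum) + 1)
```

For instance, true ranks (1, 2, 3) against predictions (1, 3, 1) give per-round absolute errors (0, 1, 2) and time-averaged losses (0, 0.5, 1.0).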
We used the ratings of all the rest of the people who viewed enough movies as features. Thus, the goal is to learn to predict the \"taste\" of a random user using the user's past ratings as feedback and the ratings of fellow viewers as features. The prediction rule associates a weight with each fellow viewer and therefore can be seen as learning correlations between the tastes of different viewers. Next, we subtracted 0.5 from each rating, so the possible ratings are -0.5, -0.3, -0.1, 0.1, 0.3, 0.5. This linear transformation enabled us to assign a value of zero to movies which have not been rated. We fed these feature-rank pairs one at a time, in an online fashion. Since we picked viewers who rated at least 100 movies, we were able to perform at least 100 rounds of online predictions and updates. We repeated this experiment 500 times, choosing each time a random viewer for the target rank. The results are shown on the right hand side of Fig. 3. The error bars in the plot indicate 95% confidence levels. We repeated the experiment using viewers who have seen at least 200 movies. (There were 1,802 such viewers.) The results of this experiment are shown in the middle plot of Fig. 3. Along the entire run of the algorithms, PRank is significantly better than WH, and consistently better than the multiclass perceptron algorithm, although the latter employs a bigger hypothesis. \n\nFinally, we have also evaluated the performance of PRank in a batch setting, using the experimental setup of [5]. In this experiment, we ran PRank over the training data as an online algorithm and used its last hypothesis to rank unseen test data. Here as well PRank came out first, outperforming all the algorithms described in [5]. 
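The feature construction above, shifting the six ratings by -0.5 so that unrated movies can be encoded as exactly zero, can be sketched on a toy ratings matrix. The matrix below is our own example, not EachMovie data.

```python
import numpy as np

def movie_features(ratings):
    """ratings: (viewers, movies) array with np.nan marking unrated entries.
    Returns centred features in {-0.5, ..., 0.5}; unrated entries become 0."""
    shifted = ratings - 0.5
    return np.where(np.isnan(ratings), 0.0, shifted)

# Toy example: 2 fellow viewers, 3 movies, nan = not rated.
R = np.array([[0.8, np.nan, 0.0],
              [np.nan, 1.0, 0.4]])
F = movie_features(R)
```

Centering the scale at zero is what makes "unrated" and "neutral" coincide, so a missing rating contributes nothing to the inner product with the learned weight vector.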
\n\nAcknowledgments Thanks to Sanjoy Dasgupta and Rob Schapire for numerous discussions on ranking problems and algorithms. Thanks also to Eleazar Eskin and Uri Maoz for carefully reading the manuscript. \n\nReferences \n\n[1] William W. Cohen, Robert E. Schapire, and Yoram Singer. Learning to order things. Journal of Artificial Intelligence Research, 10:243-270, 1999. \n[2] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Proc. of the Fourteenth Annual Conf. on Computational Learning Theory, 2001. \n[3] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Machine Learning: Proc. of the Fifteenth Intl. Conf., 1998. \n[4] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, 1999. \n[5] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers. MIT Press, 2000. \n[6] J. Kemeny and J. Snell. Mathematical Models in the Social Sciences. MIT Press, 1962. \n[7] Paul McJones. EachMovie collaborative filtering data set. DEC Systems Research Center, 1997. http://www.research.digital.com/SRC/eachmovie/. \n[8] Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998. \n[9] Bernard Widrow and Marcian E. Hoff. Adaptive switching circuits. 1960 IRE WESCON Convention Record, 1960. Reprinted in Neurocomputing (MIT Press, 1988). \n", "award": [], "sourceid": 2023, "authors": [{"given_name": "Koby", "family_name": "Crammer", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}]}