{"title": "Learning to Order Things", "book": "Advances in Neural Information Processing Systems", "page_first": 451, "page_last": 457, "abstract": "", "full_text": "Learning to Order Things \n\nWilliam W. Cohen, Robert E. Schapire, Yoram Singer \n\nAT&T Labs, 180 Park Ave., Florham Park, NJ 07932 \n\n{wcohen,schapire,singer}@research.att.com \n\nAbstract \n\nThere are many applications in which it is desirable to order rather than classify instances. Here we consider the problem of learning how to order, given feedback in the form of preference judgments, i.e., statements to the effect that one instance should be ranked ahead of another. We outline a two-stage approach in which one first learns by conventional means a preference function, of the form PREF(u, v), which indicates whether it is advisable to rank u before v. New instances are then ordered so as to maximize agreements with the learned preference function. We show that the problem of finding the ordering that agrees best with a preference function is NP-complete, even under very restrictive assumptions. Nevertheless, we describe a simple greedy algorithm that is guaranteed to find a good approximation. We then discuss an on-line learning algorithm, based on the \"Hedge\" algorithm, for finding a good linear combination of ranking \"experts.\" We use the ordering algorithm combined with the on-line learning algorithm to find a combination of \"search experts,\" each of which is a domain-specific query expansion strategy for a WWW search engine, and present experimental results that demonstrate the merits of our approach. \n\n1 Introduction \n\nMost previous work in inductive learning has concentrated on learning to classify. However, there are many applications in which it is desirable to order rather than classify instances. An example might be a personalized email filter that gives a priority ordering to unread mail. 
Here we will consider the problem of learning how to construct such orderings, given feedback in the form of preference judgments, i.e., statements that one instance should be ranked ahead of another. \n\nSuch orderings could be constructed based on a learned classifier or regression model, and in fact often are. For instance, it is common practice in information retrieval to rank documents according to their estimated probability of relevance to a query based on a learned classifier for the concept \"relevant document.\" An advantage of learning orderings directly is that preference judgments can be much easier to obtain than the labels required for classification learning. \n\nFor instance, in the email application mentioned above, one approach might be to rank messages according to their estimated probability of membership in the class of \"urgent\" messages, or by some numerical estimate of urgency obtained by regression. Suppose, however, that a user is presented with an ordered list of email messages, and elects to read the third message first. Given this election, it is not necessarily the case that message three is urgent, nor is there sufficient information to estimate any numerical urgency measure; however, it seems quite reasonable to infer that message three should have been ranked ahead of the others. Thus, in this setting, obtaining preference information may be easier and more natural than obtaining the information needed for classification or regression. \n\nIn the remainder of this paper, we will investigate the following two-stage approach to learning how to order. In stage one, we learn a preference function, a two-argument function PREF(u, v) which returns a numerical measure of how certain it is that u should be ranked before v. 
In stage two, we use the learned preference function to order a set of new instances U; to accomplish this, we evaluate the learned function PREF(u, v) on all pairs of instances u, v ∈ U, and choose an ordering of U that agrees, as much as possible, with these pairwise preference judgments. This general approach is novel; for related work in various fields see, for instance, references [2, 3, 1, 7, 10]. \n\nAs we will see, given an appropriate feature set, learning a preference function can be reduced to a fairly conventional classification learning problem. On the other hand, finding a total order that agrees best with a preference function is NP-complete. Nevertheless, we show that there is an efficient greedy algorithm that always finds a good approximation to the best ordering. After presenting these results on the complexity of ordering instances using a preference function, we then describe a specific algorithm for learning a preference function. The algorithm is an on-line weight allocation algorithm, much like the weighted majority algorithm [9] and Winnow [8], and, more directly, Freund and Schapire's [4] \"Hedge\" algorithm. We then present some experimental results in which this algorithm is used to combine the results of several \"search experts,\" each of which is a domain-specific query expansion strategy for a WWW search engine. \n\n2 Preliminaries \n\nLet X be a set of instances (possibly infinite). A preference function PREF is a binary function PREF : X × X → [0, 1]. A value of PREF(u, v) close to 1 is interpreted as a strong recommendation that u should be ranked before v, and a value close to 0 as a strong recommendation that v should be ranked before u. A value close to 1/2 is interpreted as an abstention from making a recommendation. As noted above, the hypothesis of our learning system will be a preference function, and new instances will be ranked so as to agree as much as possible with the preferences predicted by this hypothesis. 
\n\nIn standard classification learning, a hypothesis is constructed by combining primitive features. Similarly, in this paper, a preference function will be a combination of other preference functions. In particular, we will typically assume the availability of a set of N primitive preference functions R_1, ..., R_N. These can then be combined in the usual ways, e.g., with a boolean or linear combination of their values; we will be especially interested in the latter combination method. \n\nIt is convenient to assume that the R_i's are well-formed in certain ways. To this end, we introduce a special kind of preference function called a rank ordering. Let S be a totally ordered set (that is, for all pairs of distinct elements s_1, s_2 ∈ S, either s_1 < s_2 or s_1 > s_2) with '>' as the comparison operator. An ordering function into S is a function f : X → S. The function f induces the preference function R_f, defined as \n\nR_f(u, v) = 1 if f(u) > f(v); 0 if f(u) < f(v); 1/2 otherwise. \n\nWe call R_f a rank ordering for X into S. If R_f(u, v) = 1, then we say that u is preferred to v, or u is ranked higher than v. \n\nIt is sometimes convenient to allow an ordering function to \"abstain\" and not give a preference for a pair u, v. Let φ be a special symbol not in S, and let f be a function into S ∪ {φ}. We will interpret the mapping f(u) = φ to mean that u is \"unranked,\" and let R_f(u, v) = 1/2 if either u or v is unranked. \n\nTo give concrete examples of rank orderings, imagine learning to order documents based on the words that they contain. To model this, let X be the set of all documents in a repository, and for N words w_1, ..., w_N, let f_i(u) be the number of occurrences of w_i in u. Then R_{f_i} will prefer u to v whenever w_i occurs more often in u than in v. As a second example, consider a meta-search application in which the goal is to combine the rankings of several WWW search engines. For N search engines e_1, ..., e_N, one might define f_i so that R_{f_i} prefers u to v whenever u is ranked ahead of v in the list L_i produced by the corresponding search engine. To do this, one could let f_i(u) = −k for the document u appearing in the k-th position in the list L_i, and let f_i(u) = φ for any document not appearing in L_i. \n\n3 Ordering instances with a preference function \n\nWe now consider the complexity of finding the total order that agrees best with a learned preference function. To analyze this, we must first quantify the notion of agreement between a preference function PREF and an ordering. One natural notion is the following: Let X be a set, PREF be a preference function, and let ρ be a total ordering of X, expressed again as an ordering function (i.e., ρ(u) > ρ(v) iff u precedes v in the order). We define AGREE(ρ, PREF) to be the sum of PREF(u, v) over all pairs u, v such that u is ranked ahead of v by ρ: \n\nAGREE(ρ, PREF) = Σ_{u,v: ρ(u) > ρ(v)} PREF(u, v).   (1) \n\nIdeally, one would like to find a ρ that maximizes AGREE(ρ, PREF). This general optimization problem is of little interest since in practice, there are many constraints imposed by learning: for instance, PREF must be in some restricted class of functions, and will generally be a combination of relatively well-behaved preference functions R_i. A more interesting question is whether the problem remains hard under such constraints. \n\nThe theorem below gives such a result, showing that the problem is NP-complete even if PREF is restricted to be a linear combination of rank orderings. This holds even if all the rank orderings map into a set S with only three elements, one of which may or may not be φ. (Clearly, if S consists of more than three elements then the problem is still hard.) 
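The objects involved here — ordering functions, the rank orderings they induce (including abstention), and the agreement measure of Eq. (1) — can be made concrete. The following Python sketch is illustrative only; the names are ours, not the authors':

```python
# Illustrative sketch (not the authors' code): a rank ordering R_f
# induced by an ordering function f, with the abstention convention,
# and the agreement measure AGREE of Eq. (1).

UNRANKED = None  # stands in for the special abstention symbol phi


def rank_ordering(f):
    # Return the preference function R_f induced by f.
    def pref(u, v):
        fu, fv = f(u), f(v)
        if fu is UNRANKED or fv is UNRANKED:
            return 0.5  # abstain when either element is unranked
        if fu > fv:
            return 1.0  # u is preferred to v
        if fu < fv:
            return 0.0  # v is preferred to u
        return 0.5      # tie: abstain
    return pref


def agree(rho, pref, X):
    # AGREE(rho, PREF): sum of PREF(u, v) over pairs with rho(u) > rho(v).
    return sum(pref(u, v) for u in X for v in X if rho(u) > rho(v))
```

For instance, with f counting occurrences of a word in a document, rank_ordering(f) prefers whichever document contains the word more often, as in the first example above.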
\n\nTheorem 1 The following decision problem is NP-complete: \nInput: A rational number κ; a set X; a set S with |S| ≥ 3; a collection of N ordering functions f_i : X → S; and a preference function PREF defined as PREF(u, v) = Σ_{i=1}^N w_i R_{f_i}(u, v), where w = (w_1, ..., w_N) is a weight vector in [0, 1]^N with Σ_{i=1}^N w_i = 1. \nQuestion: Does there exist a total order ρ such that AGREE(ρ, PREF) ≥ κ? \n\nThe proof (omitted) is by reduction from CYCLIC-ORDERING [5, 6]. \n\nAlthough this problem is hard when |S| ≥ 3, it becomes tractable for linear combinations of rank orderings into a set S of size two. In brief, suppose one is given X, S and PREF as in Theorem 1, save that S is a two-element set, which we assume without loss of generality to be S = {0, 1}. Now define ρ(u) = Σ_i w_i f_i(u). It can be shown that the total order defined by ρ maximizes AGREE(ρ, PREF). (In case of a tie, i.e., ρ(u) = ρ(v) for distinct u and v, ρ defines only a partial order. The claim still holds in this case for any total order which is consistent with this partial order.) Of course, when |S| = 2, the rank orderings are really only binary classifiers. The fact that this special case is tractable underscores the fact that manipulating orderings can be computationally more difficult than performing the corresponding operations on binary classifiers. \n\nTheorem 1 implies that we are unlikely to find an efficient algorithm that finds the optimal total order for a weighted combination of rank orderings. Fortunately, there do exist efficient algorithms for finding an approximately optimal total order. Figure 1 summarizes a greedy algorithm that produces a good approximation to the best total order, as we will shortly demonstrate. \n\nAlgorithm Order-By-Preferences \nInputs: an instance set X; a preference function PREF \nOutput: an approximately optimal ordering function ρ \nlet V = X \nfor each v ∈ V do π(v) = Σ_{u∈V} PREF(v, u) − Σ_{u∈V} PREF(u, v) \nwhile V is non-empty do \n  let t = argmax_{u∈V} π(u) \n  let ρ(t) = |V| \n  V = V − {t} \n  for each v ∈ V do π(v) = π(v) + PREF(t, v) − PREF(v, t) \nendwhile \n\nFigure 1: A greedy ordering algorithm \n\nThe algorithm is easiest to describe by thinking of PREF as a directed weighted graph where, initially, the set of vertices V is equal to the set of instances X, and each edge u → v has weight PREF(u, v). We assign to each vertex v ∈ V a potential value π(v), which is the weighted sum of the outgoing edges minus the weighted sum of the ingoing edges. That is, π(v) = Σ_{u∈V} PREF(v, u) − Σ_{u∈V} PREF(u, v). The greedy algorithm then picks some node t that has maximum potential, and assigns it a rank by setting ρ(t) = |V|, effectively ordering it ahead of all the remaining nodes. This node, together with all incident edges, is then deleted from the graph, and the potential values π of the remaining vertices are updated appropriately. This process is repeated until the graph is empty; notice that nodes removed in subsequent iterations will have progressively smaller and smaller ranks. \n\nThe next theorem shows that this greedy algorithm comes within a factor of two of optimal. Furthermore, it is relatively simple to show that the approximation factor of 2 is tight. \n\nTheorem 2 Let OPT(PREF) be the weighted agreement achieved by an optimal total order for the preference function PREF and let APPROX(PREF) be the weighted agreement achieved by the greedy algorithm. Then APPROX(PREF) ≥ (1/2) OPT(PREF). 
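A direct transcription of the Figure 1 pseudocode might look as follows; this Python sketch is illustrative, and the names are ours:

```python
# Illustrative transcription of the greedy Order-By-Preferences
# algorithm of Figure 1.

def order_by_preferences(X, pref):
    # Returns a dict rho mapping each instance to its rank;
    # a larger value of rho means earlier in the ordering.
    V = set(X)
    # potential of v: weighted outgoing edges minus weighted incoming edges
    pi = {v: sum(pref(v, u) - pref(u, v) for u in V) for v in V}
    rho = {}
    while V:
        t = max(V, key=lambda u: pi[u])  # node of maximum potential
        rho[t] = len(V)                  # rank it ahead of all remaining nodes
        V.remove(t)
        del pi[t]
        for v in V:                      # update potentials after deleting t
            pi[v] += pref(t, v) - pref(v, t)
    return rho
```

When pref is itself induced by a total order, the sketch recovers that order exactly; in general it is only guaranteed to achieve at least half the optimal weighted agreement, per Theorem 2.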
\n\n4 Learning a good weight vector \n\nIn this section, we look at the problem of learning a good linear combination of a set of preference functions. Specifically, we assume access to a set of ranking experts which provide us with preference functions R_i of a set of instances. The problem, then, is to learn a preference function of the form PREF(u, v) = Σ_{i=1}^N w_i R_i(u, v). We adopt the on-line learning framework first studied by Littlestone [8] in which the weight w_i assigned to each ranking expert R_i is updated incrementally. \n\nLearning is assumed to take place in a sequence of rounds. On the t-th round, the learning algorithm is provided with a set X_t of instances to be ranked and with a set of N preference functions R_i^t of these instances. The learner may compute R_i^t(u, v) for any and all preference functions R_i^t and pairs u, v ∈ X_t before producing a final ordering ρ_t of X_t. Finally, the learner receives feedback from the environment. We assume that the feedback is an arbitrary set of assertions of the form \"u should be preferred to v.\" That is, formally we regard the feedback on the t-th round as a set F_t of pairs (u, v) indicating such preferences. \n\nThe algorithm we propose for this problem is based on the \"weighted majority algorithm\" [9] and, more directly, on the \"Hedge\" algorithm [4]. \n\nAllocate Weights for Ranking Experts \nParameters: β ∈ [0, 1]; initial weight vector w^1 ∈ [0, 1]^N with Σ_{i=1}^N w_i^1 = 1; N ranking experts; number of rounds T \nDo for t = 1, 2, ..., T: \n1. Receive a set of elements X_t and preference functions R_1^t, ..., R_N^t. \n2. Use algorithm Order-By-Preferences to compute an ordering function ρ_t which approximates PREF_t(u, v) = Σ_{i=1}^N w_i^t R_i^t(u, v). \n3. Order X_t using ρ_t. \n4. Receive feedback F_t from the user. \n5. Evaluate losses Loss(R_i^t, F_t) as defined in Eq. (2). \n6. Set the new weight vector w_i^{t+1} = w_i^t · β^{Loss(R_i^t, F_t)} / Z_t, where Z_t is a normalization constant, chosen so that Σ_{i=1}^N w_i^{t+1} = 1. \n\nFigure 2: The on-line weight allocation algorithm. \n\nWe define the loss of a preference function R with respect to the user's feedback F as \n\nLoss(R, F) = ( Σ_{(u,v)∈F} (1 − R(u, v)) ) / |F|.   (2) \n\nThis loss has a natural probabilistic interpretation. If R is viewed as a randomized prediction algorithm that predicts that u will precede v with probability R(u, v), then Loss(R, F) is the probability of R disagreeing with the feedback on a pair (u, v) chosen uniformly at random from F. \n\nWe now can use the Hedge algorithm almost verbatim, as shown in Figure 2. The algorithm maintains a positive weight vector whose value at time t is denoted by w^t = (w_1^t, ..., w_N^t). If there is no prior knowledge about the ranking experts, we set all initial weights to be equal so that w_i^1 = 1/N. The weight vector w^t is used to combine the preference functions of the different experts to obtain the preference function PREF_t = Σ_{i=1}^N w_i^t R_i^t. This, in turn, is converted into an ordering ρ_t on the current set of elements X_t using the method described in Section 3. After receiving feedback F_t, the loss for each preference function Loss(R_i^t, F_t) is evaluated as in Eq. (2) and the weight vector w^t is updated using the multiplicative rule w_i^{t+1} = w_i^t · β^{Loss(R_i^t, F_t)} / Z_t, where β ∈ [0, 1] is a parameter, and Z_t is a normalization constant, chosen so that the weights sum to one after the update. Thus, based on the feedback, the weights of the ranking experts are adjusted so that experts producing preference functions with relatively large agreement with the feedback are promoted. \n\nWe will briefly sketch the theoretical rationale behind this algorithm. Freund and Schapire [4] prove general results about Hedge which can be applied directly to this loss function. 
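Concretely, the loss of Eq. (2) and the multiplicative update of step 6 in Figure 2 can be transcribed as follows; this Python sketch is illustrative, with names of our choosing:

```python
# Illustrative sketch of the loss of Eq. (2) and the multiplicative
# weight update of Figure 2, step 6.

def loss(pref, feedback):
    # Loss(R, F): average disagreement of R with the feedback pairs (u, v).
    return sum(1.0 - pref(u, v) for (u, v) in feedback) / len(feedback)


def hedge_update(weights, losses, beta):
    # w_i <- w_i * beta**Loss_i, renormalized so the weights sum to one.
    raw = [w * beta ** l for w, l in zip(weights, losses)]
    Z = sum(raw)  # the normalization constant Z_t
    return [r / Z for r in raw]
```

Note that an expert with zero loss keeps its raw weight while lossier experts are shrunk by a factor of up to β, so agreement with the feedback is what promotes an expert.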
Their results imply almost immediately a bound on the cumulative loss of the preference function PREF_t in terms of the loss of the best ranking expert, specifically \n\nΣ_{t=1}^T Loss(PREF_t, F_t) ≤ a_β min_i Σ_{t=1}^T Loss(R_i^t, F_t) + c_β ln N, \n\nwhere a_β = ln(1/β) / (1 − β) and c_β = 1 / (1 − β). Thus, if one of the ranking experts has low loss, then so will the combined preference function PREF_t. \n\nHowever, we are not interested in the loss of PREF_t (since it is not an ordering), but rather in the performance of the actual ordering ρ_t computed by the learning algorithm. Fortunately, the losses of these can be related using a kind of triangle inequality. It can be shown that, for any PREF, F and ρ: \n\nLoss(R_ρ, F) ≤ DISAGREE(ρ, PREF) / |F| + Loss(PREF, F),   (3) \n\nwhere, similar to Eq. (1), DISAGREE(ρ, PREF) = Σ_{u,v: ρ(u) > ρ(v)} (1 − PREF(u, v)). Not surprisingly, maximizing AGREE is equivalent to minimizing DISAGREE. \n\nSo, in sum, we use the greedy algorithm of Section 3 to minimize (approximately) the first term on the right-hand side of Eq. (3), and we use the learning algorithm Hedge to minimize the second term. \n\n5 Experimental results for metasearch \n\nWe now present some experiments in learning to combine the results of several WWW searches. We note that this problem exhibits many facets that require a general approach such as ours. For instance, approaches that learn to combine similarity scores are not applicable since the similarity scores of WWW search engines are often unavailable. \n\nWe chose to simulate the problem of learning a domain-specific search engine. As test cases we picked two fairly narrow classes of queries: retrieving the home pages of machine learning researchers (ML), and retrieving the home pages of universities (UNIV). 
\nWe obtained a listing of machine learning researchers, identified by name and affiliated institution, together with their home pages, and a similar list for universities, identified by name and (sometimes) geographical location. Each entry on a list was viewed as a query, with the associated URL the sole relevant document. \n\nWe then constructed a series of special-purpose \"search experts\" for each domain. These were implemented as query expansion methods which converted a name, affiliation pair (or a name, location pair) to a likely-seeming Altavista query. For example, one expert for the ML domain was to search for all the words in the person's name plus the words \"machine\" and \"learning,\" and to further enforce a strict requirement that the person's last name appear. Overall we defined 16 search experts for the ML domain and 22 for the UNIV domain. Each search expert returned the top 30 ranked documents. In the ML domain there were 210 searches for which at least one search expert returned the named home page; for the UNIV domain, there were 290 such searches. \n\nFor each query t, we first constructed the set X_t consisting of all documents returned by all of the expanded queries defined by the search experts. Next, each search expert i computed a preference function R_i^t. We chose these to be rank orderings defined with respect to an ordering function f_i^t in the natural way: we assigned a rank of f_i^t = 30 to the first listed document, f_i^t = 29 to the second listed document, and so on, finally assigning a rank of f_i^t = 0 to every document not retrieved by the expanded query associated with expert i. \n\nTo encode feedback, we considered two schemes. In the first we simulated complete relevance feedback; that is, for each query, we constructed feedback in which the sole relevant document was preferred to all other documents. 
In the second, we simulated the sort of feedback that could be collected from \"click data,\" i.e., from observing a user's interactions with a metasearch system. For each query, after presenting a ranked list of documents, we noted the rank of the one relevant document. We then constructed a feedback ranking in which the relevant document is preferred to all preceding documents. This would correspond to observing which link the user actually followed, and making the assumption that this link was preferred to previous links. \n\nTo evaluate the expected performance of a fully-trained system on novel queries in this domain, we employed leave-one-out testing. For each query q, we removed q from the query set, and recorded the rank of q after training (with β = 0.5) on the remaining queries. For click-data feedback, we recorded the median rank over 100 randomly chosen permutations of the training queries. \n\n                                 ML Domain                          University Domain \n                                 Top 1  Top 10  Top 30  Av. rank    Top 1  Top 10  Top 30  Av. rank \nLearned System (Full Feedback)    114    185     198      4.9        111    225     253      7.8 \nLearned System (\"Click Data\")      93    185     198      4.9         87    229     259      7.8 \nNaive                              89    165     176      7.7         79    157     191     14.4 \nBest (Top 1)                      119    170     184      6.7        112    221     247      8.2 \nBest (Top 10)                     114    182     190      5.3        111    223     249      8.0 \nBest (Top 30)                      97    181     194      5.6        111    223     249      8.0 \nBest (Av. Rank)                   114    182     190      5.3        111    223     249      8.0 \n\nTable 1: Comparison of learned systems and individual search queries \n\nWe then computed an approximation to average rank by artificially assigning a rank of 31 to every document that was either unranked, or ranked above rank 30. (The latter case is to be fair to the learned system, which is the only one for which a rank greater than 30 is possible.) 
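Under this convention, the approximate average rank can be computed as in the following illustrative Python sketch (the cutoff of 30 reflects each expert returning its top 30 documents; the function name is ours):

```python
# Illustrative sketch of the average-rank measure used in Table 1: any
# relevant document that is unranked (None here), or ranked worse than
# the cutoff of 30, is artificially assigned rank 31 (cutoff + 1).

def average_rank(ranks, cutoff=30):
    capped = [r if r is not None and r <= cutoff else cutoff + 1
              for r in ranks]
    return sum(capped) / len(capped)
```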
A summary of these results is given in Table 1, together with some additional data on \"top-k performance\": the number of times the correct homepage appears at rank no higher than k. In the table we give the top-k performance (for three values of k) and average rank for several ranking systems: the two learned systems, the naive query (the person or university's name), and the single search expert that performed best with respect to each performance measure. The table illustrates the robustness of the learned systems, which are nearly always competitive with the best expert for every performance measure listed; the only exception is that the system trained on click data trails the best expert in top-k performance for small values of k. It is also worth noting that in both domains, the naive query (simply the person or university's name) is not very effective. Even with the weaker click-data feedback, the learned system achieves a 36% decrease in average rank over the naive query in the ML domain, and a 46% decrease in the UNIV domain. \n\nTo summarize the experiments, on these domains, the learned system not only performs much better than naive search strategies; it also consistently performs at least as well as, and perhaps slightly better than, any single domain-specific search expert. Furthermore, the performance of the learned system is almost as good with the weaker \"click data\" training as with complete relevance feedback. \n\nReferences \n\n[1] D. S. Hochbaum (Ed.). Approximation Algorithms for NP-hard Problems. PWS Publishing Company, 1997. \n[2] O. Etzioni, S. Hanks, T. Jiang, R. M. Karp, O. Madani, and O. Waarts. Efficient information gathering on the internet. In 37th Ann. Symp. on Foundations of Computer Science, 1996. \n[3] P. C. Fishburn. The Theory of Social Choice. Princeton University Press, Princeton, NJ, 1973. \n[4] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997. \n[5] Z. Galil and N. Megiddo. Cyclic ordering is NP-complete. Theor. Comp. Sci., 5:179-182, 1977. \n[6] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-completeness. W. H. Freeman and Company, New York, 1979. \n[7] P. B. Kantor. Decision level data fusion for routing of documents in the TREC3 context: a best case analysis of worst case results. In TREC-3, 1994. \n[8] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4), 1988. \n[9] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994. \n[10] K. E. Lochbaum and L. A. Streeter. Comparing and combining the effectiveness of latent semantic indexing and the ordinary vector space model for information retrieval. Information Processing and Management, 25(6):665-676, 1989. \n", "award": [], "sourceid": 1431, "authors": [{"given_name": "William", "family_name": "Cohen", "institution": null}, {"given_name": "Robert", "family_name": "Schapire", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}]}