{"title": "Iterative ranking from pair-wise comparisons", "book": "Advances in Neural Information Processing Systems", "page_first": 2474, "page_last": 2482, "abstract": "The question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR\u2019s TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining a ranking, finding \u2018scores\u2019 for each object (e.g. a player\u2019s rating) is of interest for understanding the intensity of the preferences. In this paper, we propose a novel iterative rank aggregation algorithm for discovering scores for objects from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects, with edges present between two objects if they are compared; the scores turn out to be the stationary probability of this random walk. The algorithm is model independent. To establish the efficacy of our method, however, we consider the popular Bradley-Terry-Luce (BTL) model, in which each object has an associated score which determines the probabilistic outcomes of pairwise comparisons between objects. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. This, in essence, leads to order-optimal dependence on the number of samples required to learn the scores well by our algorithm. 
Indeed, the experimental evaluation shows that our (model independent) algorithm performs as well as the Maximum Likelihood Estimator of the BTL model and outperforms a recently proposed algorithm by Ammar and Shah [1].", "full_text": "Iterative Ranking from Pair-wise Comparisons\n\nSahand Negahban, Department of EECS, Massachusetts Institute of Technology, sahandn@mit.edu\nSewoong Oh, Department of IESE, University of Illinois at Urbana Champaign, swoh@illinois.edu\nDevavrat Shah, Department of EECS, Massachusetts Institute of Technology, devavrat@mit.edu\n\nAbstract\n\nThe question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR\u2019s TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining a ranking, finding \u2018scores\u2019 for each object (e.g. a player\u2019s rating) is of interest for understanding the intensity of the preferences.\nIn this paper, we propose a novel iterative rank aggregation algorithm for discovering scores for objects from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects, with edges present between two objects if they are compared; the scores turn out to be the stationary probability of this random walk. The algorithm is model independent. To establish the efficacy of our method, however, we consider the popular Bradley-Terry-Luce (BTL) model, in which each object has an associated score which determines the probabilistic outcomes of pairwise comparisons between objects. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. 
This, in essence, leads to order-optimal dependence on the number of samples required to learn the scores well by our algorithm. Indeed, the experimental evaluation shows that our (model independent) algorithm performs as well as the Maximum Likelihood Estimator of the BTL model and outperforms a recently proposed algorithm by Ammar and Shah [1].\n\n1 Introduction\n\nRank aggregation is an important task in a wide range of learning and social contexts arising in recommendation systems, information retrieval, and sports and competitions. Given n items, we wish to infer relevancy scores or an ordering on the items based on partial orderings provided through many (possibly contradictory) samples. Frequently, the available data that is presented to us is in the form of a comparison: player A defeats player B; book A is purchased when books A and B are displayed (a bigger collection of books implies multiple pairwise comparisons); movie A is liked more than movie B. From such partial preferences in the form of comparisons, we frequently wish to deduce not only the order of the underlying objects, but also the scores associated with the objects themselves, so as to deduce the intensity of the resulting preference order. For example, the Microsoft TrueSkill engine assigns scores to online gamers based on the outcomes of (pairwise) games between players. Indeed, it assumes that each player has an inherent \u201cskill\u201d and the outcomes of the games are used to learn these skill parameters, which in turn lead to scores associated with each player. 
In most such settings, similar model-based approaches are employed.\n\nIn this paper, we have set out with the following goal: develop an algorithm for the above stated problem which (a) is computationally simple, (b) works with available (comparison) data only and does not try to fit any model per se, (c) makes sense in general, and (d) if the data indeed obeys a reasonable model, then the algorithm should do as well as the best model-aware algorithm. The main result of this paper is an affirmative answer on all of these counts.\n\nRelated work. Most rating-based systems rely on users to provide explicit numeric scores for their interests. While these assumptions have led to a flurry of theoretical research on item recommendations based on matrix completion [2, 3, 4], it is widely believed that numeric scores provided by individual users are generally inconsistent. Furthermore, in a number of learning contexts as illustrated above, it is simply impractical to ask a user to provide explicit scores.\nThese observations have led to the need to develop methods that can aggregate such forms of ordering information into relevance ratings. In general, however, designing consistent aggregation methods can be challenging, due in part to possible contradictions between individual preferences. For example, if we consider items A, B, and C, one user might prefer A to B, while another prefers B to C, and a third user prefers C to A. Such problems have been well studied, as in the work by Condorcet [5]. In the celebrated work by Arrow [6], the existence of a rank aggregation algorithm with a reasonable set of properties (or axioms) was shown to be impossible.\nIn this paper, we are interested in a more restrictive setting: we have outcomes of pairwise comparisons between pairs of items, rather than a complete ordering as considered in [6]. 
Based on those pairwise comparisons, we want to obtain a ranking of items along with a score for each item indicating the intensity of the preference. One reasonable way to think about our setting is to imagine that there is a distribution over orderings or rankings or permutations of items, and every time a pair of items is compared, the outcome is generated as per this underlying distribution. With this, our question becomes even harder than the setting considered by Arrow [6] as, in that work, effectively the entire distribution over permutations was already known! Indeed, such hurdles have not stopped the scientific community as well as practical designers from designing such systems. Chess rating systems and the more recent MSR TrueSkill system are prime examples. Our work falls precisely into this realm: design algorithms that work well in practice, make sense in general, and, perhaps more importantly, have attractive theoretical properties under common comparative judgment models.\nWith this philosophy in mind, in recent work, Ammar and Shah [1] have presented an algorithm that tries to achieve the goal with which we have set out. However, their algorithm requires information about comparisons between all pairs, and for each pair it requires the exact pairwise comparison \u2018marginal\u2019 with respect to the underlying distribution over permutations. Indeed, in reality, not all pairs of items can typically be compared, and the number of times each pair is compared is also very small. Therefore, while an important step is taken in [1], it stops short of achieving the desired goal.\nIn somewhat related work by Braverman and Mossel [7], the authors present an algorithm that produces an ordering based on O(n log n) pair-wise comparisons on adaptively selected pairs. They assume that there is an underlying true ranking and one observes noisy comparison results. 
Each time a pair is queried, we are given the true ordering of the pair with probability 1/2 + \u03b3 for some \u03b3 > 0 which does not depend on the items being compared. One limitation of this model is that it does not capture the fact that in many applications, like chess matches, the outcome of a comparison very much depends on the opponents that are competing.\nSuch considerations have naturally led to the study of noise models induced by parametric distributions over permutations. An important and landmark model in this class is called the Bradley-Terry-Luce (BTL) model [8, 9], which is also known as the Multinomial Logit (MNL) model (cf. [10]). It has been the backbone of many practical system designs, including pricing in the airline industry [11]. Adler et al. [12] used such models to design adaptive algorithms that select the winner in a small number of rounds. Interestingly enough, the (near-)optimal performance of their adaptive algorithm for winner selection is matched by our non-adaptive (model independent) algorithm for assigning scores to obtain global rankings of all players.\n\nOur contributions. In this paper, we provide an iterative algorithm that takes the noisy comparison answers between a subset of all possible pairs of items as input and produces scores for each item as the output. The proposed algorithm has a nice intuitive explanation. Consider a graph with nodes/vertices corresponding to the items of interest (e.g. players). Construct a random walk on this graph where, at each time, the random walk is likely to go from vertex i to vertex j if items i and j were ever compared; and if so, the likelihood of going from i to j depends on how often i lost to j. That is, the random walk is more likely to move to a neighbor who has more \u201cwins\u201d. How frequently this walk visits a particular node in the long run, or equivalently the stationary distribution, is the score of the corresponding item. 
Thus, effectively this algorithm captures the preference of the given item versus all of the others, not just immediate neighbors: the global effect induced by transitivity of comparisons is captured through the stationary distribution.\nSuch an interpretation of the stationary distribution of a Markov chain or a random walk has been an effective measure of the relative importance of a node in a wide class of graph problems, popularly known as network centrality [13]. Notable examples of such network centralities include the random surfer model on the web graph underlying PageRank [14], which computes the relative importance of a web page, and the model of a random crawler in a peer-to-peer file-sharing network used to assign a trust value to each peer in EigenTrust [15].\nThe computation of the stationary distribution of the Markov chain boils down to \u2018power iteration\u2019 using the transition matrix, lending itself to a nice iterative algorithm. Thus, in effect, we have produced an algorithm that (a) is computationally simple and iterative, (b) is model independent and works with the data only, and (c) intuitively makes sense. To establish rigorous properties of the algorithm, we analyze its performance under the BTL model described in Section 2.1.\nFormally, we establish the following result: given n items, when comparison results between randomly chosen O(n poly(log n)) pairs of them are produced as per an (unknown) underlying BTL model, the stationary distribution produced by our algorithm (asymptotically) matches the true score (induced by the BTL model). It should be noted that \u2126(n log n) is a necessary number of (random) comparisons for any algorithm to even produce a consistent ranking (due to the connectivity threshold of the underlying random graph). In that sense, we will see that, up to a poly(log n) factor, our algorithm is optimal in terms of sample complexity. 
Indeed, the empirical experimental study shows that the performance of our algorithm is identical to that of the ML estimator for the BTL model. Furthermore, it handsomely outperforms other popular choices, including the algorithm of [1].\nSome remarks about our analytic technique. Our analysis boils down to studying the induced stationary distribution of the random walk or Markov chain corresponding to the algorithm. As in most such scenarios, the only hope of obtaining meaningful results for such a \u2018random noisy\u2019 Markov chain is to relate it to the stationary distribution of a known Markov chain. Through recent concentration of measure results for random matrices and a comparison technique using Dirichlet forms for characterizing the spectrum of reversible/self-adjoint operators, along with the known expansion property of the random graph, we obtain the eventual result. Indeed, it is the consequence of such powerful results that leads to the near-optimal analytic results.\nThe remainder of this paper is organized as follows. In Section 2 we will concretely introduce our model, the problem, and our algorithm. In Section 3 we will discuss our main theoretical results. The proofs will be presented in Section 4.\n\nNotation. We use C, C', etc. to denote generic numerical constants. We use A^T to denote the transpose of a matrix. The Euclidean norm of a vector is denoted by ||x|| = (\u03a3_i x_i^2)^{1/2}, and the operator norm of a linear operator is denoted by ||A||_2 = max_x x^T A x / x^T x. Also define [n] = {1, 2, . . . , n} to be the set of all integers from 1 to n.\n\n2 Model, Problem Statement, and Algorithm\n\nWe now present a concrete exposition of our underlying probabilistic model and our problem. We then present our explicit random walk approach to ranking.\n\n2.1 Bradley-Terry-Luce model for comparative judgment\n\nIn this section we discuss our model of comparisons between various items. 
As alluded to above, for the purpose of establishing analytic properties of the algorithm, we will assume comparisons are governed by the BTL model of pairwise comparisons. However, the algorithm itself operates with data generated in an arbitrary manner.\nTo begin with, there are n items of interest, represented as [n] = {1, . . . , n}. We shall assume that for each item i \u2208 [n] there is an associated weight score wi \u2208 R+ (i.e. it is a strictly positive real number). Hence, we may consider the vector w \u2208 R^n_+ to be the associated weight vector of all items. Given a pair of items i and j, we will let Y^l_ij be 1 if j is preferred over i and 0 otherwise during the lth competition, for 1 \u2264 l \u2264 k, where k is the total number of competitions for the pair. Under the BTL model we assume that\n\nP(Y^l_ij = 1) = wj / (wi + wj). (1)\n\nFurthermore, conditioned on the score vector w, we assume that the variables Y^l_ij are independent for all i, j, and l. We further assume that, given some item i, we will compare item j to i with probability d/n. In our setting d will be poly-logarithmic in n. This model is a natural one to consider because, over a population of individuals, the comparisons cannot be adaptively selected. A more realistic model might incorporate selecting various items with different distributions: for example, the Netflix dataset demonstrates skews in the sampling distribution for different films [16]. Thus, given this model, our goal is to recover the weight vector w from such pairwise comparisons. We now discuss our method for computing the scores wi.\n\n2.2 Random walk approach to ranking\n\nIn our setting, we will assume that aij represents the fraction of times object j has been preferred to object i, for example the fraction of times chess player j has defeated player i. Given the notation above, we have that a_ij = (1/k) \u03a3_{l=1}^k Y^l_ij. 
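The sampling model above is straightforward to simulate. The following sketch is our own illustration (the function name simulate_btl and the numpy dependency are our choices, not the paper's): it draws each pair with probability d/n, flips k BTL coins per chosen pair, and records the empirical win fractions a_ij.

```python
import numpy as np

def simulate_btl(w, d, k, rng=None):
    """Sample pairwise comparisons under the BTL model of equation (1).

    Each pair (i, j) is compared with probability d/n; a chosen pair is
    compared k times, and j beats i with probability w[j]/(w[i] + w[j]).
    Returns a with a[i, j] = fraction of the pair's comparisons won by j
    (0 if the pair was never compared).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(w)
    a = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < d / n:              # pair (i, j) is chosen
                wins_j = rng.binomial(k, w[j] / (w[i] + w[j]))
                a[i, j] = wins_j / k              # fraction of times j beat i
                a[j, i] = 1.0 - a[i, j]           # fraction of times i beat j
    return a
```

For d = n every pair is compared, and by the Strong Law of Large Numbers a[i, j] approaches w[j]/(w[i] + w[j]) as k grows.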
Consider a random walk on a weighted directed graph G = ([n], E, A), where a pair (i, j) \u2208 E if and only if the pair has been compared. The edge weights are defined by the outcomes of the comparisons: Aij = aij/(aij + aji) and Aji = aji/(aij + aji). We let Aij = 0 if the pair has not been compared. Note that, by the Strong Law of Large Numbers, as the number of comparisons k \u2192 \u221e the quantity Aij converges to wj/(wi + wj) almost surely.\nA random walk can be represented by a time-independent transition matrix P, where Pij = P(Xt+1 = j | Xt = i). By definition, the entries of a transition matrix are non-negative and satisfy \u03a3_j Pij = 1. One way to define a valid transition matrix of a random walk on G is to scale all the edge weights by 1/dmax, where we define dmax as the maximum out-degree of a node. This ensures that each row-sum is at most one. Finally, to ensure that each row-sum is exactly one, we add a self-loop to each node. More concretely,\n\nPij = (1/dmax) Aij if i \u2260 j, and Pii = 1 \u2212 (1/dmax) \u03a3_{k\u2260i} Aik. (2)\n\nThe choice to construct our random walk as above is not arbitrary. In an ideal setting with infinite samples (k \u2192 \u221e) the transition matrix P would define a reversible Markov chain. Recall that a Markov chain is reversible if it satisfies the detailed balance equation: there exists v \u2208 R^n_+ such that viPij = vjPji for all i, j; and in that case, \u03c0 \u2208 R^n_+ defined as \u03c0i = vi/(\u03a3_j vj) is its unique stationary distribution. In the ideal setting (say k \u2192 \u221e), we will have Pij \u2261 \u02dcPij = (1/dmax) wj/(wi + wj). That is, the random walk will move from state i to state j with probability proportional to the chance that item j is preferred to item i. In such a setting, it is clear that v = w satisfies the reversibility conditions. 
Therefore, under these ideal conditions, it immediately follows that the vector w/(\u03a3_i wi) acts as a valid stationary distribution for the Markov chain defined by \u02dcP, the ideal matrix. Hence, as long as the graph G is connected and at least one node has a self-loop, we are guaranteed that our graph has a unique stationary distribution proportional to w. If the Markov chain is reversible then we may apply the spectral analysis of self-adjoint operators, which is crucial in the analysis when we repeatedly apply the operator \u02dcP.\nIn our setting, the matrix P is a noisy version (due to finite sample error) of the ideal matrix \u02dcP discussed above. Therefore, it naturally suggests the following algorithm as a surrogate: we estimate the probability distribution obtained by applying the matrix P repeatedly, starting from any initial condition. Precisely, let pt(i) = P(Xt = i) denote the distribution of the random walk at time t, with p0 = (p0(i)) \u2208 R^n_+ an arbitrary starting distribution on [n]. Then,\n\np^T_{t+1} = p^T_t P. (3)\n\nRegardless of the starting distribution, when the transition matrix has a unique top eigenvalue, the random walk always converges to a unique distribution: the stationary distribution \u03c0 = lim_{t\u2192\u221e} pt. In linear algebra terms, this stationary distribution \u03c0 is the top left eigenvector of P, which makes computing \u03c0 a simple eigenvector computation. 
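The two steps above, forming the transition matrix of (2) and iterating (3) to its fixed point, can be sketched in a few lines of Python (an illustrative sketch of our own; the function names and the numpy dependency are our choices, not part of the paper):

```python
import numpy as np

def transition_matrix(A):
    """Equation (2): scale the pairwise win fractions A by 1/d_max and
    put the leftover mass of each row on a self-loop, so that every row
    of P sums to one.  A[i, j] is the fraction of comparisons of the
    pair (i, j) won by j (0 if the pair was never compared)."""
    n = A.shape[0]
    d_max = int(max((A[i] > 0).sum() for i in range(n)))  # max out-degree
    P = A / d_max
    np.fill_diagonal(P, 0.0)
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))  # self-loops fix the row sums
    return P

def stationary_distribution(P, tol=1e-12, max_iter=100000):
    """Equation (3): power iteration p^T <- p^T P from the uniform
    distribution until the update stops changing; the limit is the
    stationary distribution pi, i.e. the top left eigenvector of P."""
    p = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(max_iter):
        p_next = p @ P
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p_next
```

In the ideal setting described in the text, where Aij = wj/(wi + wj) on a connected graph, the returned distribution is proportional to w.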
Formally, we state the algorithm, which assigns numerical scores to each node, and which we shall call Rank Centrality:\n\nRank Centrality\nInput: G = ([n], E, A)\nOutput: rank {\u03c0(i)}_{i\u2208[n]}\n1: Compute the transition matrix P according to (2);\n2: Compute the stationary distribution \u03c0.\n\nThe stationary distribution of the random walk is a fixed point of the following equation:\n\n\u03c0(i) = ( \u03a3_j \u03c0(j) Aji ) / ( \u03a3_l Ail ).\n\nThis suggests an alternative intuitive justification: an object receives a high rank if it has been preferred to other high-ranking objects or if it has been preferred to many objects.\nOne key question remains: does P have a well-defined stationary distribution? As discussed earlier, when G is connected, the idealized transition matrix \u02dcP has a stationary distribution with the desired properties. But due to noise, P may not be reversible, and the arguments for the ideal \u02dcP do not apply to our setting. Indeed, it is the finite sample error that governs the noise. Therefore, by analyzing the effect of this noise (and hence the finite samples), it is likely that we can obtain an error bound on the performance of the algorithm. As an important contribution of this work, we will show that even the iterations (cf. (3)) induced by P are close enough to those induced by \u02dcP. Subsequently, we can guarantee that the iterative algorithm will converge to a solution that is close to the ideal stationary distribution.\n\n3 Main Results\n\nOur main result, Theorem 1, provides an upper bound on the error in estimating the stationary distribution under the observation model presented above. The results demonstrate that even with random sampling we can estimate the underlying score with high probability and with good accuracy. The bounds are presented as the rescaled Euclidean norm between our estimate \u03c0 and the stationary distribution \u02dc\u03c0 of the ideal chain \u02dcP. 
This error metric provides us with a means to quantify the relative certainty in guessing whether one item is preferred over another. Furthermore, producing such scores is ubiquitous [17], as they may also be used to calculate the desired rankings. After presenting our main theoretical result, we will provide simulations demonstrating the empirical performance of our algorithm in different contexts.\n\n3.1 Error bound in stationary distribution recovery via Rank Centrality\n\nThe theorem below presents our main recovery result under the sampling assumptions described above. It is worth noting that, while the result presented below is for the specific sampling model described above, it can be extended to general graphs as long as the spectral gap of the corresponding Markov chain is well behaved. We will discuss this point further in the sequel.\nTheorem 1. Assume that, among n items, each pair is chosen with probability d/n and for each chosen pair we collect the outcomes of k comparisons according to the BTL model. Then, there exist positive universal constants C, C', and C'' such that when d \u2265 C (log n)^2 and kd \u2265 C b^5 log n, the following bound on the error rate holds with probability at least 1 \u2212 C''/n^3:\n\n||\u03c0 \u2212 \u02dc\u03c0|| / ||\u02dc\u03c0|| \u2264 C' b^3 \u221a(log n / (kd)),\n\nwhere \u02dc\u03c0(i) = wi/(\u03a3_l wl) and b \u2261 max_{i,j} wi/wj.\n\nRemarks. Some remarks are in order. First, the above result implies that as long as we choose d = \u0398(log^2 n) and k = \u03c9(1) (i.e. large enough, say k = \u0398(log n)), the error goes to 0 as n increases (with k = \u0398(log n), it goes down at rate 1/\u221a(log n)). 
Since we are sampling each of the (n choose 2) pairs with probability d/n and then sampling each chosen pair k times, we obtain O(n log^3 n) comparisons in total (with k = \u0398(log n)). Due to classical results on Erdos-Renyi graphs, the induced graph G is connected with high probability only when the total number of pairs sampled scales as \u2126(n log n); we need at least that many comparisons. Thus, our result can be sub-optimal only up to a factor of log^2 n (log^{1+\u03b5} n if k = log^\u03b5 n).\nSecond, the parameter b should be treated as a constant. It is the dynamic range over which we are trying to resolve the uncertainty between scores. If b were scaling with n, then it would be really easy to differentiate scores of items at the two opposite ends of the dynamic range; in that case one could focus on differentiating scores of items whose parameter values are near-by. Therefore, the interesting and challenging regime is where b is a constant, not scaling with n.\nFinally, observe the interesting consequence that, under the conditions on d, since the induced distribution \u03c0 is close to \u02dc\u03c0, it implies connectivity of G. Thus, the analysis of our algorithm provides an alternative proof of connectivity of an Erdos-Renyi graph (of course, by using heavy machinery!).\n\n3.2 Experimental Results\n\nUnder the BTL model, define an error metric of an estimated ordering \u03c3 as the weighted sum of pairs (i, j) whose ordering is incorrect:\n\nDw(\u03c3) = { (1/(2n||w||^2)) \u03a3_{i<j} (wi \u2212 wj)^2 I((wi \u2212 wj)(\u03c3i \u2212 \u03c3j) > 0) }^{1/2}
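For concreteness, the error metric Dw can be computed directly. The sketch below is our own illustration (the function name dw_error and the numpy dependency are our choices); it assumes, as the display suggests, that the sum runs over pairs i < j and that sigma[i] denotes the position of item i in the estimated ordering, with smaller positions for higher-ranked items.

```python
import numpy as np

def dw_error(w, sigma):
    """Weighted pairwise-disagreement error D_w(sigma): a pair (i, j)
    contributes (w[i] - w[j])^2 when the estimated positions sigma
    order i and j against their true scores w (a higher score should
    receive an earlier, i.e. smaller, position)."""
    n = len(w)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if (w[i] - w[j]) * (sigma[i] - sigma[j]) > 0:
                total += (w[i] - w[j]) ** 2
    return np.sqrt(total / (2.0 * n * np.dot(w, w)))
```

A perfectly consistent ordering incurs zero error, while a fully reversed ordering incurs the maximum error for the given score vector w.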