{"title": "Minimax-optimal Inference from Partial Rankings", "book": "Advances in Neural Information Processing Systems", "page_first": 1475, "page_last": 1483, "abstract": "This paper studies the problem of rank aggregation under the Plackett-Luce model. The goal is to infer a global ranking and related scores of the items, based on partial rankings provided by multiple users over multiple subsets of items. A question of particular interest is how to optimally assign items to users for ranking and how many item assignments are needed to achieve a target estimation error. Without any assumptions on how the items are assigned to users, we derive an oracle lower bound and the Cram\\'er-Rao lower bound of the estimation error. We prove an upper bound on the estimation error achieved by the maximum likelihood estimator, and show that both the upper bound and the Cram\\'er-Rao lower bound inversely depend on the spectral gap of the Laplacian of an appropriately defined comparison graph. Since random comparison graphs are known to have large spectral gaps, this suggests the use of random assignments when we have the control. Precisely, the matching oracle lower bound and the upper bound on the estimation error imply that the maximum likelihood estimator together with a random assignment is minimax-optimal up to a logarithmic factor. We further analyze a popular rank-breaking scheme that decompose partial rankings into pairwise comparisons. 
We show that even if one applies the mismatched maximum likelihood estimator that assumes independence (on pairwise comparisons that are now dependent due to rank-breaking), minimax optimal performance is still achieved up to a logarithmic factor.", "full_text": "Minimax-optimal Inference from Partial Rankings\n\nBruce Hajek\nUIUC\nb-hajek@illinois.edu\n\nSewoong Oh\nUIUC\nswoh@illinois.edu\n\nJiaming Xu\nUIUC\njxu18@illinois.edu\n\nAbstract\n\nThis paper studies the problem of rank aggregation under the Plackett-Luce model. The goal is to infer a global ranking and related scores of the items, based on partial rankings provided by multiple users over multiple subsets of items. A question of particular interest is how to optimally assign items to users for ranking and how many item assignments are needed to achieve a target estimation error. Without any assumptions on how the items are assigned to users, we derive an oracle lower bound and the Cramér-Rao lower bound of the estimation error. We prove an upper bound on the estimation error achieved by the maximum likelihood estimator, and show that both the upper bound and the Cramér-Rao lower bound inversely depend on the spectral gap of the Laplacian of an appropriately defined comparison graph. Since random comparison graphs are known to have large spectral gaps, this suggests the use of random assignments when we have the control. Precisely, the matching oracle lower bound and the upper bound on the estimation error imply that the maximum likelihood estimator together with a random assignment is minimax-optimal up to a logarithmic factor. We further analyze a popular rank-breaking scheme that decomposes partial rankings into pairwise comparisons. 
We show that even if one applies the mismatched maximum likelihood estimator that assumes independence (on pairwise comparisons that are now dependent due to rank-breaking), minimax optimal performance is still achieved up to a logarithmic factor.\n\n1 Introduction\n\nGiven a set of individual preferences from multiple decision makers or judges, we address the problem of computing a consensus ranking that best represents the preference of the population collectively. This problem, known as rank aggregation, has received much attention across various disciplines including statistics, psychology, sociology, and computer science, and has found numerous applications including elections, sports, information retrieval, transportation, and marketing [1, 2, 3, 4]. While consistency of various rank aggregation algorithms has been studied when a growing number of sampled partial preferences is observed over a fixed number of items [5, 6], little is known in the high-dimensional setting where the number of items and the number of observed partial rankings scale simultaneously, which arises in many modern datasets. Inference becomes even more challenging when each individual provides limited information. For example, in the well-known Netflix challenge dataset, 480,189 users submitted ratings on 17,770 movies, but on average a user rated only 209 movies. To pursue a rigorous study in the high-dimensional setting, we assume that users provide partial rankings over subsets of items generated according to the popular Plackett-Luce (PL) model [7] from some hidden preference vector over all the items, and we are interested in estimating the preference vector (see Definition 1).\nIntuitively, inference becomes harder when few users are available, or each user is assigned few items to rank, meaning fewer observations. The first goal of this paper is to quantify the number of item assignments needed to achieve a target estimation error. 
Secondly, in many practical scenarios such as crowdsourcing, the systems have control over the item assignment. For such systems, a natural question of interest is how to optimally assign the items for a given budget on the total number of item assignments. Thirdly, a common approach in practice to deal with partial rankings is to break them into pairwise comparisons and apply the state-of-the-art rank aggregation methods specialized for pairwise comparisons [8, 9]. It is of both theoretical and practical interest to understand how much the performance degrades when rank breaking schemes are used.\n\nNotation. For any set S, let |S| denote its cardinality. Let s_1^n = {s_1, . . . , s_n} denote a set with n elements. For any positive integer N, let [N] = {1, . . . , N}. We use standard big O notations, e.g., for any sequences {a_n} and {b_n}, a_n = Θ(b_n) if there is an absolute constant C > 0 such that 1/C ≤ a_n/b_n ≤ C. For a partial ranking σ over S, i.e., σ is a mapping from [|S|] to S, let σ^{−1} denote the inverse mapping. All logarithms are natural unless the base is explicitly specified. We say a sequence of events {A_n} holds with high probability if P[A_n] ≥ 1 − c_1 n^{−c_2} for two positive constants c_1, c_2.\n\n1.1 Problem setup\n\nWe describe our model in the context of recommender systems, but it is applicable to other systems with partial rankings. Consider a recommender system with m users indexed by [m] and n items indexed by [n]. For each item i ∈ [n], there is a hidden parameter θ*_i measuring the underlying preference. Each user j, independent of everyone else, randomly generates a partial ranking σ_j over a subset of items S_j ⊆ [n] according to the PL model with the underlying preference vector θ* = (θ*_1, . . . , θ*_n).\nDefinition 1 (PL model). A partial ranking σ : [|S|] → S is generated from {θ*_i, i ∈ S} under the PL model in two steps: (1) independently assign each item i ∈ S an unobserved value X_i, exponentially distributed with mean e^{−θ*_i}; (2) select σ so that X_{σ(1)} ≤ X_{σ(2)} ≤ ··· ≤ X_{σ(|S|)}.\nThe PL model can be equivalently described in the following sequential manner. To generate a partial ranking σ, first select σ(1) in S randomly from the distribution e^{θ*_i}/(Σ_{i′∈S} e^{θ*_{i′}}); secondly, select σ(2) in S \ {σ(1)} with the probability distribution e^{θ*_i}/(Σ_{i′∈S\{σ(1)}} e^{θ*_{i′}}); continue the process in the same fashion until all the items in S are assigned. The PL model is a special case of the following class of models.\nDefinition 2 (Thurstone model, or random utility model (RUM)). A partial ranking σ : [|S|] → S is generated from {θ*_i, i ∈ S} under the Thurstone model for a given CDF F in two steps: (1) independently assign each item i ∈ S an unobserved utility U_i, with CDF F(c − θ*_i); (2) select σ so that U_{σ(1)} ≥ U_{σ(2)} ≥ ··· ≥ U_{σ(|S|)}.\nTo recover the PL model from the Thurstone model, take F to be the CDF for the standard Gumbel distribution: F(c) = e^{−e^{−c}}. Equivalently, take F to be the CDF of −log(X) such that X has the exponential distribution with mean one. For this choice of F, the utility U_i having CDF F(c − θ*_i) is equivalent to U_i = −log(X_i) such that X_i is exponentially distributed with mean e^{−θ*_i}. 
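The two equivalent generative views of Definition 1 (an exponential race, or a sequential softmax choice) can be sketched in a few lines; this is an illustrative sketch, and the function and variable names are ours, not the paper's:

```python
import math
import random

def sample_pl_partial_ranking(theta, subset, rng):
    """Sample a partial ranking over `subset` under the PL model
    (Definition 1): item i draws an exponential arrival time X_i with
    mean exp(-theta[i]), i.e. rate exp(theta[i]); the ranking lists
    items in increasing order of X_i."""
    arrivals = [(rng.expovariate(math.exp(theta[i])), i) for i in subset]
    return [i for _, i in sorted(arrivals)]

rng = random.Random(0)
theta = {0: 2.0, 1: 0.0, 2: -2.0}
# Sequential view: P(item 0 ranked first) = e^2/(e^2 + e^0 + e^-2) ~ 0.867.
wins = sum(sample_pl_partial_ranking(theta, [0, 1, 2], rng)[0] == 0
           for _ in range(2000))
```

With 2000 draws, `wins` concentrates near 0.867 · 2000 ≈ 1734, which matches the sequential description above.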
The corresponding partial permutation σ is such that X_{σ(1)} ≤ X_{σ(2)} ≤ ··· ≤ X_{σ(|S|)}, or equivalently, U_{σ(1)} ≥ U_{σ(2)} ≥ ··· ≥ U_{σ(|S|)}. (Note the opposite ordering of the X's and U's.)\nGiven the observation of all partial rankings {σ_j}_{j∈[m]} over the subsets {S_j}_{j∈[m]} of items, the task is to infer the underlying preference vector θ*. For the PL model, and more generally for the Thurstone model, we see that θ* and θ* + a1 for any a ∈ R are statistically indistinguishable, where 1 is an all-ones vector. Indeed, under our model, the preference vector θ* is the equivalence class [θ*] = {θ : ∃a ∈ R, θ = θ* + a1}. To get a unique representation of the equivalence class, we assume Σ_{i=1}^n θ*_i = 0. Then the space of all possible preference vectors is given by Θ = {θ ∈ R^n : Σ_{i=1}^n θ_i = 0}. Moreover, if θ*_i − θ*_{i′} becomes arbitrarily large for all i′ ≠ i, then with high probability item i is ranked higher than any other item i′ and there is no way to estimate θ_i to any accuracy. Therefore, we further put the constraint that θ* ∈ [−b, b]^n for some b ∈ R and define Θ_b = Θ ∩ [−b, b]^n. The parameter b characterizes the dynamic range of the underlying preference. In this paper, we assume b is a fixed constant. As observed in [10], if b were scaled with n, then it would be easy to rank items with high preference versus items with low preference, and one can focus on ranking items with close preference.\n\nWe denote the number of items assigned to user j by k_j := |S_j| and the average number of assigned items per user by k = (1/m) Σ_{j=1}^m k_j; parameter k may scale with n in this paper. We consider two scenarios for generating the subsets {S_j}_{j=1}^m: the random item assignment case where the S_j's are chosen independently and uniformly at random from all possible subsets of [n] with sizes given by the k_j's, and the deterministic item assignment case where the S_j's are chosen deterministically.\nOur main results depend on the structure of a weighted undirected graph G defined as follows.\nDefinition 3 (Comparison graph G). Each item i ∈ [n] corresponds to a vertex i ∈ [n]. For any pair of vertices i, i′, there is a weighted edge between them if there exists a user who ranks both items i and i′; the weight equals Σ_{j: i,i′∈S_j} 1/(k_j − 1).\nLet A denote the weighted adjacency matrix of G. Let d_i = Σ_j A_{ij}, so d_i is the number of users who rank item i, and without loss of generality assume d_1 ≤ d_2 ≤ ··· ≤ d_n. Let D denote the n × n diagonal matrix formed by {d_i, i ∈ [n]} and define the graph Laplacian L as L = D − A. Observe that L is positive semi-definite and the smallest eigenvalue of L is zero, with the corresponding eigenvector given by the normalized all-one vector. Let 0 = λ_1 ≤ λ_2 ≤ ··· ≤ λ_n denote the eigenvalues of L in ascending order.\n\nSummary of main results. Theorem 1 gives a lower bound for the estimation error that scales as Σ_{i=2}^n 1/d_i. The lower bound is derived based on a genie-argument and holds for both the PL model and the more general Thurstone model. Theorem 2 shows that the Cramér-Rao lower bound scales as Σ_{i=2}^n 1/λ_i. 
Theorem 3 gives an upper bound for the squared error of the maximum likelihood (ML) estimator that scales as mk log n / (λ_2 − √(λ_n log n))^2. Under the full rank breaking scheme that decomposes a k-way comparison into (k choose 2) pairwise comparisons, Theorem 4 gives an upper bound that scales as mk log n / λ_2^2. If the comparison graph is an expander graph, i.e., λ_2 ∼ λ_n and mk = Ω(n log n), our lower and upper bounds match up to a log n factor. This follows from the fact that Σ_i λ_i = Σ_i d_i = mk, and for expanders mk = Θ(nλ_2). Since the Erdős-Rényi random graph is an expander graph with high probability for average degree larger than log n, when the system is allowed to choose the item assignment, we propose a random assignment scheme under which the items for each user are chosen independently and uniformly at random. It follows from Theorem 1 that mk = Ω(n) is necessary for any item assignment scheme to reliably infer the underlying preference vector, while our upper bounds imply that mk = Ω(n log n) is sufficient with the random assignment scheme and can be achieved by the ML estimator, by the full rank breaking, or by the independence-preserving breaking that decomposes a k-way comparison into ⌊k/2⌋ non-intersecting pairwise comparisons, proving that rank breaking schemes are also nearly optimal.\n\n1.2 Related Work\n\nThere is a vast literature on rank aggregation, and here we can only hope to cover the fraction of it that we see as most relevant. In this paper, we study a statistical learning approach, assuming the observed ranking data is generated from a probabilistic model. Various probabilistic models on permutations have been studied in the ranking literature (see, e.g., [11, 12]). A nonparametric approach to modeling distributions over rankings using sparse representations has been studied in [13]. Most of the parametric models fall into one of the following three categories: noisy comparison model, distance-based model, and random utility model. 
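As a concrete illustration of Definition 3, the comparison graph, the degrees d_i, and the Laplacian L = D − A can be assembled directly from the assigned subsets. This is a minimal sketch using only the facts stated above; the function names are ours:

```python
from itertools import combinations

def comparison_graph(subsets, n):
    """Build the weighted comparison graph of Definition 3.

    subsets: one subset S_j of [n] per user; each pair of items in S_j
    gets edge weight 1/(|S_j| - 1).  Returns the dense adjacency matrix
    A, the degrees d, and the Laplacian L = D - A as lists of rows."""
    A = [[0.0] * n for _ in range(n)]
    for S in subsets:
        w = 1.0 / (len(S) - 1)
        for i, ip in combinations(S, 2):
            A[i][ip] += w
            A[ip][i] += w
    d = [sum(row) for row in A]   # d_i = number of users who rank item i
    L = [[(d[i] if i == ip else 0.0) - A[i][ip] for ip in range(n)]
         for i in range(n)]
    return A, d, L

subsets = [[0, 1, 2], [1, 2, 3], [0, 3]]
A, d, L = comparison_graph(subsets, 4)
```

Two facts used repeatedly in the text can be checked on this toy instance: each user in S_j spreads a total weight (k_j − 1) · 1/(k_j − 1) = 1 over the edges at each of their items, so d_i counts the users ranking item i; and every row of L sums to zero, i.e., L·1 = 0 and λ_1 = 0.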
The noisy comparison model assumes that there is an underlying true ranking over n items, and each user independently gives a pairwise comparison which agrees with the true ranking with probability p > 1/2. It is shown in [14] that O(n log n) pairwise comparisons, when chosen adaptively, are sufficient for accurately estimating the true ranking.\nThe Mallows model is a distance-based model, which randomly generates a full ranking σ over n items from some underlying true ranking σ* with probability proportional to e^{−βd(σ,σ*)}, where β is a fixed spread parameter and d(·,·) can be any permutation distance such as the Kemeny distance. It is shown in [14] that the true ranking σ* can be estimated accurately given O(log n) independent full rankings generated under the Mallows model with the Kemeny distance.\nIn this paper, we study a special case of random utility models (RUMs) known as the Plackett-Luce (PL) model. It is shown in [7] that the likelihood function under the PL model is concave and the ML estimator can be efficiently found using a minorization-maximization (MM) algorithm, which is a variation of the general EM algorithm. We give an upper bound on the error achieved by such an ML estimator, and prove that this is matched by a lower bound. The lower bound is derived by comparing to an oracle estimator which observes the random utilities of the RUM directly. The Bradley-Terry (BT) model is the special case of the PL model where we only observe pairwise comparisons. For the BT model, [10] proposes the RankCentrality algorithm based on the stationary distribution of a random walk over a suitably defined comparison graph and shows that Ω(n poly(log n)) randomly chosen pairwise comparisons are sufficient to accurately estimate the underlying parameters; one corollary of our result is a matching performance guarantee for the ML estimator under the BT model. 
More recently, [15] analyzed various algorithms including RankCentrality and the ML estimator under a general, not necessarily uniform, sampling scheme.\nIn a PL model with priors, MAP inference becomes computationally challenging. Instead, an efficient message-passing algorithm is proposed in [16] to approximate the MAP estimate. For a more general family of random utility models, Soufiani et al. in [17, 18] give a sufficient condition under which the likelihood function is concave, and propose a Monte-Carlo EM algorithm to compute the ML estimator for general RUMs. More recently in [8, 9], the generalized method of moments together with the rank-breaking is applied to estimate the parameters of the PL model and the random utility model when the data consists of full rankings.\n\n2 Main results\n\nIn this section, we present our theoretical findings and numerical experiments.\n\n2.1 Oracle lower bound\n\nIn this section, we derive an oracle lower bound for any estimator of θ*. The lower bound is constructed by considering an oracle who reveals all the hidden scores in the PL model as side information, and it holds for the general Thurstone models.\nTheorem 1. Suppose σ_1^m are generated from the Thurstone model for some CDF F. For any estimator θ̂,\nsup_{θ*∈Θ_b} E[‖θ̂ − θ*‖_2^2] ≥ (1 / (2I(μ) + 2π^2/(b^2(d_1+d_2)))) Σ_{i=2}^n 1/d_i ≥ (1 / (2I(μ) + 2π^2/(b^2(d_1+d_2)))) (n−1)^2/(mk),\nwhere μ is the probability density function of F, i.e., μ = F′, and I(μ) = ∫ (μ′(x))^2/μ(x) dx; the second inequality follows from Jensen's inequality. For the PL model, which is a special case of the Thurstone models with F being the standard Gumbel distribution, I(μ) = 1.\nTheorem 1 shows that the oracle lower bound scales as Σ_{i=2}^n 1/d_i. We remark that the summation begins with 1/d_2. This makes some sense, in view of the fact that the parameters θ*_i need to sum to zero. For example, if d_1 is a moderate value and all the other d_i's are very large, then with the hidden scores as side information, we may be able to accurately estimate θ*_i for i ≠ 1 and therefore accurately estimate θ*_1. The oracle lower bound also depends on the dynamic range b and is tight for b = 0, because a trivial estimator that always outputs the all-zero vector achieves the lower bound.\n\nComparison to previous work. Theorem 1 implies that mk = Ω(n) is necessary for any item assignment scheme to reliably infer θ*, i.e., ensuring E[‖θ̂ − θ*‖_2^2] = o(n). It provides the first converse result on inferring the parameter vector under the general Thurstone models to our knowledge. For the Bradley-Terry model, which is a special case of the PL model where all the partial rankings reduce to pairwise comparisons, i.e., k = 2, it is shown in [10] that m = Ω(n) is necessary for the random item assignment scheme to achieve reliable inference based on an information-theoretic argument. In contrast, our converse result is derived based on the Bayesian Cramér-Rao lower bound [19], applies to the general models with any item assignment, and is considerably tighter if the d_i's are of different orders.\n\n2.2 Cramér-Rao lower bound\n\nIn this section, we derive the Cramér-Rao lower bound for any unbiased estimator of θ*.\nTheorem 2. Let k_max = max_{j∈[m]} k_j and let U denote the set of all unbiased estimators of θ*, i.e., θ̂ ∈ U if and only if E[θ̂ | θ* = θ] = θ, ∀θ ∈ Θ_b. 
If b > 0, then\ninf_{θ̂∈U} sup_{θ*∈Θ_b} E[‖θ̂ − θ*‖_2^2] ≥ (1 − (1/k_max) Σ_{ℓ=1}^{k_max} 1/ℓ)^{−1} Σ_{i=2}^n 1/λ_i ≥ (1 − (1/k_max) Σ_{ℓ=1}^{k_max} 1/ℓ)^{−1} (n−1)^2/(mk),\nwhere the second inequality follows from Jensen's inequality.\nThe Cramér-Rao lower bound scales as Σ_{i=2}^n 1/λ_i. When G is disconnected, i.e., all the items can be partitioned into two groups such that no user ever compares an item in one group with an item in the other group, λ_2 = 0 and the Cramér-Rao lower bound is infinite, which is valid (and of course tight) because there is no basis for gauging any item in one connected component with respect to any item in the other connected component, and accurate inference is impossible for any estimator. Although the Cramér-Rao lower bound only holds for unbiased estimators, we suspect that a lower bound with the same scaling holds for any estimator, but we do not have a proof.\n\n2.3 ML upper bound\n\nIn this section, we study the ML estimator based on the partial rankings. The ML estimator of θ* is defined as θ̂_ML ∈ arg max_{θ∈Θ_b} L(θ), where L(θ) is the log likelihood function given by\nL(θ) = log P_θ[σ_1^m] = Σ_{j=1}^m Σ_{ℓ=1}^{k_j−1} [θ_{σ_j(ℓ)} − log(exp(θ_{σ_j(ℓ)}) + ··· + exp(θ_{σ_j(k_j)}))].  (1)\nAs observed in [7], L(θ) is concave in θ and thus the ML estimator can be efficiently computed either via the gradient descent method or EM type algorithms.\nThe following theorem gives an upper bound on the error rates inversely dependent on λ_2. 
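Since L(θ) in (1) is concave, the ML estimate can be sketched with plain projected gradient ascent. The paper itself references the MM algorithm of [7]; the simpler sketch below and its names are ours, and the box constraint θ ∈ Θ_b is ignored for this small example:

```python
import math

def pl_log_likelihood(theta, rankings):
    """Log likelihood (1): for each user j and stage l, add
    theta[winner] - log(sum of exp(theta) over items still in the race)."""
    total = 0.0
    for sigma in rankings:
        for l in range(len(sigma) - 1):
            rest = sigma[l:]
            total += theta[sigma[l]] - math.log(sum(math.exp(theta[i]) for i in rest))
    return total

def pl_ml_gradient_ascent(rankings, n, steps=500, lr=0.1):
    """Gradient ascent on the concave L(theta), re-centered each step
    so that sum(theta) = 0 (the identifiability constraint)."""
    theta = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for sigma in rankings:
            for l in range(len(sigma) - 1):
                rest = sigma[l:]
                z = sum(math.exp(theta[i]) for i in rest)
                grad[sigma[l]] += 1.0
                for i in rest:
                    grad[i] -= math.exp(theta[i]) / z
        theta = [t + lr * g for t, g in zip(theta, grad)]
        mean = sum(theta) / n
        theta = [t - mean for t in theta]
    return theta

# Item 0 is ranked first in 3 of 4 rankings, so theta_hat[0] should be largest.
rankings = [[0, 1, 2], [0, 2, 1], [1, 0, 2], [0, 1, 2]]
theta_hat = pl_ml_gradient_ascent(rankings, 3)
```

Note the strong-connectivity caveat discussed in Section 2.5: the ascent only converges to a finite maximizer when no group of items always beats the rest.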
Intuitively, by the well-known Cheeger's inequality, if the spectral gap λ_2 becomes larger, then there are more edges across any bi-partition of G, meaning more pairwise comparisons are available between any bi-partition of the movies, and therefore θ* can be estimated more accurately.\nTheorem 3. Assume λ_n ≥ C log n for a sufficiently large constant C in the case with k > 2. Then with high probability,\n‖θ̂_ML − θ*‖_2 ≤ 4(1 + e^{2b})^2 λ_2^{−1} √(m log n) if k = 2, and ‖θ̂_ML − θ*‖_2 ≤ 8e^{4b} √(2mk log n) / (λ_2 − 16e^{2b} √(λ_n log n)) if k > 2.\nWe compare the above upper bound with the Cramér-Rao lower bound given by Theorem 2. Notice that Σ_{i=1}^n λ_i = mk and λ_1 = 0. Therefore, mk/λ_2^2 ≥ Σ_{i=2}^n 1/λ_i, and the upper bound is always larger than the Cramér-Rao lower bound. When the comparison graph G is an expander and mk = Ω(n log n), by the well-known Cheeger's inequality, λ_2 ∼ λ_n = Ω(log n), and the upper bound is only larger than the Cramér-Rao lower bound by a logarithmic factor. In particular, with the random item assignment scheme, we show that λ_2, λ_n ∼ mk/n if mk/n ≥ C log n and, as a corollary of Theorem 3, mk = Ω(n log n) is sufficient to ensure ‖θ̂_ML − θ*‖_2 = o(√n), proving the random item assignment scheme with the ML estimation is minimax-optimal up to a log n factor.\nCorollary 1. Suppose S_1^m are chosen independently and uniformly at random among all possible subsets of [n]. Then there exists a positive constant C > 0 such that if m ≥ Cn log n when k = 2 and mk ≥ Ce^{2b} n log n when k > 2, then with high probability\n‖θ̂_ML − θ*‖_2 ≤ 4(1 + e^{2b})^2 √(n^2 log n / m) if k = 2, and ‖θ̂_ML − θ*‖_2 ≤ 32e^{4b} √(2n^2 log n / (mk)) if k > 2.\n\nComparison to previous work. Theorem 3 provides the first finite-sample error rates for inferring the parameter vector under the PL model to our knowledge. For the Bradley-Terry model, which is a special case of the PL model with k = 2, [10] derived a similar performance guarantee by analyzing the RankCentrality algorithm and the ML estimator. More recently, [15] extended the results to the non-uniform sampling scheme of item pairs, but the performance guarantees obtained there, when specialized to the uniform sampling scheme, require at least m = Ω(n^4 log n) to ensure ‖θ̂ − θ*‖_2 = o(√n), while our results only require m = Ω(n log n).\n\n2.4 Rank breaking upper bound\n\nIn this section, we study two rank-breaking schemes which decompose partial rankings into pairwise comparisons.\nDefinition 4. Given a partial ranking σ over the subset S ⊂ [n] of size k, the independence-preserving breaking scheme (IB) breaks σ into ⌊k/2⌋ non-intersecting pairwise comparisons of form {i_t, i′_t, y_t}_{t=1}^{⌊k/2⌋} such that {i_s, i′_s} ∩ {i_t, i′_t} = ∅ for any s ≠ t, and y_t = 1 if σ^{−1}(i_t) < σ^{−1}(i′_t) and 0 otherwise. The random IB chooses {i_t, i′_t}_{t=1}^{⌊k/2⌋} uniformly at random among all possibilities.\nIf σ is generated under the PL model, then the IB breaks σ into independent pairwise comparisons generated under the PL model. 
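The random IB of Definition 4 can be sketched directly (an illustrative sketch; names are ours):

```python
import random

def random_ib(sigma, rng):
    """Random independence-preserving breaking (Definition 4): randomly
    pair up the items of the partial ranking sigma into floor(k/2)
    disjoint pairs; y = 1 iff the first item of the pair is ranked
    higher, i.e. appears earlier in sigma."""
    items = list(sigma)
    rng.shuffle(items)                              # random disjoint pairing
    pos = {item: r for r, item in enumerate(sigma)}  # sigma^{-1}
    pairs = []
    for t in range(len(items) // 2):
        i, ip = items[2 * t], items[2 * t + 1]
        pairs.append((i, ip, 1 if pos[i] < pos[ip] else 0))
    return pairs

sigma = [4, 2, 7, 0, 9]            # 4 ranked first, 9 last
pairs = random_ib(sigma, random.Random(1))
```

For a 5-way ranking this yields ⌊5/2⌋ = 2 disjoint pairs (one item is left unpaired), each labeled consistently with σ.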
Hence, we can first break the partial rankings σ_1^m into independent pairwise comparisons using the random IB and then apply the ML estimator on the generated pairwise comparisons with the constraint that θ ∈ Θ_b; denote the resulting estimator by θ̂_IB. Under the random assignment scheme, as a corollary of Theorem 3, mk = Ω(n log n) is sufficient to ensure ‖θ̂_IB − θ*‖_2 = o(√n), proving the random item assignment scheme with the random IB is minimax-optimal up to a log n factor in view of the oracle lower bound in Theorem 1.\nCorollary 2. Suppose S_1^m are chosen independently and uniformly at random among all possible subsets of [n] with size k. There exists a positive constant C > 0 such that if mk ≥ Cn log n, then with high probability,\n‖θ̂_IB − θ*‖_2 ≤ 4(1 + e^{2b})^2 √(2n^2 log n / (mk)).\nDefinition 5. Given a partial ranking σ over the subset S ⊂ [n] of size k, the full breaking scheme (FB) breaks σ into all (k choose 2) possible pairwise comparisons of form {i_t, i′_t, y_t}_{t=1}^{(k choose 2)} such that y_t = 1 if σ^{−1}(i_t) < σ^{−1}(i′_t) and 0 otherwise.\nIf σ is generated under the PL model, then the FB breaks σ into pairwise comparisons which are not independently generated under the PL model. 
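The full breaking of Definition 5 can be sketched likewise. Here each unordered pair is also tagged with a 1/(k − 1) weight, our reading of the normalization the weighted likelihood below attaches to the pairs coming from a k-way ranking (this sketch and its names are ours):

```python
from itertools import combinations

def full_breaking(sigma):
    """Full breaking (Definition 5): decompose a k-way partial ranking
    into all C(k,2) pairwise comparisons.  These pairs are dependent,
    which the mismatched ML estimator deliberately ignores.  Each pair
    (i, i', y, w) carries y = 1 iff i is ranked higher than i' and the
    per-pair weight w = 1/(k-1)."""
    k = len(sigma)
    w = 1.0 / (k - 1)
    pos = {item: r for r, item in enumerate(sigma)}
    return [(i, ip, 1 if pos[i] < pos[ip] else 0, w)
            for i, ip in combinations(sorted(sigma), 2)]

pairs = full_breaking([3, 1, 2])   # 3 ranked first, then 1, then 2
```

A 3-way ranking thus produces C(3,2) = 3 weighted pairs, so each user contributes total pair weight k/2 regardless of k.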
We pretend the pairwise comparisons induced from the full breaking are all independent and maximize the weighted log likelihood function given by\nL(θ) = Σ_{j=1}^m (1/(2(k_j − 1))) Σ_{i,i′∈S_j, i≠i′} (θ_i I{σ_j^{−1}(i) < σ_j^{−1}(i′)} + θ_{i′} I{σ_j^{−1}(i) > σ_j^{−1}(i′)} − log(e^{θ_i} + e^{θ_{i′}}))  (2)\nwith the constraint that θ ∈ Θ_b. Let θ̂_FB denote the maximizer. Notice that we put the weight 1/(2(k_j − 1)) to adjust the contributions of the pairwise comparisons generated from the partial rankings over subsets with different sizes.\nTheorem 4. With high probability, ‖θ̂_FB − θ*‖_2 ≤ 2(1 + e^{2b})^2 √(mk log n) / λ_2. Furthermore, suppose S_1^m are chosen independently and uniformly at random among all possible subsets of [n]. There exists a positive constant C > 0 such that if mk ≥ Cn log n, then with high probability, ‖θ̂_FB − θ*‖_2 ≤ 4(1 + e^{2b})^2 √(n^2 log n / (mk)).\nTheorem 4 shows that the error rate of θ̂_FB inversely depends on λ_2. When the comparison graph G is an expander, i.e., λ_2 ∼ λ_n, the upper bound is only larger than the Cramér-Rao lower bound by a logarithmic factor. A similar observation holds for the ML estimator as shown in Theorem 3. With the random item assignment scheme, Theorem 4 implies that the FB only needs mk = Ω(n log n) to achieve reliable inference, which is optimal up to a log n factor in view of the oracle lower bound in Theorem 1.\n\nComparison to previous work. The rank breaking schemes considered in [8, 9] break the full rankings according to rank positions, while our schemes break the partial rankings according to the item indices. 
The results in [8, 9] establish the consistency of the generalized method of moments under the rank breaking schemes when the data consists of full rankings. In contrast, Corollary 2 and Theorem 4 apply to the more general setting with partial rankings and provide finite-sample error rates, proving the optimality of the random IB and the FB with the random item assignment scheme.\n\n2.5 Numerical experiments\n\nSuppose there are n = 1024 items and θ* is uniformly distributed over [−b, b]. We first generate d full rankings over the 1024 items according to the PL model with parameter θ*. Then for each fixed k ∈ {512, 256, . . . , 2}, we break every full ranking σ into n/k partial rankings over subsets of size k as follows: let {S_j}_{j=1}^{n/k} denote a partition of [n] generated uniformly at random such that S_j ∩ S_{j′} = ∅ for j ≠ j′ and |S_j| = k for all j; generate {σ_j}_{j=1}^{n/k} such that σ_j is the partial ranking over set S_j consistent with σ. In this way, in total we get m = dn/k k-way comparisons which are all independently generated from the PL model. We apply the minorization-maximization (MM) algorithm proposed in [7] to compute the ML estimator θ̂_ML based on the k-way comparisons and the estimator θ̂_FB based on the pairwise comparisons induced by the FB. The estimation error is measured by the rescaled mean square error (MSE) defined by log_2((mk/n^2) ‖θ̂ − θ*‖_2^2).\nWe run the simulation with b = 2 and d = 16, 64. The results are depicted in Fig. 1. We also plot the Cramér-Rao (CR) limit given by log_2((1 − (1/k) Σ_{l=1}^k 1/l)^{−1}) as per Theorem 2. The oracle lower bound in Theorem 1 implies that the rescaled MSE is at least 0. 
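The two quantities plotted in Fig. 1 can be computed directly; a small sketch (function names are ours):

```python
import math

def cr_limit_log2(k):
    """The Cramer-Rao limit curve of Fig. 1:
    log2( (1 - (1/k) * sum_{l=1}^{k} 1/l)^{-1} ), as per Theorem 2."""
    harmonic = sum(1.0 / l for l in range(1, k + 1))
    return math.log2(1.0 / (1.0 - harmonic / k))

def rescaled_mse_log2(theta_hat, theta_star, m, k):
    """The rescaled MSE: log2( (mk/n^2) * ||theta_hat - theta_star||_2^2 )."""
    n = len(theta_star)
    sq = sum((a - b) ** 2 for a, b in zip(theta_hat, theta_star))
    return math.log2(m * k / n ** 2 * sq)

limits = [cr_limit_log2(k) for k in (2, 4, 512)]
val = rescaled_mse_log2([0.1, 0.0, -0.1], [0.0, 0.0, 0.0], m=9, k=2)
```

For k = 2 the CR limit is log2(4) = 2, and it decreases monotonically toward 0 as k grows, which is why the curves in Fig. 1 approach the oracle lower bound of 0 for large k.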
We can see that the rescaled MSE of the ML estimator θ̂_ML is close to the CR limit and approaches the oracle lower bound as k becomes large, suggesting the ML estimator is minimax-optimal. Furthermore, the rescaled MSE of θ̂_FB under the FB is approximately twice the CR limit, suggesting that the FB is minimax-optimal up to a constant factor.\n\nFigure 1: The error rate based on nd/k k-way comparisons with and without full breaking.\n\nFinally, we point out that when d = 16 and log_2(k) = 1, the MSE returned by the MM algorithm is infinite. Such singularity occurs for the following reason. Suppose we consider a directed comparison graph with nodes corresponding to items such that for each (i, j), there is a directed edge (i → j) if item i is ever ranked higher than j. If the graph is not strongly connected, i.e., if there exists a partition of the items into two groups A and B such that items in A are always ranked higher than items in B, then if all {θ_i : i ∈ A} are increased by a positive constant a, and all {θ_i : i ∈ B} are decreased by another positive constant a′ such that all {θ_i, i ∈ [n]} still sum up to zero, the log likelihood (1) must increase; thus, the log likelihood has no maximizer over the parameter space Θ, and the MSE returned by the MM algorithm will diverge. Theoretically, if b is a constant and d exceeds the order of log n, the directed comparison graph will be strongly connected with high probability, and so such singularity does not occur in our numerical experiments when d ≥ 64. 
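The strong-connectivity condition described above can be checked before running MM; a minimal sketch (names are ours):

```python
def strongly_connected(n, rankings):
    """Build the directed comparison graph described above (edge i -> j
    whenever i is ever ranked higher than j) and test strong
    connectivity; if it fails, the likelihood (1) has no maximizer over
    the sum-zero parameter space and MM diverges."""
    fwd = [set() for _ in range(n)]
    rev = [set() for _ in range(n)]
    for sigma in rankings:
        for a in range(len(sigma)):
            for b in range(a + 1, len(sigma)):
                fwd[sigma[a]].add(sigma[b])
                rev[sigma[b]].add(sigma[a])

    def all_reachable(adj):
        # DFS from node 0; strong connectivity <=> node 0 reaches every
        # node in the graph and in its reverse.
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n

    return all_reachable(fwd) and all_reachable(rev)

# Items {0,1} always beat {2,3}: not strongly connected, so ML diverges.
bad = strongly_connected(4, [[0, 1, 2, 3], [1, 0, 3, 2]])
ok = strongly_connected(3, [[0, 1, 2], [2, 0, 1]])
```

This implements remedy 1) below in its simplest form: a full strongly-connected-components decomposition (e.g. Tarjan's algorithm) would identify the components on which MM can be run separately.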
In practice we can deal with this singularity issue in three ways: 1) find the strongly connected components and run MM within each component, yielding an estimator of $\theta^*$ restricted to that component; 2) introduce a proper prior on the parameters and use Bayesian inference to obtain an estimator (see [16]); 3) add to the log likelihood objective a regularization term based on $\|\theta\|_2$ and solve the regularized ML problem using gradient descent algorithms (see [10]).

3 Proofs

We sketch the proofs of our two upper bounds given by Theorem 3 and Theorem 4. The proofs of the other results can be found in the supplementary file. We introduce some additional notation used in the proofs. For a vector $x$, let $\|x\|_2$ denote the usual $\ell_2$ norm. Let $\mathbf{1}$ denote the all-one vector and $\mathbf{0}$ the all-zero vector of the appropriate dimension. Let $\mathcal{S}^n$ denote the set of $n \times n$ symmetric matrices with real-valued entries. For $X \in \mathcal{S}^n$, let $\lambda_1(X) \leq \lambda_2(X) \leq \cdots \leq \lambda_n(X)$ denote its eigenvalues sorted in increasing order, $\mathrm{Tr}(X) = \sum_{i=1}^{n} \lambda_i(X)$ its trace, and $\|X\| = \max\{-\lambda_1(X), \lambda_n(X)\}$ its spectral norm. For two matrices $X, Y \in \mathcal{S}^n$, we write $X \leq Y$ if $Y - X$ is positive semi-definite, i.e., $\lambda_1(Y - X) \geq 0$. Recall that $L(\theta)$ is the log likelihood function; let $\nabla L(\theta)$ denote its gradient and $H(\theta) \in \mathcal{S}^n$ its Hessian matrix.

3.1 Proof of Theorem 3

The main idea of the proof is inspired by the proof of [10, Theorem 4]. We first introduce several key auxiliary results used in the proof. Observe that $\mathbb{E}_{\theta^*}[\nabla L(\theta^*)] = \mathbf{0}$.
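The zero-mean property of the score, $\mathbb{E}_{\theta^*}[\nabla L(\theta^*)] = \mathbf{0}$, can be verified exactly for a single $k$-way PL comparison by enumerating all $k!$ rankings. The sketch below (helper names are ours) writes out the per-ranking gradient of the PL log likelihood and averages it under the model:

```python
import math
from itertools import permutations

def pl_prob(sigma, theta):
    # Plackett-Luce probability of the full ranking sigma (best item first):
    # product over stages of the chosen item's share of the remaining weight.
    p = 1.0
    for t in range(len(sigma)):
        p *= math.exp(theta[sigma[t]]) / sum(math.exp(theta[u]) for u in sigma[t:])
    return p

def grad_log_pl(sigma, theta, i):
    # d/d(theta_i) of log P(sigma): at each stage where item i is still in
    # the running, I{i is chosen} minus i's choice probability at that stage.
    g = 0.0
    for t in range(len(sigma)):
        if i in sigma[t:]:
            denom = sum(math.exp(theta[u]) for u in sigma[t:])
            g += (1.0 if sigma[t] == i else 0.0) - math.exp(theta[i]) / denom
    return g

theta = [0.5, -0.2, -0.3]   # arbitrary scores summing to zero
items = (0, 1, 2)
# Average the score over all 3! rankings, weighted by their PL probabilities.
score_mean = [sum(pl_prob(s, theta) * grad_log_pl(s, theta, i)
                  for s in permutations(items)) for i in items]
```

In exact arithmetic every entry of `score_mean` is zero, since $\sum_\sigma P(\sigma) \nabla \log P(\sigma) = \nabla \sum_\sigma P(\sigma) = \nabla 1 = \mathbf{0}$.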
The following lemma upper bounds the deviation of $\nabla L(\theta^*)$ from its mean.

Lemma 1. With probability at least $1 - \frac{2e^2}{n}$,
$$\|\nabla L(\theta^*)\|_2 \leq \sqrt{2mk \log n}. \quad (3)$$

Observe that $-H(\theta)$ is positive semi-definite with its smallest eigenvalue equal to zero. The following lemma lower bounds its second smallest eigenvalue.

Lemma 2. Fix any $\theta \in \Theta_b$. Then
$$\lambda_2(-H(\theta)) \geq \begin{cases} \frac{1}{4e^{4b}} \lambda_2 & \text{if } k = 2, \\ \frac{e^{2b}}{(1+e^{2b})^2} \big( \lambda_2 - 16 e^{2b} \sqrt{\lambda_n \log n} \big) & \text{if } k > 2, \end{cases} \quad (4)$$
where the inequality holds with probability at least $1 - n^{-1}$ in the case $k > 2$.

Proof of Theorem 3. Define $\Delta = \hat{\theta}_{\rm ML} - \theta^*$. It follows from the definition that $\Delta$ is orthogonal to the all-one vector. By the definition of the ML estimator, $L(\hat{\theta}_{\rm ML}) \geq L(\theta^*)$ and thus
$$L(\hat{\theta}_{\rm ML}) - L(\theta^*) - \langle \nabla L(\theta^*), \Delta \rangle \geq -\langle \nabla L(\theta^*), \Delta \rangle \geq -\|\nabla L(\theta^*)\|_2 \|\Delta\|_2, \quad (5)$$
where the last inequality holds by the Cauchy-Schwarz inequality. By Taylor expansion, there exists a $\theta = a \hat{\theta}_{\rm ML} + (1-a)\theta^*$ for some $a \in [0, 1]$ such that
$$L(\hat{\theta}_{\rm ML}) - L(\theta^*) - \langle \nabla L(\theta^*), \Delta \rangle = \frac{1}{2} \Delta^\top H(\theta) \Delta \leq -\frac{1}{2} \lambda_2(-H(\theta)) \|\Delta\|_2^2, \quad (6)$$
where the last inequality holds because the Hessian matrix $-H(\theta)$ is positive semi-definite with $H(\theta)\mathbf{1} = \mathbf{0}$ and $\Delta^\top \mathbf{1} = 0$. Combining (5) and (6),
$$\|\Delta\|_2 \leq 2 \|\nabla L(\theta^*)\|_2 / \lambda_2(-H(\theta)). \quad (7)$$
Note that $\theta \in \Theta_b$ by definition. The theorem follows by Lemma 1 and Lemma 2.

3.2 Proof of Theorem 4

It follows from the definition of $L(\theta)$ given by (2) that
$$\nabla_i L(\theta^*) = \sum_{j: i \in S_j} \frac{1}{k_j - 1} \sum_{i' \in S_j: i' \neq i} \left[ \mathbb{I}\{\sigma_j^{-1}(i) < \sigma_j^{-1}(i')\} - \frac{\exp(\theta^*_i)}{\exp(\theta^*_i) + \exp(\theta^*_{i'})} \right], \quad (8)$$
which is a sum of $d_i$ independent random variables with mean zero and bounded by 1. By Hoeffding's inequality, $|\nabla_i L(\theta^*)| \leq \sqrt{d_i \log n}$ with probability at least $1 - 2n^{-2}$. By the union bound, $\|\nabla L(\theta^*)\|_2 \leq \sqrt{mk \log n}$ with probability at least $1 - 2n^{-1}$. The Hessian matrix is given by
$$H(\theta) = -\sum_{j=1}^{m} \frac{1}{2(k_j - 1)} \sum_{i, i' \in S_j} (e_i - e_{i'})(e_i - e_{i'})^\top \frac{\exp(\theta_i + \theta_{i'})}{[\exp(\theta_i) + \exp(\theta_{i'})]^2}.$$
If $|\theta_i| \leq b$ for all $i \in [n]$, then $\frac{\exp(\theta_i + \theta_{i'})}{[\exp(\theta_i) + \exp(\theta_{i'})]^2} \geq \frac{e^{2b}}{(1+e^{2b})^2}$. It follows that $-H(\theta) \geq \frac{e^{2b}}{(1+e^{2b})^2} L$ for $\theta \in \Theta_b$, and the theorem follows from (7).

References

[1] M. E. Ben-Akiva and S. R. Lerman, Discrete choice analysis: theory and application to travel demand. MIT Press, 1985, vol. 9.

[2] P. M. Guadagni and J. D. Little, "A logit model of brand choice calibrated on scanner data," Marketing Science, vol. 2, no. 3, pp. 203-238, 1983.

[3] D. McFadden, "Econometric models for probabilistic choice among products," Journal of Business, vol. 53, no. 3, pp. S13-S29, 1980.

[4] P. Sham and D.
Curtis, "An extended transmission/disequilibrium test (TDT) for multi-allele marker loci," Annals of Human Genetics, vol. 59, no. 3, pp. 323-336, 1995.

[5] G. Simons and Y. Yao, "Asymptotics when the number of parameters tends to infinity in the Bradley-Terry model for paired comparisons," The Annals of Statistics, vol. 27, no. 3, pp. 1041-1060, 1999.

[6] J. C. Duchi, L. Mackey, and M. I. Jordan, "On the consistency of ranking algorithms," in Proceedings of the ICML Conference, Haifa, Israel, June 2010.

[7] D. R. Hunter, "MM algorithms for generalized Bradley-Terry models," The Annals of Statistics, vol. 32, no. 1, pp. 384-406, 2004.

[8] H. A. Soufiani, W. Chen, D. C. Parkes, and L. Xia, "Generalized method-of-moments for rank aggregation," in Advances in Neural Information Processing Systems 26, 2013, pp. 2706-2714.

[9] H. Azari Soufiani, D. Parkes, and L. Xia, "Computing parametric ranking models via rank-breaking," in Proceedings of the International Conference on Machine Learning, 2014.

[10] S. Negahban, S. Oh, and D. Shah, "Rank centrality: Ranking from pair-wise comparisons," arXiv:1209.1688, 2012.

[11] T. Qin, X. Geng, and T.-Y. Liu, "A new probabilistic model for rank aggregation," in Advances in Neural Information Processing Systems 23, 2010, pp. 1948-1956.

[12] J. A. Lozano and E. Irurozki, "Probabilistic modeling on rankings," Available at http://www.sc.ehu.es/ccwbayes/members/ekhine/tutorial ranking/info.html, 2012.

[13] S. Jagabathula and D. Shah, "Inferring rankings under constrained sensing," in NIPS, 2008.

[14] M. Braverman and E. Mossel, "Sorting from noisy information," arXiv:0910.1191, 2009.

[15] A. Rajkumar and S.
Agarwal, "A statistical convergence perspective of algorithms for rank aggregation from pairwise data," in Proceedings of the International Conference on Machine Learning, 2014.

[16] J. Guiver and E. Snelson, "Bayesian inference for Plackett-Luce ranking models," in Proceedings of the 26th Annual International Conference on Machine Learning, New York, NY, USA, 2009, pp. 377-384.

[17] A. S. Hossein, D. C. Parkes, and L. Xia, "Random utility theory for social choice," in Proceedings of the 25th Annual Conference on Neural Information Processing Systems, 2012.

[18] H. A. Soufiani, D. C. Parkes, and L. Xia, "Preference elicitation for general random utility models," arXiv:1309.6864, 2013.

[19] R. D. Gill and B. Y. Levit, "Applications of the van Trees inequality: a Bayesian Cram\'er-Rao bound," Bernoulli, vol. 1, no. 1-2, pp. 59-79, 1995.