{"title": "Mallows Models for Top-k Lists", "book": "Advances in Neural Information Processing Systems", "page_first": 4382, "page_last": 4392, "abstract": "The classic Mallows model is a widely-used tool to realize distributions on permutations. Motivated by common practical situations, in this paper, we generalize Mallows to model distributions on top-k lists by using a suitable distance measure between top-k lists. Unlike many earlier works, our model is both analytically tractable and computationally efficient. We demonstrate this by studying two basic problems in this model, namely, sampling and reconstruction, from both algorithmic and experimental points of view.", "full_text": "Mallows Models for Top-k Lists

Flavio Chierichetti
Sapienza University, Rome, Italy
flavio@di.uniroma1.it

Anirban Dasgupta
IIT, Gandhinagar, India
anirban.dasgupta@gmail.com

Shahrzad Haddadan
Sapienza University, Rome, Italy
shahrzad.haddadan@uniroma1.it

Ravi Kumar
Google, Mountain View, CA
ravi.k53@gmail.com

Silvio Lattanzi
Google, Zurich, Switzerland
silviolat@gmail.com

Abstract

The classic Mallows model is a widely-used tool to realize distributions on permutations. Motivated by common practical situations, in this paper, we generalize Mallows to model distributions on top-k lists by using a suitable distance measure between top-k lists. Unlike many earlier works, our model is both analytically tractable and computationally efficient. We demonstrate this by studying two basic problems in this model, namely, sampling and reconstruction, from both algorithmic and experimental points of view.

1 Introduction

Ordering objects according to a set of criteria and presenting a prefix of the ordering to a user has become an accepted form of processed knowledge and is ubiquitous in practical settings. A search engine's result page typically contains the top ten results for a query. 
Newspapers and media\noutlets often list the top few movies in a year and the top few restaurants in a neighborhood. Social\nnetworking sites list the top followers or friends of a user. The popularity of top lists has resulted in\nthe proliferation of dedicated portals such as listverse.com and www.thetoptens.com/lists.\nPresenting a top list instead of a total ordering (permutation) also has many advantages in practice.\nFirst, the universe of objects might be too large to order and present, e.g., even for a niche subject\nlike number theory, Google returns 9.7M Web pages. Secondly, the cognitive load on the user can\nbecome immense if the entire ordering is shown to her, especially when the most interesting piece\nof information is in a short pre\ufb01x\u2014in most cases, the user is indifferent to the 100th most popular\nrestaurant in a city. Thirdly, it may be impossible or meaningless to total order the objects beyond a\ncertain pre\ufb01x length. Sociologists would be hard-pressed to pinpoint the 10000th most livable city.\nPermutations and generative models for permutations have been studied for many decades, backed\nby a rich and elegant mathematical theory. In particular, the well-studied Mallows model offers\nthe conceptual equivalent of the Gaussian distribution in the space of permutations: given a center\npermutation and a spread parameter, this generative model induces a distribution on all permutations,\nwhere the probability of a given permutation is a function of the (Kendall) distance to the center,\nscaled by the spread parameter. The problems of generating a random sample from a Mallows\ndistribution [10], learning the center given a set of samples from the model [4, 5], and learning in\na Mallows mixture setting [1, 6], have all been extensively studied in the literature. 
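As a concrete reference point for the classic model, the Kendall distance and the induced Mallows distribution on permutations can be sketched in a few lines; this is a standard textbook formulation (our own illustration, not code from the paper):

```python
from itertools import permutations
from math import exp

def kendall(sigma, pi):
    """Kendall distance: number of pairs that sigma and pi order differently."""
    pos = {x: i for i, x in enumerate(pi)}
    n = len(sigma)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if pos[sigma[i]] > pos[sigma[j]])

def mallows_pmf(center, beta):
    """Exact Mallows distribution over all permutations of the center's elements:
    probability proportional to exp(-beta * Kendall distance to the center)."""
    weights = {p: exp(-beta * kendall(p, center)) for p in permutations(center)}
    z = sum(weights.values())  # normalizing constant
    return {p: w / z for p, w in weights.items()}
```

For instance, with center (1, 2, 3) and any β > 0, the center itself is the mode of the resulting distribution, and probabilities decay geometrically in the distance to the center.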
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

The ease of analyzing the repeated insertion model (RIM) [10], the canonical way of generating a sample from a Mallows distribution, forms a bedrock of such algorithms and their analysis.
The story, however, is far less mature for top-k lists. While top-k lists have long been considered in the data mining, information retrieval, and machine learning communities, most of the work (with the exception of a handful of papers) has been experimental [13, 18, 26, 24, 8, 12, 23]. In particular, there is no crisp extension of the Mallows model to the top-k list case with provable guarantees on the complexity of either sampling or parameter reconstruction. Any such model has to satisfy the following reasonable desiderata: the model should be conceptually simple, be a true generalization of Mallows, and the algorithms based on the model should have running times (and sample complexity for learning) that are polynomial only in the size of the top list rather than the entire universe. A reasonable and well-studied alternative, pioneered by [13] and developed since then [18, 23] (see related work for a longer discussion), is to posit the existence of a Mallows distribution over permutations of all the elements and yet regard only a prefix as the observable list. This, however, runs into the following issues: (i) generating a single sample from such distributions takes time polynomial in the universe size, essentially because of using the RIM; (ii) semantically, even positing the existence of a full ranking and trying to learn it is potentially working with overspecified models; and hence, (iii) given a sample of top-k lists that is of size polynomial in k, most of the pairwise relations that are learned are potentially statistically spurious (i.e., some pairs of items might be rarely observed). 
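The repeated insertion model itself is compact; the following sketch is for the classic (full-permutation) Mallows model with identity center, where element i is inserted at position j of the current prefix with probability proportional to e^{−β(i−j)} (a standard formulation, offered as an illustration rather than the paper's code):

```python
import random
from math import exp

def rim_sample(n, beta, rng=random):
    """Repeated insertion model: insert element i into the current partial
    permutation at position j (1-indexed) with probability proportional to
    exp(-beta * (i - j)); large beta concentrates near the identity."""
    perm = []
    for i in range(1, n + 1):
        weights = [exp(-beta * (i - j)) for j in range(1, i + 1)]
        r, j = rng.random() * sum(weights), 0
        while r > weights[j]:  # walk the cumulative weights to pick a position
            r -= weights[j]
            j += 1
        perm.insert(j, i)
    return perm
```

With β = 0 this produces a uniformly random permutation; as β grows, samples concentrate on the identity.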
To alleviate these issues and to satisfy our desiderata, we first need a category of models that can replace Mallows models in the study of prefix lists.
Our contributions. A natural pathway towards such a model would be to consider extensions of the Kendall distance to top-k lists. This is the route we take in this paper. We consider parametrized generalizations of the Kendall distance to top-k lists [11, 7, 16], which have nice mathematical properties, and use them to define a Mallows-like model for top-k lists over a universe of size n: given a top-k center list and a spread parameter, our model induces a distribution on all top-k lists.
While our top-k Mallows model is easy to state, it poses several computational challenges, especially if we desire efficiency. We first consider the problem of generating a sample from this model. This problem, solved by RIM in all previous works, turns out to be non-trivial. We show two efficient algorithms: an exact sampling algorithm that runs in time O(k² · 4^k + k² log n) and a Markov chain-based approximate sampling algorithm that runs in time O(k⁵ log k). We then consider two learning problems, namely, reconstructing the center for a given top-k Mallows model and learning a mixture of top-k Mallows models. For both these problems, we propose simple algorithms that have efficient sample complexities and provable guarantees. We also conduct experiments on both synthetic and real-world datasets to demonstrate the efficiency and usability of our algorithms in practice.
Our work (see Section 2 for formal definitions) differs from all previous works [13, 18, 23] on modeling top-k lists in four aspects. Firstly, ours only posits the existence of a top-k list and is based on a true generalization of the Kendall distance to top-k lists. 
Secondly, by using the p-parametrized Kendall distance, we control the probability of elements not in the center appearing in the generated top-k list by changing the value of p, which is more general than specific distances [23].1 Thirdly, in our model the probability of a pair of elements outside the center appearing in a generated top-k list is independent of their relative order, whereas in previous works these probabilities are affected if this pair is inverted. As we discussed earlier, interpreting the ordering among elements outside the center is not meaningful in many applications; in this sense, our model is more natural and useful since it is able to control and minimize the influence of elements outside the center. Finally, our work focuses on algorithmic questions with provable performance guarantees for both sampling and reconstruction, while previous works focused more on the statistical aspects of the model.
A criticism of our model could be that it treats all the tail elements equally. While this may look simplistic, assuming there is an implicit full ranking is unrealistic, and more so for large domains. In this regard, our model offers a spectrum of flexibility. Firstly, note that k is a parameter and hence one can decide how far to dig into the tail for large domains. Secondly, in terms of fitting to observed data, our model has size O(k log n), rather than the Ω(n log n) needed for the full permutation assumption, and hence, as mentioned before, is less prone to learning spurious orderings among most pairs of elements. Thirdly, our model admits natural generalizations in two ways. (i) One can define a general

1 In fact, different p values capture different scenarios. 
For example, if the top-k list is the most livable cities, a bigger p might be desirable, whereas if the top-k list presents movies of a certain genre, a smaller p might be appropriate.

top-k model where the bottom n − k elements are grouped into a small number of classes, each containing elements that are equally likely to appear in the top k. (ii) One can define a model where the center is a full permutation and a suitable Kendall-style distance function between a permutation and a top-k list is chosen (e.g., an appropriate generalization of K^(p) [11] or a Hausdorff variant [7]). A number of our results can be extended to both these generalizations (e.g., the exact/approximate sampling methods for (i)), or at least help identify the right algorithmic tools (e.g., random walks and dynamic programming) to focus on for these extensions, while an RIM-based analysis cannot.
Related work. The papers closest to this work are [13, 18, 23]. Fligner and Verducci [13] define a distribution on partial rankings by taking a marginal over all possible extensions of the top-k ranking. This idea is extended by Lebanon and Mao [18] to partial rankings, and also by giving a kernel density-based estimator for the underlying hidden permutation. While such an approach has the nice property of a closed-form expression for the sampling probability of top-k lists (and some generalizations thereof), it is not based on any distance function between top-k lists, instead positing the existence of an underlying full permutation. Meila and Bao [23] propose a top-k list generative model derived from [13] by considering the central permutation to be infinitely long. 
Their model is based on a specific set distance between a top-k list and a permutation and assumes the existence of a full (infinitely long) permutation.
A RIM-like generative process and a reconstruction algorithm for a generalization of the Mallows model using the 'average precision' distance, which places more importance on the displacement of the top elements, was proposed in [8]. Although the focus of their work is on full permutations, [20] outline a strategy to sample from any distribution over rankings as long as it is easy to sample from the corresponding conditional insertion probabilities; they show that this is the case for restricted classes of preference rankings that are generalizations of Mallows. Recently, [12] proposed a rank-dependent coarsening model for generating partial rankings and showed consistency of certain estimations. A number of other heuristics for estimating a consensus ranking by aggregation of partial rankings can be found in [7, 21].
Comparing top-k lists has found several recent applications in comparing different measures of user importance in social networks [17, 27], diversifying recommendations [28], and various other information aggregation tasks [15, 14, 9].

2 Preliminaries

Let [n] = {1, . . . , n} be a universe of n elements. A top-k list over [n] is a partial order of the form i1 > ··· > ik > {ik+1, . . . , in} where the iℓ's are elements of [n]. In other words, a top-k list has k elements that are strictly ordered and ranked above the remaining n − k elements, which are incomparable to each other. Let Tk,n be the set of all top-k lists over [n]; clearly Tn,n = Sn, the symmetric group on n elements.
Throughout the paper, let k ≤ n; we will think of k as O(log n) or as a constant. We will use τ and subscripted versions to denote a generic top-k list. 
Given τ ∈ Tk,n and an element i ∈ [n], we use i ∈ τ to denote that i is one of the top k elements in τ. For a position i ∈ [k], we let τ(i) denote the element at position i in τ. For i, j ∈ [n], we use i >τ j to denote that i is ranked above j in τ, i.e., i ∈ τ and either j ∉ τ, or j ∈ τ but ranked below i. We use i ∥τ j to denote that i ∉ τ and j ∉ τ, i.e., i and j are incomparable, and use i ⊥τ j to denote that i and j are comparable, i.e., either i <τ j or i >τ j. Let ¯τ = [n] \ τ = {i ∈ [n] : i ∉ τ}, the elements ranked below τ. For a subset S ⊆ [n], let S ∩ τ = {i ∈ S : i ∈ τ}.
We will consider the following distance measure [7, 11] that is a generalization of the well-known Kendall distance between permutations and, while not a metric, has nice mathematical properties [11]. Let p ≥ 0 be a parameter. The p-parametrized Kendall distance between τ1, τ2 ∈ Tk,n is defined below, together with the Mallows model for top-k lists. 
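The notation above is easy to operationalize; the following sketch (our own illustration, not the paper's code) represents a top-k list as a Python tuple of its k top elements and implements the relations, as well as the p-parametrized distance defined formally below, pair by pair:

```python
def rank(tau, i):
    """Position of i in tau (None if i is not among the top k)."""
    return tau.index(i) if i in tau else None

def above(tau, i, j):
    """i >_tau j: i is in tau, and j is either absent or ranked below i."""
    return i in tau and (j not in tau or tau.index(i) < tau.index(j))

def incomparable(tau, i, j):
    """i ||_tau j: neither i nor j appears in tau."""
    return i not in tau and j not in tau

def comparable(tau, i, j):
    """i is comparable to j in tau: i <_tau j or i >_tau j."""
    return above(tau, i, j) or above(tau, j, i)

def kendall_p(tau1, tau2, p):
    """p-parametrized Kendall distance: a pair ordered oppositely in the two
    lists costs 1; a pair comparable in one list but incomparable in the
    other costs p; every other pair costs 0."""
    elems = sorted(set(tau1) | set(tau2))
    d = 0.0
    for a in range(len(elems)):
        for b in range(a + 1, len(elems)):
            i, j = elems[a], elems[b]
            if (above(tau1, i, j) and above(tau2, j, i)) or \
               (above(tau1, j, i) and above(tau2, i, j)):
                d += 1
            elif (comparable(tau1, i, j) and incomparable(tau2, i, j)) or \
                 (incomparable(tau1, i, j) and comparable(tau2, i, j)):
                d += p
    return d
```

For example, for disjoint lists (1, 2) and (3, 4), every cross pair is inverted (cost 4) and the two within-list pairs are each comparable in one list and incomparable in the other (cost 2p).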
For τ1, τ2 ∈ Tk,n,

K^(p)(τ1, τ2) = Σ_{i,j ∈ τ1∪τ2, i<j} K̄^(p)_{i,j}(τ1, τ2),   where

K̄^(p)_{i,j}(τ1, τ2) equals 1 if (i <τ1 j and i >τ2 j) or vice versa; p if (i ⊥τ1 j and i ∥τ2 j) or vice versa; and 0 otherwise.

Given a center top-k list τ∗ and a decay parameter β > 0, the top-k Mallows model induces a distribution on Tk,n such that

Pr[τ ∈ Tk,n] ∝ exp(−β · K^(p)(τ∗, τ)).

By relabeling, we can assume without loss of generality that τ∗ is the "identity", i.e., τ∗ = Ik ≜ 1 > ··· > k > {k + 1, . . . , n}. We denote K^(p)(Ik, τ) by K^(p)(τ). We refer to the above distribution as M^(p)_{β,τ∗,k,n}, abbreviated as M_{β,τ∗} if p, k, n are clear from the context and as M_β if τ∗ = Ik. Let

Z^(p)_{β,k,n} = Σ_{τ ∈ Tk,n} exp(−β · K^(p)(τ))

be the normalizing constant. All the missing proofs are in the supplementary material.

3 Efficient sampling

We first study the problem of efficiently sampling from the top-k Mallows model. Our goal is to obtain algorithms that are time- and space-efficient in terms of both n and k, ideally O(n · poly(k)). Recall that it is possible to efficiently sample the Mallows model on full permutations (k = n) using RIM [10]. A naive way of constructing a top-k list would be to sample a full permutation according to RIM and discard the bottom n − k elements; however, this process is incorrect: for instance, it does not account for the parameter p. A more direct approach would be to modify RIM to work in the top-k case. To recap, in RIM, new elements are inserted one-by-one to construct a sample permutation, where at the ith step the element i is inserted into the partial permutation σ1 . . . 
σi−1 at position j ≤ i with probability pij; it turns out that these insertion probabilities can be explicitly computed for a given parameter β, making the method computationally efficient. Unfortunately, it is unclear how to adapt RIM to the top-k setting for two reasons: (i) the insertion probabilities do not seem explicitly computable, and (ii) the distance between two top-k lists can decrease after the insertion of an element (note this is not true for full permutations).
In the following we present two algorithms for sampling. The first is an exact sampling algorithm based on dynamic programming with time and space exponential in k. The second is an approximate sampling algorithm based on Markov chains with a running time that is poly(k).

3.1 Exact sampling with a dynamic program

First note that the sampling problem is easily solvable in time O(n^k · k log k). Indeed, we can enumerate all the top-k lists (there are at most n^k of them) and for each list τ, in time O(k log k), compute K^(p)(τ). This completely describes the distribution on Tk,n, and it is easy to sample exactly from this distribution in time O(n^k + k log n).
We now present a more efficient exact sampling algorithm with running time exponential in k but logarithmic in n. The main intuition behind the algorithm is that two top-k lists that differ only by elements in {k + 1, . . . , n} have the same probability of being sampled. In particular, we can decompose the Kendall distance between a top-k list τ and the identity Ik by considering separately the number of inversions among the first k elements, the number of inversions between the first k and the last n − k elements, and the number of elements of [k] that are incomparable. All the top-k lists for which these three quantities are the same are equiprobable.
We now formally present this approach. Let ℓ ∈ {1, . . . 
, k} and let P, Q ⊆ [k] such that |P| = ℓ = |Q|. Let m ∈ {0, 1, . . . , k(k − 1)/2}. Define Tk,n(P, Q, m) ⊆ Tk,n to be the following set. A top-k list τ ∈ Tk,n(P, Q, m) iff (i) [k] ∩ τ = P; (ii) the elements of P occur only in the positions given by Q; and (iii) the contribution to K^(p) by the elements in P is m, i.e., the number of pairs i, j ∈ P that appear in τ in the order opposite to their order in τ∗ is m. Since all lists in Tk,n(P, Q, m) are equiprobable, a dynamic program over the triples (P, Q, m) yields the exact sampler with the O(k² · 4^k + k² log n) running time claimed earlier.

3.2 Approximate sampling with a Markov chain

We next define a Markov chain C on Tk,n whose stationary distribution is the desired top-k Mallows distribution. We say that two elements xi, xj ∈ τ∗ are τ∗-adjacent if there is no xk ∈ τ∗ with xi >τ∗ xk >τ∗ xj or xj >τ∗ xk >τ∗ xi.
Definition 1 (Chain C). Let τ ∈ Tk,n. Choose 1 ≤ i ≤ k − 1 u.a.r. and equiprobably do one of:
(i) Transposition step: Equiprobably do one of:
T0: If τ(i) ∈ τ∗, then find the minimum j > i such that τ(j) ∈ τ∗ and put them in the order of τ∗ w.p. e^β/(1 + e^β) and in the opposite order w.p. 1/(1 + e^β).
T1: If τ(i) ∉ τ∗, then find the minimum j > i such that τ(j) ∉ τ∗ and put them in the current order w.p. 1/2 and in the opposite order w.p. 1/2.
T2: If (τ(i) ∈ τ∗ and τ(i + 1) ∉ τ∗) or (τ(i) ∉ τ∗ and τ(i + 1) ∈ τ∗), then put them in the order of τ∗ w.p. e^β/(1 + e^β) and in the opposite order w.p. 1/(1 + e^β).
(ii) Substitution step: W.p. 1/2 stay at the current state, and w.p. 1/2 equiprobably do one of:
S0: A homogeneous substitution, i.e.,
S00: If τ(i) ∈ τ∗, then let xi be the τ∗-adjacent element such that xi >τ∗ τ(i); if xi ∉ τ, then replace τ(i) by xi w.p. e^β/(1 + e^β). Likewise, let xj be the τ∗-adjacent element such that xj <τ∗ τ(i); if xj ∉ τ, then replace τ(i) by xj w.p. 1/(1 + e^β).
S01: If τ(i) ∉ τ∗, pick c u.a.r. 
from ¯τ, and if c ∉ τ∗, replace τ(i) by c w.p. 1/2.
S1: A non-homogeneous substitution, i.e., choose c u.a.r. from ¯τ and compare it with τ(k); if one of them is in τ∗ and the other one is not, keep the τ∗ element inside w.p. e^{β(1+p·i)}/(1 + e^{β(1+p·i)}) and the element outside τ∗ w.p. 1/(1 + e^{β(1+p·i)}); here, i = |τ[1, k − 1] ∩ ¯τ∗|.
If the premise is not satisfied in any of the above, then do nothing.
We first show that the stationary distribution Π of C is the desired top-k Mallows distribution.
Lemma 3. Π = M_{β,τ∗}.
We next bound the relaxation time of C. To do this, we employ a useful technique from Markov chains known as decomposition. In this technique, we partition the Markov chain C into smaller Markov chains C(1), . . . , C(k) and connect the C(i)'s by another Markov chain ¯C. It follows that trel(C) ≤ trel(¯C) · maxi trel(C(i)); see [22]. Here, we partition the state space into k + 1 parts. Indeed, for 0 ≤ i ≤ k, let T^(i)_{k,n} = {τ ∈ Tk,n | |τ ∩ τ∗| = i}. We define the restriction of C to each partition T^(i)_{k,n} as follows: the chain C(i) performs exactly like C in Definition 1 except that it never takes the S1 option. Let ¯C denote the chain that connects these partitions; we will present its transition probabilities and bound trel(¯C) later.
To present the analysis, we also need two additional concepts. The first concerns biased and unbiased adjacent transposition chains. Given an n × n non-negative matrix P(·,·) with entries in [0, 1], an n-adjacent transposition Markov chain is defined on Sn as follows: at state τ ∈ Sn, pick i u.a.r. and place τ(i) and τ(i + 1) in the natural order w.p. 
P(τ(i), τ(i + 1)) and in the opposite order w.p. 1 − P(τ(i), τ(i + 1)). We call the chain unbiased if the entries of P are 1/2 and biased otherwise. If there is a constant p > 1/2 such that P(i, j) = p for all j > i, then the relaxation time of a biased adjacent transposition chain is n² [2]. For the unbiased chain the relaxation time is bounded by n³ log n [25]. (Notice that the mixing time of the biased chain is independent of p. This is quite common when analyzing the mixing time of random walks, e.g., a biased random walk on a line.)2
The second concept concerns exclusion processes. Given integers k, n such that k < n and a p ∈ [0, 1], an (n, k, p)-exclusion process is a Markov chain defined on the set of n-bit strings of Hamming weight k that makes the following transpositions: at a state x, choose i ∈ [n] u.a.r.; if x(i) = x(i + 1), then the chain stays at state x; otherwise the new state has (x(i), x(i + 1)) = (1, 0) w.p. p and (0, 1) w.p. 1 − p. The relaxation time of this exclusion process is n² [2].
Lemma 4. For each i ∈ [k], trel(C(i)) = O(k³ log k).
We now proceed to define ¯C. This will be a chain defined on the partitions, i.e., its sample space will be {T^(i)_{k,n}}, and its transitions are defined by the S1 step in Definition 1. Let qi = Pr_{τ ∈ T^(i)_{k,n}}[τ(k) ∉ τ∗]. Note that qi is decreasing with respect to i: q1 = 1 − 1/Z and q_{k−1} = e^{(k−2)β}/Z, where Z = (e^{(k−1)β} − 1)/(e^β − 1) ≤ 2e^{(k−2)β}.
From state T^(i)_{k,n}, the chain moves to T^(i+1)_{k,n} w.p. ri = qi · e^{β(1+p(k−i))}/(1 + e^{β(1+p(k−i))}), moves to T^(i−1)_{k,n} w.p. ℓi = (1 − qi)/(1 + e^{β(1+p(k−i))}), and does nothing w.p. 1 − ri − ℓi. Note that since qi ≥ 1/2, we always have ri > 1/2. 
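The chain ¯C just described is a birth–death chain on the k + 1 partitions; a generic simulation sketch, with the move-up/move-down probabilities passed in as functions since the exact r_i and ℓ_i depend on the model as above (our own illustration, not the paper's code):

```python
import random

def birth_death_step(i, k, up, down, rng=random):
    """One step of a birth-death chain on {0, ..., k}: move up w.p. up(i),
    move down w.p. down(i), and stay put otherwise (moves off either end
    are clamped, i.e., the chain stays at the boundary)."""
    u = rng.random()
    if u < up(i):
        return min(i + 1, k)
    if u < up(i) + down(i):
        return max(i - 1, 0)
    return i

def simulate(k, up, down, steps, start=0, rng=random):
    """Run the chain for a fixed number of steps and return the final state."""
    i = start
    for _ in range(steps):
        i = birth_death_step(i, k, up, down, rng)
    return i
```

With up(i) ≡ 1 and down(i) ≡ 0, the walk drifts to state k and stays there, mirroring the fact that r_i > 1/2 pushes ¯C toward partitions sharing more elements with the center.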
Furthermore, we always have ri/ℓ_{i+1} > 1.
Lemma 5. trel(¯C) ≤ k²/4.
Finally, we are ready to bound the relaxation time of C. Using Lemma 4, Lemma 5, and the decomposition technique, we obtain the following:
Theorem 6. trel(C) = O(k⁵ log k).
We remark that using the above result, we can also bound the total variation mixing time, which is another measure of mixing. The total variation mixing time, denoted ttv, is defined as the minimum time after which the L1 distance between the current state of C and Π falls below a constant, say, 1/4. From Theorem 6 and the fact that trel ≤ ttv ≤ trel · log(1/min_x Π(x)), we conclude that ttv ≤ βk⁷ log k. In the supplementary material we show ttv = Ω(k³). Our experimental results suggest that the L1 distance falls below a constant after roughly k³ steps.

4 Reconstruction

In this section we focus on two basic learning questions in the spirit of [1, 5, 6, 8]. How many samples from the top-k Mallows model do we need to observe to reconstruct the center top-k list? And, given samples from a mixture of several top-k Mallows models, how can we learn the individual components of the mixture?

4.1 Learning the center

In this section we give a simple algorithm for provably reconstructing the central top-k list, given enough samples from M_β. The main idea is to track the ordering among pairs of elements generated by M_β and use this information to reconstruct the center. For the remainder of this section, whenever τ ∈ Tk,n is a random variable, we assume it is generated according to M_{β,τ∗,k,n}. For simplicity, we assume that the center τ∗ = Ik, the identity top-k list.
First we bound the probability of an inversion in a top-k Mallows model.
Lemma 7. 
For any two elements i < j, i ∈ [k], it holds that Prτ[i >τ j] ≥ exp(β) · Prτ[j >τ i].
We next bound the probability that the top-k Mallows model contains a given element.3

2 In fact, ours is a random walk on a partial order: σ ≤ τ iff, when going from σ to τ, the number of transpositions placing a bigger number ahead of a smaller number is greater than the number of transpositions placing a smaller number ahead of a bigger number. Similar to what happens for a walk on a total order, where a particle moves left w.p. q and right w.p. 1 − q, the mixing time is maximized when q = 1/2.
3 For simplicity we provide only a bound involving β, n, and k, although we note that a tighter bound could be obtained involving p as well.

Lemma 8. For any i ∈ [k], it holds that Prτ[i ∈ τ] = Ω(exp(β)/(n − k)).
Using these two bounds, we finally obtain the center reconstruction algorithm.
Theorem 9. There exists a polynomial-time algorithm that uses Θ(((e^{−β} + 1)/(e^{β} − 1)) · (n − k) log n) samples from M_β and can identify the central top-k list with probability 1 − o(n^{−3}).

4.2 Learning mixtures

In this section we consider the problem of learning a uniform mixture of several top-k Mallows models. Let τ∗1, . . . , τ∗t be the centers and let β be the common decay parameter. For simplicity of exposition we assume in this section that β ≤ 1, i.e., we assume to be in the case where samples can end up being far from the center. Note that this assumption is not crucial and similar bounds can be derived for β > 1. In the uniform mixture model, first an i ∈ [t] is chosen u.a.r. and then a top-k list is generated according to M_{β,τ∗i}. 
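Generating from such a uniform mixture is straightforward once a single-model sampler is available; in the sketch below the per-center sampler `sample_one` is a hypothetical argument, so any sampler for M_{β,τ∗} can be plugged in (our own illustration):

```python
import random

def sample_mixture(centers, sample_one, m, rng=random):
    """Draw m top-k lists from a uniform mixture: for each draw, pick one
    of the centers u.a.r. and then sample from that center's model."""
    return [sample_one(rng.choice(centers)) for _ in range(m)]
```

For instance, passing the identity function as `sample_one` simply returns m centers chosen uniformly at random, which is the β → ∞ limit of the mixture.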
The goal of the reconstruction problem is, given enough samples generated according to the mixture, to learn the centers.
The main idea behind the algorithm is to first cluster the given samples into t clusters and then apply center reconstruction to each cluster. To be able to do the first step without any error, we need a simplifying assumption that every pair of centers is sufficiently far apart.
First, we observe that samples from a top-k Mallows model will mostly end up close to its center. For simplicity of exposition, we introduce a slight abuse of notation: if τ is a top-k list, we will sometimes use τ as if it were the set of its top-k elements. Thus τ ∩ τ′ will denote the set of elements that are within the first k of both τ and τ′, and τ \ τ′ will denote the elements that are within the first k of τ but not within the first k of τ′.
Lemma 10. If β ≤ 1, then we have Pr_{τ∼M_{β,τ∗}}[ |τ ∩ τ∗| ≤ k − √(β^{−1} · 3k ln n) ] < n^{−2k}.
Given this, provided that the centers in the mixture are sufficiently far from each other4, we can cluster the sampled top-k lists using single-linkage clustering in such a way that two samples are clustered together if and only if they were produced by the same center.
Lemma 11. Suppose that β ≤ 1, that for each pair {i, j} ⊆ [t] it holds that |τ∗i \ τ∗j| > 4√(β^{−1} · 3k ln n), and that each M_{β,τ∗i} contributes at most o(n^{2k}) samples. Then, we can cluster the sampled top-k lists in polynomial time so that, w.p. 1 − o(1), two samples end up in the same cluster if and only if they were generated by the same center.
Using the clustering, the algorithm for mixtures is easy.
Theorem 12. 
Suppose that β ≤ 1, that t = o(β · n^{k−1}/log n), and that for each pair {i, j} ⊆ [t] it holds that |τ∗i \ τ∗j| > 4√(β^{−1} · 3k ln n). Then, there is an algorithm that uses

O( t · ((e^{−β} + 1)/(e^{β} − 1)) · (n − k) log(nt) ) ≤ O( β^{−1} · nt · ln(nt) )

samples to identify the t central top-k lists with probability 1 − o(n^{−3}).

Figure 1: Empirical performance of the Markov chain based approximate sampling. The two plots correspond to n = 10 (samples=10K) and n = 25 (samples=1M), both having p = 0.5.

4 An oft-made assumption in provable learning of mixtures.

Figure 2: The first three plots detail the results for n = 10 and k = 5, p = 0.5, and β ∈ {0.2, 0.5, 1.0}. The last plot, instead, has n = 100 and k = 10, and p = β = 0.5. In each of the plots, the y-axis has range [0, 1] and represents the probability that the algorithms return the correct center; the ranges of the x-axis, which represents the number of samples, vary across the plots.

5 Experiments

We describe two sets of experiments. We first show that our Markov chain based algorithm can efficiently sample top-k lists. We then use this sampler to create synthetic datasets from known centers and show the applicability of our reconstruction results. While our reconstruction algorithm from Theorem 9 assumes that there are enough samples, we generalize it based on the Condorcet criterion and a tie-breaking rule. 
We show that our proposed algorithm outperforms reasonable baselines on synthetic datasets. We apply this proposed algorithm to a real dataset of top-k movie lists to create an aggregate list and present anecdotal evidence.
Approximate sampling. First we test the convergence of the Markov chain sampling algorithm (Section 3.2). We start a number of walks from a given center and conduct the walks for a fixed number of steps. We calculate the empirical distribution of the endpoints and then compute the L1 difference of this empirical distribution from the true distribution (calculated via the dynamic program in Section 3.1). This L1 difference is then plotted against the number of steps used by the random walk. In Figure 1 we plot the results for n = 10, k = 5 and n = 25, k = 5. Each plot contains two lines, one for each value of β in {0.5, 1.0}, p being set to 0.5 throughout. For n = 10, the empirical distribution is computed using 10K different walk samples, and 1M for n = 25. Note that the state space increases as O(n^k) and hence a larger value of n needs substantially more samples in order to achieve similar L1 distances. Also, since a smaller value of β results in a more uniform distribution over the state space, the L1 difference of the random walk sampler for β = 0.5 is higher than the one for β = 1.0. In each of the cases, we see the L1 difference stabilizing after roughly 200 steps; any fluctuation after that is due to variance. In each case, it stabilizes to a non-zero value, as the number of samples in each case is only a constant fraction of the support, and this does not let the L1 difference between the empirical distribution and the true one go to zero.
Center reconstruction. Next we study the problem of center reconstruction for top-k lists.
(i) We propose to extend our algorithm from Theorem 9 to the case when there are not enough samples by using the Condorcet criterion [3] as follows. 
Given the set of input top-k lists, for each pair of elements {i, j}, we say that i beats j in a head-to-head contest if the number of top-k lists where i > j is larger than the number of top-k lists where j > i; if the two numbers are the same, then we say that the contest is tied. The value of i is defined to be the number of j ≠ i such that i beats j in a head-to-head contest. Our proposed algorithm orders the elements according to their value; if the first k + 1 elements have no value-ties, then the algorithm returns the first k elements in the ordering. It is easy to show that, if the condition of Theorem 9 is satisfied, then with high probability there will be no value-ties and the returned top-k list will be the center. We also implemented a tie-breaking rule5 for the case where the number of samples is not large enough.
(ii) The Borda count algorithm (named after the Borda method [3]), given a top-k list, assigns a score of k to the topmost element of the list, a score of k − 1 to the second element, and so on down to a score of 1 to the kth element, and 0 to every other element. 
The elements are then sorted in decreasing order of the sum of their scores across the input top-k lists. If there are no ties among the first k + 1 elements, the algorithm returns the first k elements; otherwise it declares failure.
(iii) As another baseline, we consider the following positional algorithm. The ideal position of an element i is its most frequent position, in the range {1, . . . , k}, across the input top-k lists; if there are ties, the algorithm fails. If, for each ℓ ∈ [k], there exists exactly one element whose ideal position is ℓ, then the returned top-k list is the perfect matching between the positions and the elements.
We ran the algorithms on synthetic datasets, as well as on a real-world dataset. The synthetic datasets were all generated by the top-k Mallows model with a single center.
Synthetic datasets. We first produced a number of synthetic top-k lists by sampling from M^(p)_{β, I_k, k, n}, for a number of choices of n, k, β, p, by running the Markov chain (Section 3.2) for 1000 steps. Figure 2 shows the probability of returning the center (averaged over various runs) for each algorithm on these inputs. The Condorcet algorithm clearly outperforms the other two.
Criterion dataset. This dataset contains the top-10 movie lists of a number of directors, actors, and artists, as collected by the Criterion web site. We acquired 176 movie lists from www.criterion.com/explore/top10 and truncated each list to the first 10 movies. 
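Before turning to the results, the three reconstruction rules above can be sketched in a few lines of Python. This is a minimal illustration with our own function names; ties simply cause failure, and the tie-breaking rule of footnote 5 is omitted:

```python
from collections import Counter
from itertools import combinations

def condorcet_center(lists, k, universe):
    """Order elements by head-to-head wins; fail on value-ties in the top k+1."""
    def above(lst, i, j):
        # i is "above" j if it appears earlier; absent elements rank below all.
        pi = lst.index(i) if i in lst else len(lst)
        pj = lst.index(j) if j in lst else len(lst)
        return None if pi == pj else pi < pj
    value = {e: 0 for e in universe}
    for i, j in combinations(universe, 2):
        wins_i = sum(1 for l in lists if above(l, i, j) is True)
        wins_j = sum(1 for l in lists if above(l, i, j) is False)
        if wins_i > wins_j:
            value[i] += 1
        elif wins_j > wins_i:
            value[j] += 1
    order = sorted(universe, key=lambda e: -value[e])
    vals = [value[e] for e in order[:k + 1]]
    return order[:k] if len(set(vals)) == len(vals) else None

def borda_center(lists, k, universe):
    """Score k for the top element down to 1 for the k-th; fail on score ties."""
    score = Counter()
    for l in lists:
        for pos, e in enumerate(l):
            score[e] += k - pos
    order = sorted(universe, key=lambda e: -score[e])
    top = [score[e] for e in order[:k + 1]]
    return order[:k] if len(set(top)) == len(top) else None

def positional_center(lists, k, universe):
    """Assign each element its most frequent position; fail on any tie."""
    center = [None] * k
    for e in universe:
        counts = Counter(l.index(e) for l in lists if e in l)
        if not counts:
            continue  # element never appears, so it gets no ideal position
        common = counts.most_common()
        if len(common) > 1 and common[0][1] == common[1][1]:
            return None  # tied most-frequent position
        pos = common[0][0]
        if pos < k:
            if center[pos] is not None:
                return None  # two elements claim the same position
            center[pos] = e
    return center if all(x is not None for x in center) else None
```

For example, with input lists [[0, 1, 2], [0, 1, 2], [0, 2, 1]], universe {0, ..., 4}, and k = 3, all three rules return [0, 1, 2].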
The central top-10 list returned by the Condorcet algorithm, run on the aforementioned 176 top-10 lists, is, starting from its top element: Gilliam's "Brazil", Fellini's "8½", Reed's "The Third Man", Maysles, Maysles & Zwerin's "Gimme Shelter", Kurosawa's "Shichinin no Samurai", Laughton's "The Night of the Hunter", Bresson's "Au hasard Balthazar", Fassbinder's "Angst essen Seele auf", Pontecorvo's "La battaglia di Algeri", and Carné's "Les Enfants du Paradis". Most of the movies in this list are considered masterpieces of the 7th Art.

6 Conclusions

In this work we proposed a top-k Mallows model that generalizes the classic Mallows model to top-k lists. While our model is challenging from an analytical point of view, we show that it is still possible to design efficient and provably good algorithms for sampling and reconstruction. Our work opens several promising research directions, including improving the running times and sample complexity bounds, and extending the model to other measures on top-k lists [7].

Acknowledgments

Flavio Chierichetti and Shahrzad Haddadan were supported by the ERC Starting Grant DMAP 680153, by the SIR Grant RBSI14Q743, by a Google Focused Award, and by the "Dipartimenti di Eccellenza 2018-2022" grant awarded to the Dipartimento di Informatica at Sapienza.

References

[1] P. Awasthi, A. Blum, O. Sheffet, and A. Vijayaraghavan. Learning mixtures of ranking models. In NIPS, pages 2609–2617, 2014.

[2] I. Benjamini, N. Berger, C. Hoffman, and E. Mossel. Mixing times of the biased card shuffling and the asymmetric exclusion process. Trans. 
AMS, 357:3013–3029, 2005.

5The tie-breaking rule orders each maximal set of tied elements that covers some of the first k positions, by trying to extend the results of the pairwise head-to-head contests between the elements of the set to a full linear order. The algorithm fails if this cannot be done uniquely in each of the relevant maximal sets.

[3] D. Black. The Theory of Committees and Elections. Springer, 1987.

[4] M. Braverman and E. Mossel. Sorting from noisy information. Technical Report 0910.1191, arXiv, 2009.

[5] F. Chierichetti, A. Dasgupta, R. Kumar, and S. Lattanzi. On reconstructing a hidden permutation. In APPROX-RANDOM, pages 604–617, 2014.

[6] F. Chierichetti, A. Dasgupta, R. Kumar, and S. Lattanzi. On learning mixture models for permutations. In ITCS, pages 85–92, 2015.

[7] D. E. Critchlow. Metric Methods for Analyzing Partially Ranked Data, volume 34 of Lecture Notes in Statistics. Springer, 1985.

[8] L. De Stefani, A. Epasto, E. Upfal, and F. Vandin. Reconstructing hidden permutations using the average-precision (AP) correlation statistic. In AAAI, pages 1526–1532, 2016.

[9] R. P. DeConde, S. Hawley, S. Falcon, N. Clegg, B. Knudsen, and R. Etzioni. Combining results of microarray experiments: A rank aggregation approach. Statistical Applications in Genetics and Molecular Biology, 5(1), 2006.

[10] J. Doignon, A. Pekec, and M. Regenwetter. The repeated insertion model for rankings: Missing link between two subset choice models. Psychometrika, 69(1):33–54, 2004.

[11] R. Fagin, R. Kumar, and D. Sivakumar. Comparing top k lists. SIAM J. Discrete Math., 17(1):134–160, 2003.

[12] M. A. Fahandar, E. Hüllermeier, and I. Couso. Statistical inference for incomplete ranking data: The case of rank-dependent coarsening. In ICML, pages 1078–1087, 2017.

[13] M. A. Fligner and J. S. Verducci. Distance based ranking models. 
Journal of the Royal Statistical Society, Series B (Methodological), pages 359–369, 1986.

[14] D. F. Hsu and I. Taksa. Comparing rank and score combination methods for data fusion in information retrieval. Information Retrieval, 8(3):449–480, 2005.

[15] I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys, 40(4):11, 2008.

[16] A. Klementiev, D. Roth, and K. Small. Unsupervised rank aggregation with distance-based models. In ICML, pages 472–479, 2008.

[17] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In WWW, pages 591–600, 2010.

[18] G. Lebanon and Y. Mao. Non-parametric modeling of partially ranked data. JMLR, 9:2401–2429, 2008.

[19] D. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. AMS, 2009.

[20] T. Lu and C. Boutilier. Effective sampling and learning for Mallows models with pairwise-preference data. JMLR, 15(1):3783–3829, 2014.

[21] J. I. Marden. Analyzing and Modeling Rank Data. CRC Press, 1996.

[22] R. Martin and D. Randall. Disjoint decomposition of Markov chains and sampling circuits in Cayley graphs. Combinatorics, Probability and Computing, 15:411–448, 2006.

[23] M. Meila and L. Bao. An exponential model for infinite rankings. JMLR, pages 3481–3518, 2010.

[24] S. Niu, Y. Lan, J. Guo, and X. Cheng. A new probabilistic model for top-k ranking problem. In CIKM, pages 2519–2522, 2012.

[25] D. Wilson. Mixing times of lozenge tiling and card shuffling Markov chains. The Annals of Applied Probability, 1:274–325, 2004.

[26] F. Xia, T.-Y. Liu, and H. Li. Statistical consistency of top-k ranking. In NIPS, pages 2098–2106, 2009.

[27] J. Zhang, M. S. Ackerman, and L. Adamic. Expertise networks in online communities: Structure and algorithms. 
In WWW, pages 221–230, 2007.

[28] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In WWW, pages 22–32, 2005.