{"title": "Learning Mixtures of Ranking Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2609, "page_last": 2617, "abstract": "This work concerns learning probabilistic models for ranking data in a heterogeneous population. The specific problem we study is learning the parameters of a {\\em Mallows Mixture Model}. Despite being widely studied, current heuristics for this problem do not have theoretical guarantees and can get stuck in bad local optima. We present the first polynomial time algorithm which provably learns the parameters of a mixture of two Mallows models. A key component of our algorithm is a novel use of tensor decomposition techniques to learn the top-$k$ prefix in both the rankings. Before this work, even the question of {\\em identifiability} in the case of a mixture of two Mallows models was unresolved.", "full_text": "Learning Mixtures of Ranking Models\u2217\n\nPranjal Awasthi\n\nPrinceton University\n\npawashti@cs.princeton.edu\n\nAvrim Blum\n\nCarnegie Mellon University\n\navrim@cs.cmu.edu\n\nOr Sheffet\n\nHarvard University\n\nosheffet@seas.harvard.edu\n\nAravindan Vijayaraghavan\n\nNew York University\n\nvijayara@cims.nyu.edu\n\nAbstract\n\nThis work concerns learning probabilistic models for ranking data in a heteroge-\nneous population. The speci\ufb01c problem we study is learning the parameters of a\nMallows Mixture Model. Despite being widely studied, current heuristics for this\nproblem do not have theoretical guarantees and can get stuck in bad local optima.\nWe present the \ufb01rst polynomial time algorithm which provably learns the param-\neters of a mixture of two Mallows models. A key component of our algorithm is\na novel use of tensor decomposition techniques to learn the top-k pre\ufb01x in both\nthe rankings. 
Before this work, even the question of identifiability in the case of a mixture of two Mallows models was unresolved.\n\n1 Introduction\n\nProbabilistic modeling of ranking data is an extensively studied problem with a rich body of past work [1, 2, 3, 4, 5, 6, 7, 8, 9]. Ranking using such models has applications in a variety of areas ranging from understanding user preferences in electoral systems and social choice theory, to more modern learning tasks in online web search, crowd-sourcing and recommendation systems. Traditionally, models for generating ranking data consider a homogeneous group of users with a central ranking (permutation) \u03c0\u2217 over a set of n elements or alternatives. (For instance, \u03c0\u2217 might correspond to a \u201cground-truth ranking\u201d over a set of movies.) Each individual user generates her own ranking as a noisy version of this one central ranking and independently from other users. The most popular ranking model of choice is the Mallows model [1], where in addition to \u03c0\u2217 there is also a scaling parameter \u03c6 \u2208 (0, 1). Each user picks her ranking \u03c0 w.p. proportional to $\\phi^{d_{kt}(\\pi,\\pi^*)}$ where $d_{kt}(\\cdot)$ denotes the Kendall-Tau distance between permutations (see Section 2).\u00b9 We denote such a model as Mn(\u03c6, \u03c0\u2217).\nThe Mallows model and its generalizations have received much attention from the statistics, political science and machine learning communities, relating this probabilistic model to the long-studied work about voting and social choice [10, 11]. From a machine learning perspective, the problem is to find the parameters of the model (the central permutation \u03c0\u2217 and the scaling parameter \u03c6) using independent samples from the distribution. 
There is a large body of work [4, 6, 5, 7, 12] providing efficient algorithms for learning the parameters of a Mallows model.\n\n\u2217This work was supported in part by NSF grants CCF-1101215, CCF-1116892, the Simons Institute, and a Simons Foundation Postdoctoral fellowship. Part of this work was performed while the 3rd author was at the Simons Institute for the Theory of Computing at the University of California, Berkeley and the 4th author was at CMU.\n\n\u00b9In fact, it was shown [1] that this model is the result of the following simple (inefficient) algorithm: rank every pair of elements randomly and independently s.t. with probability $\\frac{1}{1+\\phi}$ they agree with $\\pi^*$ and with probability $\\frac{\\phi}{1+\\phi}$ they don\u2019t; if all $\\binom{n}{2}$ pairs agree on a single ranking, output this ranking, otherwise resample.\n\nIn many scenarios, however, the population is heterogeneous with multiple groups of people, each with their own central ranking [2]. For instance, when ranking movies, the population may be divided into two groups corresponding to men and women; with men ranking movies with one underlying central permutation, and women ranking movies with another underlying central permutation. This naturally motivates the problem of learning a mixture of multiple Mallows models for rankings, a problem that has received significant attention [8, 13, 3, 4]. Heuristics like the EM algorithm have been applied to learn the model parameters of a mixture of Mallows models [8]. The problem has also been studied under distributional assumptions over the parameters, e.g. weights derived from a Dirichlet distribution [13]. 
However, unlike the case of a single Mallows model, algorithms with provable guarantees have remained elusive for this problem.\nIn this work we give the first polynomial time algorithm that provably learns a mixture of two Mallows models. The input to our algorithm consists of i.i.d. random rankings (samples), with each ranking drawn with probability w1 from a Mallows model Mn(\u03c61, \u03c01), and with probability w2 (= 1 \u2212 w1) from a different model Mn(\u03c62, \u03c02).\nInformal Theorem. Given sufficiently many i.i.d. samples drawn from a mixture of two Mallows models, we can learn the central permutations \u03c01, \u03c02 exactly and parameters \u03c61, \u03c62, w1, w2 up to $\\epsilon$-accuracy in time $\\mathrm{poly}\\left(n, (\\min\\{w_1, w_2\\})^{-1}, \\frac{1}{\\phi_1(1-\\phi_1)}, \\frac{1}{\\phi_2(1-\\phi_2)}, \\epsilon^{-1}\\right)$.\nIt is worth mentioning that, to the best of our knowledge, prior to this work even the question of identifiability was unresolved for a mixture of two Mallows models: given infinitely many i.i.d. samples generated from a mixture of two distinct Mallows models with parameters {w1, \u03c61, \u03c01, w2, \u03c62, \u03c02} (with \u03c01 \u2260 \u03c02 or \u03c61 \u2260 \u03c62), could there be a different set of parameters $\\{w_1', \\phi_1', \\pi_1', w_2', \\phi_2', \\pi_2'\\}$ which explains the data just as well? Our result shows that this is not the case and the mixture is uniquely identifiable given polynomially many samples.\nIntuition and a Na\u00efve First Attempt. It is evident that having access to sufficiently many random samples allows one to learn a single Mallows model. Let the elements in the permutations be denoted as {e1, e2, . . . , en}. In a single Mallows model, the probability of element ei going to position j (for j \u2208 [n]) drops off exponentially as one goes farther from the true position of ei [12]. 
So by assigning each ei the most frequent position in our sample, we can find the central ranking \u03c0\u2217.\nThe above mentioned intuition suggests the following clustering based approach to learn a mixture of two Mallows models: look at the distribution of the positions where element ei appears. If the distribution has 2 clearly separated \u201cpeaks\u201d then they will correspond to the positions of ei in the central permutations. Now, dividing the samples according to ei being ranked in a high or a low position is likely to give us two pure (or almost pure) subsamples, each one coming from a single Mallows model. We can then learn the individual models separately. More generally, this strategy works when the two underlying permutations \u03c01 and \u03c02 are far apart, which can be formulated as a separation condition.\u00b2 Indeed, the above-mentioned intuition works only under strong separation conditions: otherwise, the observation regarding the distribution of positions of element ei is no longer true.\u00b3 For example, if \u03c01 ranks ei in position k and \u03c02 ranks ei in position k + 2, it is likely that the most frequent position of ei is k + 1, which differs from ei\u2019s position in either permutation!\nHandling arbitrary permutations. Learning mixture models under no separation requirements is a challenging task. To the best of our knowledge, the only polynomial time algorithm known is for the case of a mixture of a constant number of Gaussians [17, 18]. Other works, like the recent developments that use tensor based methods for learning mixture models without distance-based separation conditions [19, 20, 21], still require non-degeneracy conditions and/or work for specific sub cases (e.g. spherical Gaussians).\nThese sophisticated tensor methods form a key component in our algorithm for learning a mixture of two Mallows models. 
This is non-trivial as learning over rankings poses challenges which are not present in other widely studied problems such as mixture of Gaussians. For the case of Gaussians, spectral techniques have been extremely successful [22, 16, 19, 21]. Such techniques rely on estimating the covariances and higher order moments in terms of the model parameters to detect structure and dependencies. On the other hand, in the mixture of Mallows models problem there is no \u201cnatural\u201d notion of a second/third moment. A key contribution of our work is defining analogous notions of moments which can be represented succinctly in terms of the model parameters. As we later show, this allows us to use tensor based techniques to get a good starting solution.\n\n\u00b2Identifying a permutation \u03c0 over n elements with an n-dimensional vector (\u03c0(i))i, this separation condition can be roughly stated as $\\|\\pi_1 - \\pi_2\\|_\\infty = \\tilde{\\Omega}\\left((\\min\\{w_1, w_2\\})^{-1} \\cdot (\\min\\{\\log(1/\\phi_1), \\log(1/\\phi_2)\\})^{-1}\\right)$.\n\n\u00b3Much like how other mixture models are solvable under separation conditions, see [14, 15, 16].\n\nOverview of Techniques. One key difficulty in arguing about the Mallows model is the lack of closed form expressions for basic propositions like \u201cthe probability that the i-th element of \u03c0\u2217 is ranked in position j.\u201d Our first observation is that the distribution of a given element appearing at the top, i.e. the first position, behaves nicely. Given an element e whose rank in the central ranking \u03c0\u2217 is i, the probability that a ranking sampled from a Mallows model ranks e as the first element is $\\propto \\phi^{i-1}$. A length n vector consisting of these probabilities is what we define as the first moment vector of the Mallows model. 
Clearly by sorting the coordinates of the first moment vector, one can recover the underlying central permutation and estimate \u03c6. Going a step further, consider any two elements which are in positions i, j respectively in \u03c0\u2217. We show that the probability that a ranking sampled from a Mallows model ranks these two elements in (any of the 2! possible orderings of) the first two positions is $\\propto f(\\phi)\\phi^{i+j-2}$. We call the n \u00d7 n matrix of these probabilities the second moment matrix of the model (analogous to the covariance matrix). Similarly, we define the 3rd moment tensor as the probability that any 3 elements appear in positions {1, 2, 3}. We show in the next section that in the case of a mixture of two Mallows models, the 3rd moment tensor defined this way has a rank-2 decomposition, with each rank-1 term corresponding to the first moment vector of one of the two Mallows models. This motivates us to use tensor-based techniques to estimate the first moment vectors of the two Mallows models, thus learning the models\u2019 parameters.\nThe above mentioned strategy would work if one had access to infinitely many samples from the mixture model. But notice that the probabilities in the first-moment vectors decay exponentially, so by using polynomially many samples we can only recover a prefix of length $\\sim \\log_{1/\\phi} n$ from both rankings. This forms the first part of our algorithm which outputs good estimates of the mixture weights, scaling parameters \u03c61, \u03c62 and prefixes of a certain size from both the rankings. Armed with w1, w2 and these two prefixes we next proceed to recover the full permutations \u03c01 and \u03c02. In order to do this, we take two fresh batches of samples. On the first batch, we estimate the probability that element e appears in position j for all e and j. 
On the second batch, which is noticeably larger than the first, we estimate the probability that e appears in position j conditioned on a carefully chosen element e\u2217 appearing as the first element. We show that this conditioning is almost equivalent to sampling from the same mixture model but with rescaled weights $w_1'$ and $w_2'$.\nThe two estimations allow us to set up a system of two linear equations in two variables: $f^{(1)}(e \\to j)$, the probability of element e appearing in position j under the first model (with central ranking \u03c01), and $f^{(2)}(e \\to j)$, the same probability under the second model (with central ranking \u03c02). Solving this linear system we find the position of e in each permutation.\nThe above description contains most of the core ideas involved in the algorithm. We need two additional components. First, notice that the 3rd moment tensor is not well defined for triplets (i, j, k) when i, j, k are not all distinct, and hence cannot be estimated from sampled data. To get around this barrier we consider a random partition of our element-set into 3 disjoint subsets. The actual tensor we work with consists only of triplets (i, j, k) where the indices belong to different parts of the partition. Secondly, we have to handle the case where the tensor-based technique fails, i.e. when the 3rd moment tensor isn\u2019t full-rank. This is a degenerate case. Typically, tensor based approaches for other problems cannot handle such degenerate cases. However, in the case of the Mallows mixture model, we show that such a degenerate case provides a lot of useful information about the problem. In particular, it must hold that $\\phi_1 \\simeq \\phi_2$, and \u03c01 and \u03c02 are fairly close: one is almost a cyclic shift of the other. To show this we use a characterization of when the tensor decomposition is unique (for tensors of rank 2), and we handle such degenerate cases separately. 
Altogether, we find the mixture model\u2019s parameters with no non-degeneracy conditions.\nLower bound under the pairwise access model. Given that a single Mallows model can be learned using only pairwise comparisons, a very restricted access to each sample, it is natural to ask, \u201cIs it possible to learn a mixture of Mallows models from pairwise queries?\u201d This next example shows that we cannot hope to do this even for a mixture of two Mallows models. Fix some \u03c6 and \u03c0 and assume our sample is taken using mixing weights of w1 = w2 = 1/2 from the two Mallows models Mn(\u03c6, \u03c0) and Mn(\u03c6, rev(\u03c0)), where rev(\u03c0) indicates the reverse permutation (the first element of \u03c0 is the last of rev(\u03c0), the second is the next-to-last, etc.). Consider two elements, e and e\u2032. Using only pairwise comparisons, we have that it is just as likely to rank e > e\u2032 as it is to rank e\u2032 > e and so this case cannot be learned regardless of the sample size.\n3-wise queries. We would also like to stress that our algorithm does not need full access to the sampled rankings and instead will work with access to certain 3-wise queries. Observe that the first part of our algorithm, where we recover the top elements in each of the two central permutations, only uses access to the top 3 elements in each sample. In that sense, we replace the pairwise query \u201cdo you prefer e to e\u2032?\u201d with a 3-wise query: \u201cwhat are your top 3 choices?\u201d Furthermore, the second part of the algorithm (where we solve a set of 2 linear equations) can be altered to support 3-wise queries of the (admittedly, somewhat unnatural) form \u201cif e\u2217 is your top choice, do you prefer e to e\u2032?\u201d For ease of exposition, we will assume full-access to the sampled rankings.\nFuture Directions. Several interesting directions come out of this work. 
A natural next step is to generalize our results to learn a mixture of k Mallows models for k > 2. We believe that most of these techniques can be extended to design algorithms that take $\\mathrm{poly}(n, 1/\\epsilon)^k$ time. It would also be interesting to get algorithms for learning a mixture of k Mallows models which run in time poly(k, n), perhaps in an appropriate smoothed analysis setting [23] or under other non-degeneracy assumptions. Perhaps more importantly, our result indicates that tensor based methods, which have been very popular for learning problems, might also be a powerful tool for tackling ranking-related problems in the fields of machine learning, voting and social choice.\nOrganization. In Section 2 we give the formal definition of the Mallows model and of the problem statement, as well as some useful facts about the Mallows model. Our algorithm and its numerous subroutines are detailed in Section 3. In Section 4 we experimentally compare our algorithm with a popular EM based approach for the problem. The complete details of our algorithms and proofs are included in the supplementary material.\n\n2 Notations and Properties of the Mallows Model\nLet Un = {e1, e2, . . . , en} be a set of n distinct elements. We represent permutations over the elements in Un through their indices [n]. (E.g., \u03c0 = (n, n \u2212 1, . . . , 1) represents the permutation $(e_n, e_{n-1}, \\ldots, e_1)$.) Let $\\mathrm{pos}_\\pi(e_i) = \\pi^{-1}(i)$ refer to the position of ei in the permutation \u03c0. We omit the subscript \u03c0 when the permutation \u03c0 is clear from context. For any two permutations \u03c0, \u03c0\u2032 we denote $d_{kt}(\\pi, \\pi')$ as the Kendall-Tau distance [24] between them (number of pairwise inversions between \u03c0, \u03c0\u2032). 
Given some \u03c6 \u2208 (0, 1) we denote $Z_i(\\phi) = \\frac{1-\\phi^i}{1-\\phi}$, and partition function $Z_{[n]}(\\phi) = \\sum_{\\pi} \\phi^{d_{kt}(\\pi,\\pi_0)} = \\prod_{i=1}^{n} Z_i(\\phi)$ (see Section 6 in the supplementary material).\n\nDefinition 2.1. [Mallows model (Mn(\u03c6, \u03c00)).] Given a permutation \u03c00 on [n] and a parameter \u03c6 \u2208 (0, 1),\u2074 a Mallows model is a permutation generation process that returns permutation \u03c0 w.p.\n\n$\\Pr(\\pi) = \\phi^{d_{kt}(\\pi,\\pi_0)} / Z_{[n]}(\\phi)$\n\nIn Section 6 we show many useful properties of the Mallows model which we use repeatedly throughout this work. We believe that they provide insight into the Mallows model, and we advise the reader to go through them. We proceed with the main definition.\nDefinition 2.2. [Mallows Mixture model w1Mn(\u03c61, \u03c01) \u2295 w2Mn(\u03c62, \u03c02).] Given parameters w1, w2 \u2208 (0, 1) s.t. w1 + w2 = 1, parameters \u03c61, \u03c62 \u2208 (0, 1) and two permutations \u03c01, \u03c02, we define a mixture of two Mallows models to be the process that with probability w1 generates a permutation from M(\u03c61, \u03c01) and with probability w2 generates a permutation from M(\u03c62, \u03c02).\nOur next definition is crucial for our application of tensor decomposition techniques.\nDefinition 2.3. [Representative vectors.] The representative vector of a Mallows model is a vector where for every i \u2208 [n], the ith-coordinate is $\\phi^{\\mathrm{pos}_\\pi(e_i)-1}/Z_n$.\nThe expression $\\phi^{\\mathrm{pos}_\\pi(e_i)-1}/Z_n$ is precisely the probability that a permutation generated by a model Mn(\u03c6, \u03c0) ranks element ei at the first position (proof deferred to the supplementary material). Given that our focus is on learning a mixture of two Mallows models Mn(\u03c61, \u03c01) and Mn(\u03c62, \u03c02), we denote x as the representative vector of the first model, and y as the representative vector of the latter. 
Note that retrieving the vectors x and y exactly implies that we can learn the permutations \u03c01 and \u03c02 and the values of \u03c61, \u03c62.\n\n\u2074It is also common to parameterize using $\\beta \\in \\mathbb{R}^+$ where $\\phi = e^{-\\beta}$. For small \u03b2 we have (1 \u2212 \u03c6) \u2248 \u03b2.\n\nFinally, let $f(i \\to j)$ be the probability that element ei goes to position j according to the mixture model. Similarly, let $f^{(1)}(i \\to j)$ and $f^{(2)}(i \\to j)$ be the corresponding probabilities according to Mallows models M1 and M2 respectively. Hence, $f(i \\to j) = w_1 f^{(1)}(i \\to j) + w_2 f^{(2)}(i \\to j)$.\nTensors: Given two vectors $u \\in \\mathbb{R}^{n_1}$, $v \\in \\mathbb{R}^{n_2}$, we define $u \\otimes v \\in \\mathbb{R}^{n_1 \\times n_2}$ as the matrix $uv^T$. Given also $z \\in \\mathbb{R}^{n_3}$ then $u \\otimes v \\otimes z$ denotes the 3-tensor (of rank 1) whose (i, j, k)-th coordinate is $u_i v_j z_k$. A tensor $T \\in \\mathbb{R}^{n_1 \\times n_2 \\times n_3}$ has a rank-r decomposition if T can be expressed as $\\sum_{i \\in [r]} u_i \\otimes v_i \\otimes z_i$ where $u_i \\in \\mathbb{R}^{n_1}$, $v_i \\in \\mathbb{R}^{n_2}$, $z_i \\in \\mathbb{R}^{n_3}$. Given two vectors $u, v \\in \\mathbb{R}^n$, we use (u; v) to denote the n \u00d7 2 matrix that is obtained with u and v as columns.\nWe now define first, second and third order statistics (frequencies) that serve as our proxies for the first, second and third order moments.\nDefinition 2.4. [Moments] Given a Mallows mixture model, we denote for every i, j, k \u2208 [n]\n\u2022 $P_i = \\Pr(\\mathrm{pos}(e_i) = 1)$ is the probability that element ei is ranked at the first position\n\u2022 $P_{ij} = \\Pr(\\mathrm{pos}(\\{e_i, e_j\\}) = \\{1, 2\\})$ is the probability that ei, ej are ranked at the first two positions (in any order)\n\u2022 $P_{ijk} = \\Pr(\\mathrm{pos}(\\{e_i, e_j, e_k\\}) = \\{1, 2, 3\\})$ is the probability that ei, ej, ek are ranked at the first three positions (in any order).\n\nFor convenience, let P represent the set of quantities $(P_i, P_{ij}, P_{ijk})_{1 \\le i < j < k \\le n}$. These can be estimated up to any inverse polynomial accuracy using only polynomially many samples. 
The following simple, yet crucial lemma relates P to the vectors x and y, and demonstrates why these statistics and representative vectors are ideal for tensor decomposition.\nLemma 2.5. Given a mixture w1M(\u03c61, \u03c01) \u2295 w2M(\u03c62, \u03c02) let x, y and P be as defined above.\n\n1. For any i it holds that $P_i = w_1 x_i + w_2 y_i$.\n\n2. Denote $c_2(\\phi) = \\frac{Z_n(\\phi)}{Z_{n-1}(\\phi)} \\cdot \\frac{1+\\phi}{\\phi}$. Then for any $i \\neq j$ it holds that $P_{ij} = w_1 c_2(\\phi_1) x_i x_j + w_2 c_2(\\phi_2) y_i y_j$.\n\n3. Denote $c_3(\\phi) = \\frac{Z_n^2(\\phi)}{Z_{n-1}(\\phi) Z_{n-2}(\\phi)} \\cdot \\frac{1+2\\phi+2\\phi^2+\\phi^3}{\\phi^3}$. Then for any distinct i, j, k it holds that $P_{ijk} = w_1 c_3(\\phi_1) x_i x_j x_k + w_2 c_3(\\phi_2) y_i y_j y_k$.\n\nClearly, if i = j then Pij = 0, and if i, j, k are not all distinct then Pijk = 0.\nIn addition, in Lemma 13.2 in the supplementary material we prove the bounds $c_2(\\phi) = O(1/\\phi)$ and $c_3(\\phi) = O(\\phi^{-3})$.\nPartitioning Indices: Given a partition of [n] into Sa, Sb, Sc, let $x^{(a)}, y^{(a)}$ be the representative vectors x, y restricted to the indices (rows) in Sa (similarly for Sb, Sc). Then the 3-tensor\n\n$T^{(abc)} \\equiv (P_{ijk})_{i \\in S_a, j \\in S_b, k \\in S_c} = w_1 c_3(\\phi_1) x^{(a)} \\otimes x^{(b)} \\otimes x^{(c)} + w_2 c_3(\\phi_2) y^{(a)} \\otimes y^{(b)} \\otimes y^{(c)}.$\n\nThis tensor has a rank-2 decomposition, with one rank-1 term for each Mallows model. Finally for convenience we define the matrix M = (x; y), and similarly define the matrices $M_a = (x^{(a)}; y^{(a)})$, $M_b = (x^{(b)}; y^{(b)})$, $M_c = (x^{(c)}; y^{(c)})$.\nError Dependency and Error Polynomials. Our algorithm gives an estimate of the parameters w, \u03c6 that we learn in the first stage, and we use these estimates to figure out the entire central rankings in the second stage. The following lemma essentially allows us to assume that, instead of estimates, we have access to the true values of w and \u03c6.\nLemma 2.6. 
For every \u03b4 > 0 there exists a function f(n, \u03c6, \u03b4) s.t. for every n, \u03c6 and $\\hat{\\phi}$ satisfying $|\\phi - \\hat{\\phi}| < \\frac{\\delta}{f(n,\\phi,\\delta)}$ we have that the total-variation distance satisfies $\\|M(\\phi, \\pi) - M(\\hat{\\phi}, \\pi)\\|_{TV} \\le \\delta$.\n\nFor the ease of presentation, we do not optimize constants or polynomial factors in all parameters. In our analysis, we show how our algorithm is robust (in a polynomial sense) to errors in various statistics, to prove that we can learn with polynomially many samples. However, the simplification when there are no errors (infinite samples) still carries many of the main ideas in the algorithm; this in fact shows the identifiability of the model, which was not known previously.\n\n3 Algorithm Overview\n\nAlgorithm 1 LEARN MIXTURES OF TWO MALLOWS MODELS. Input: a set S of N samples from w1M(\u03c61, \u03c01) \u2295 w2M(\u03c62, \u03c02), accuracy parameters $\\epsilon, \\epsilon_2$.\n1. Let $\\hat{P}$ be the empirical estimate of P on samples in S.\n2. Repeat O(log n) times:\n  (a) Partition [n] randomly into Sa, Sb and Sc. Let $T^{(abc)} = (\\hat{P}_{ijk})_{i \\in S_a, j \\in S_b, k \\in S_c}$.\n  (b) Run TENSOR-DECOMP from [25, 26, 23] to get a decomposition of $T^{(abc)} = u^{(a)} \\otimes u^{(b)} \\otimes u^{(c)} + v^{(a)} \\otimes v^{(b)} \\otimes v^{(c)}$.\n  (c) If $\\min\\{\\sigma_2(u^{(a)}; v^{(a)}), \\sigma_2(u^{(b)}; v^{(b)}), \\sigma_2(u^{(c)}; v^{(c)})\\} > \\epsilon_2$ (in the non-degenerate case these matrices are far from being rank-1 matrices in the sense that their least singular value is bounded away from 0):\n    i. Obtain parameter estimates ($\\hat{w}_1, \\hat{w}_2, \\hat{\\phi}_1, \\hat{\\phi}_2$ and prefixes of the central rankings $\\pi_1', \\pi_2'$) from INFER-TOP-K($\\hat{P}, M_a', M_b', M_c'$), with $M_i' = (u^{(i)}; v^{(i)})$ for i \u2208 {a, b, c}.\n    ii. Use RECOVER-REST to find the full central rankings $\\hat{\\pi}_1, \\hat{\\pi}_2$. Return SUCCESS and output ($\\hat{w}_1, \\hat{w}_2, \\hat{\\phi}_1, \\hat{\\phi}_2, \\hat{\\pi}_1, \\hat{\\pi}_2$).\n3. Run HANDLE DEGENERATE CASES ($\\hat{P}$).\n\nOur algorithm (Algorithm 1) has two main components. First we invoke a decomposition algorithm [25, 26, 23] over the tensor $T^{(abc)}$, and retrieve approximations of the two Mallows models\u2019 representative vectors which in turn allow us to approximate the weight parameters w1, w2, scale parameters \u03c61, \u03c62, and the top few elements in each central ranking. We then use the inferred parameters to recover the entire rankings \u03c01 and \u03c02. Should the tensor-decomposition fail, we invoke a special procedure to handle such degenerate cases. Our algorithm has the following guarantee.\nTheorem 3.1. Let w1M(\u03c61, \u03c01) \u2295 w2M(\u03c62, \u03c02) be a mixture of two Mallows models and let $w_{\\min} = \\min\\{w_1, w_2\\}$ and $\\phi_{\\max} = \\max\\{\\phi_1, \\phi_2\\}$ and similarly $\\phi_{\\min} = \\min\\{\\phi_1, \\phi_2\\}$. Denote $\\epsilon_0 = \\frac{w_{\\min}^2 (1-\\phi_{\\max})^{10}}{16 n^{22} \\phi_{\\max}^2}$. Then, given any $0 < \\epsilon < \\epsilon_0$, suitably small $\\epsilon_2 = \\mathrm{poly}(\\frac{1}{n}, \\epsilon, \\phi_{\\min}, w_{\\min})$ and $N = \\mathrm{poly}\\left(n, \\frac{1}{\\min\\{\\epsilon, \\epsilon_0\\}}, \\frac{1}{\\phi_1(1-\\phi_1)}, \\frac{1}{\\phi_2(1-\\phi_2)}, \\frac{1}{w_1}, \\frac{1}{w_2}\\right)$ i.i.d. samples from the mixture model, Algorithm 1 recovers, in poly-time and with probability $\\ge 1 - n^{-3}$, the model\u2019s parameters with w1, w2, \u03c61, \u03c62 recovered up to $\\epsilon$-accuracy.\nNext we detail the various subroutines of the algorithm, and give an overview of the analysis for each subroutine. The full analysis is given in the supplementary material.\nThe TENSOR-DECOMP Procedure. 
This procedure is a straightforward invocation of the algorithm detailed in [25, 26, 23]. This algorithm uses spectral methods to retrieve the two vectors generating the rank-2 tensor $T^{(abc)}$. This technique works when all factor matrices $M_a = (x^{(a)}; y^{(a)})$, $M_b = (x^{(b)}; y^{(b)})$, $M_c = (x^{(c)}; y^{(c)})$ are well-conditioned. We note that any algorithm that decomposes non-symmetric tensors which have well-conditioned factor matrices can be used as a black box.\nLemma 3.2 (Full rank case). In the conditions of Theorem 3.1, suppose our algorithm picks some partition Sa, Sb, Sc such that the matrices Ma, Mb, Mc are all well-conditioned, i.e. have $\\sigma_2(M_a), \\sigma_2(M_b), \\sigma_2(M_c) \\ge \\epsilon_2' \\ge \\mathrm{poly}(\\frac{1}{n}, \\epsilon, \\epsilon_2, w_1, w_2)$. Then with high probability, Algorithm TENSOR-DECOMP of [25] finds $M_a' = (u^{(a)}; v^{(a)})$, $M_b' = (u^{(b)}; v^{(b)})$, $M_c' = (u^{(c)}; v^{(c)})$ such that for any \u03c4 \u2208 {a, b, c}, we have $u^{(\\tau)} = \\alpha_\\tau x^{(\\tau)} + z_1^{(\\tau)}$ and $v^{(\\tau)} = \\beta_\\tau y^{(\\tau)} + z_2^{(\\tau)}$, with $\\|z_1^{(\\tau)}\\|, \\|z_2^{(\\tau)}\\| \\le \\mathrm{poly}(\\frac{1}{n}, \\epsilon, \\epsilon_2, w_{\\min})$ and $\\sigma_2(M_\\tau') > \\epsilon_2$ for \u03c4 \u2208 {a, b, c}.\nThe INFER-TOP-K procedure. This procedure uses the output of the tensor-decomposition to retrieve the weights, \u03c6\u2019s and the representative vectors. In order to convert $u^{(a)}, u^{(b)}, u^{(c)}$ into an approximation of $x^{(a)}, x^{(b)}, x^{(c)}$ (and similarly with $v^{(a)}, v^{(b)}, v^{(c)}$ and $y^{(a)}, y^{(b)}, y^{(c)}$), we need to find a good approximation of the scalars $\\alpha_a, \\alpha_b, \\alpha_c$. This is done by solving a certain linear system. This also allows us to estimate $\\hat{w}_1, \\hat{w}_2$. Given our approximation of x, it is easy to find \u03c61 and the top elements of \u03c01: we sort the coordinates of x, setting $\\pi_1'$ to be the first elements in the sorted vector, and \u03c61 as the ratio between any two adjacent entries in the sorted vector. We refer the reader to Section 8 in the supplementary material for full details.\nThe RECOVER-REST procedure. The algorithm for recovering the remaining entries of the central permutations (Algorithm 2) is more involved.\n\nAlgorithm 2 RECOVER-REST. Input: a set S of N samples from w1M(\u03c61, \u03c01) \u2295 w2M(\u03c62, \u03c02), parameters $\\hat{w}_1, \\hat{w}_2, \\hat{\\phi}_1, \\hat{\\phi}_2$ and initial permutations $\\hat{\\pi}_1, \\hat{\\pi}_2$, and accuracy parameter $\\epsilon$.\n1. For elements in $\\hat{\\pi}_1$ and $\\hat{\\pi}_2$, compute representative vectors $\\hat{x}$ and $\\hat{y}$ using estimates $\\hat{\\phi}_1$ and $\\hat{\\phi}_2$.\n2. Let $|\\hat{\\pi}_1| = r_1$, $|\\hat{\\pi}_2| = r_2$ and wlog $r_1 \\ge r_2$. If there exists an element ei such that $\\mathrm{pos}_{\\hat{\\pi}_1}(e_i) > r_1$ and $\\mathrm{pos}_{\\hat{\\pi}_2}(e_i) < r_2/2$ (or in the symmetric case), then: Let S1 be the subsample with ei ranked in the first position.\n  (a) Learn a single Mallows model on S1 to find $\\hat{\\pi}_1$. Given $\\hat{\\pi}_1$ use dynamic programming to find $\\hat{\\pi}_2$.\n3. Let $e_{i^*}$ be the first element in $\\hat{\\pi}_1$ having its probabilities of appearing in first place in \u03c01 and \u03c02 differ by at least $\\epsilon$. Define $\\hat{w}_1' = \\left(1 + \\frac{\\hat{w}_2}{\\hat{w}_1} \\cdot \\frac{\\hat{y}(e_{i^*})}{\\hat{x}(e_{i^*})}\\right)^{-1}$ and $\\hat{w}_2' = 1 - \\hat{w}_1'$. Let S1 be the subsample with $e_{i^*}$ ranked at the first position.\n4. For each ei that doesn\u2019t appear in either $\\hat{\\pi}_1$ or $\\hat{\\pi}_2$ and any possible position j it might belong to\n  (a) Use S to estimate $\\hat{f}(i \\to j)$ = Pr(ei goes to position j), and S1 to estimate $\\hat{f}(i \\to j \\mid e_{i^*} \\to 1)$ = Pr(ei goes to position j | $e_{i^*} \\to 1$).\n  (b) Solve the system\n    $\\hat{f}(i \\to j) = \\hat{w}_1 f^{(1)}(i \\to j) + \\hat{w}_2 f^{(2)}(i \\to j)$ (1)\n    $\\hat{f}(i \\to j \\mid e_{i^*} \\to 1) = \\hat{w}_1' f^{(1)}(i \\to j) + \\hat{w}_2' f^{(2)}(i \\to j)$ (2)\n5. To complete $\\hat{\\pi}_1$ assign each ei to position $\\arg\\max_j \\{f^{(1)}(i \\to j)\\}$. Similarly complete $\\hat{\\pi}_2$ using $f^{(2)}(i \\to j)$. Return the two permutations.\n\nAlgorithm 2 first attempts to find a pivot: an element ei which appears at a fairly high rank in one permutation, yet does not appear in the other prefix $\\hat{\\pi}_2$. Let $E_{e_i}$ be the event that a permutation ranks ei at the first position. As ei is a pivot, $\\Pr_{M_1}(E_{e_i})$ is noticeable whereas $\\Pr_{M_2}(E_{e_i})$ is negligible. Hence, conditioning on ei appearing at the first position leaves us with a subsample in which all sampled rankings are generated from the first model. This subsample allows us to easily retrieve the rest of \u03c01. Given \u03c01, the rest of \u03c02 can be recovered using a dynamic programming procedure. Refer to the supplementary material for details.\nThe more interesting case is when no such pivot exists, i.e., when the two prefixes of \u03c01 and \u03c02 contain almost the same elements. Yet, since we invoke RECOVER-REST after successfully calling TENSOR-DECOMP, it must hold that the distance between the obtained representative vectors $\\hat{x}$ and $\\hat{y}$ is noticeably large. Hence some element $e_{i^*}$ satisfies $|\\hat{x}(e_{i^*}) - \\hat{y}(e_{i^*})| > \\epsilon$, and we proceed by setting up a linear system. 
To find the complete rankings, we measure appropriate statistics to set\nup a system of linear equations to calculate f^(1)(i \u2192 j) and f^(2)(i \u2192 j) up to inverse polynomial\naccuracy. The largest of these values {f^(1)(i \u2192 j)} corresponds to the position of ei in the central\nranking of M1.\nTo compute the values {f^(r)(i \u2192 j)}_{r=1,2} we consider f^(1)(i \u2192 j|ei\u2217 \u2192 1), the probability that\nei is ranked at the jth position conditioned on the element ei\u2217 ranking first according to M1 (and\nresp. for M2). Using w\u20321 and w\u20322 as in Algorithm 2, it holds that\n\nPr (ei \u2192 j|ei\u2217 \u2192 1) = w\u20321 f^(1)(i \u2192 j|ei\u2217 \u2192 1) + w\u20322 f^(2)(i \u2192 j|ei\u2217 \u2192 1).\n\nWe need to relate f^(r)(i \u2192 j|ei\u2217 \u2192 1) to f^(r)(i \u2192 j). Indeed Lemma 10.1 shows that\nPr (ei \u2192 j|ei\u2217 \u2192 1) is an almost linear equation in the two unknowns. We show that if ei\u2217 is\nranked above ei in the central permutation, then for some small \u03b4 it holds that\n\nPr (ei \u2192 j|ei\u2217 \u2192 1) = w\u20321 f^(1)(i \u2192 j) + w\u20322 f^(2)(i \u2192 j) \u00b1 \u03b4\n\nWe refer the reader to Section 10 in the supplementary material for full details.\n\nThe HANDLE-DEGENERATE-CASES procedure. We call a mixture model w1M (\u03c61, \u03c01) \u2295\nw2M (\u03c62, \u03c02) degenerate if the parameters of the two Mallows models are equal, and the edit\ndistance between the prefixes of the two central rankings is at most two, i.e., by changing the positions\nof at most two elements in \u03c01 we retrieve \u03c02. We show that unless w1M (\u03c61, \u03c01) \u2295 w2M (\u03c62, \u03c02) is\ndegenerate, a random partition (Sa, Sb, Sc) is likely to satisfy the requirements of Lemma 3.2 (and\nTENSOR-DECOMP will be successful). Hence, if TENSOR-DECOMP repeatedly fails, we deduce our\nmodel is indeed degenerate. 
To show this, we characterize the uniqueness of decompositions of rank\n2, along with some very useful properties of random partitions. In such degenerate cases, we find\nthe two prefixes and then remove the elements in the prefixes from U, and recurse on the remaining\nelements. We refer the reader to Section 9 in the supplementary material for full details.\n\n4 Experiments\nGoal. The main contribution of our paper is devising an algorithm that provably learns any mixture\nof two Mallows models. But could it be the case that the previously existing heuristics, even though\nthey are unproven, still perform well in practice? We compare our algorithm to existing techniques,\nto see whether, and under what settings, our algorithm outperforms them.\nBaseline. We compare our algorithm to the popular EM-based algorithm of [5], since EM-based\nheuristics are the most popular way to learn a mixture of Mallows models. The EM algorithm starts\nwith a random guess for the two central permutations. At iteration t, EM maintains a guess as to\nthe two Mallows models that generated the sample. First (the expectation step) the algorithm assigns a\nweight to each ranking in our sample, where the weight of a ranking reflects the probability that it\nwas generated from the first or the second of the current Mallows models. Then (the maximization\nstep) the algorithm updates its guess of the models\u2019 parameters based on a local search, minimizing\nthe average distance to the weighted rankings in our sample. We comment that we implemented\nonly the version of our algorithm that handles non-degenerate cases (the more interesting case). In our\nexperiment the two Mallows models had parameters \u03c61 \u2260 \u03c62, so our setting was never degenerate.\nSetting. We ran both algorithms on synthetic data consisting of rankings of size n = 10. 
The\nweights were sampled u.a.r. from [0, 1], and the \u03c6-parameters were obtained by sampling ln(1/\u03c6)\nu.a.r. from [0, 5]. For d ranging from 0 to (n choose 2) we generated the two central rankings \u03c01 and \u03c02 to\nbe within distance d in the following manner. \u03c01 was always fixed as (1, 2, 3, . . . , 10). To describe\n\u03c02, observe that it suffices to note the number of inversions between 1 and elements 2, 3, ..., 10; the\nnumber of inversions between 2 and 3, 4, ..., 10 and so on. So we picked u.a.r. a non-negative integral\nsolution to x1 + . . . + xn = d which yields a feasible permutation and let \u03c02 be the permutation that\nit describes. Using these models\u2019 parameters, we generated N = 5 \u00b7 10^6 random samples.\nEvaluation Metric and Results. For each value of d, we ran both algorithms 20 times and counted\nthe fraction of times on which they returned the true rankings that generated the sample. The results\nof the experiment for rankings of size n = 10 are in Table 1. Clearly, the closer the two central\nrankings are to one another, the worse EM performs. On the other hand, our algorithm is able to\nrecover the true rankings even at very close distances. As the rankings get slightly farther, our\nalgorithm recovers the true rankings all the time. We comment that similar performance was observed\nfor other values of n as well. We also comment that our algorithm\u2019s runtime was reasonable (less\nthan 10 minutes on an 8-core Intel x86_64 computer). Surprisingly, our implementation of the EM\nalgorithm typically took much longer to run, because it simply did not converge.\n\ndistance between rankings | success rate of EM | success rate of our algorithm\n0 | 0% | 10%\n2 | 0% | 10%\n4 | 0% | 40%\n8 | 10% | 70%\n16 | 30% | 60%\n24 | 30% | 100%\n30 | 60% | 100%\n35 | 60% | 100%\n40 | 80% | 100%\n45 | 60% | 100%\n\nTable 1: Results of our experiment.\n\nReferences\n[1] C. L. Mallows. 
Non-null ranking models i. Biometrika, 44(1-2), 1957.\n[2] John I. Marden. Analyzing and Modeling Rank Data. Chapman & Hall, 1995.\n[3] Guy Lebanon and John Lafferty. Cranking: Combining rankings using conditional probability models on permutations. In ICML, 2002.\n[4] Thomas Brendan Murphy and Donal Martin. Mixtures of distance-based models for ranking data. Computational Statistics and Data Analysis, 41, 2003.\n[5] Marina Meila, Kapil Phadnis, Arthur Patterson, and Jeff Bilmes. Consensus ranking under the exponential model. Technical report, UAI, 2007.\n[6] Ludwig M. Busse, Peter Orbanz, and Joachim M. Buhmann. Cluster analysis of heterogeneous rank data. In ICML, 2007.\n[7] Bhushan Mandhani and Marina Meila. Tractable search for learning exponential models of rankings. Journal of Machine Learning Research - Proceedings Track, 5, 2009.\n[8] Tyler Lu and Craig Boutilier. Learning mallows models with pairwise preferences. In ICML, 2011.\n[9] Joel Oren, Yuval Filmus, and Craig Boutilier. Efficient vote elicitation under candidate uncertainty. In IJCAI, 2013.\n[10] H Peyton Young. Condorcet\u2019s theory of voting. The American Political Science Review, 1988.\n[11] Persi Diaconis. Group representations in probability and statistics. Institute of Mathematical Statistics, 1988.\n[12] Mark Braverman and Elchanan Mossel. Sorting from noisy information. CoRR, abs/0910.1191, 2009.\n[13] Marina Meila and Harr Chen. Dirichlet process mixtures of generalized mallows models. In UAI, 2010.\n[14] Sanjoy Dasgupta. Learning mixtures of gaussians. In FOCS, 1999.\n[15] Sanjeev Arora and Ravi Kannan. Learning mixtures of arbitrary gaussians. In STOC, 2001.\n[16] Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In COLT, 2005.\n[17] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two gaussians. In STOC, 2010.\n[18] A. 
Moitra and G. Valiant. Settling the polynomial learnability of mixtures of gaussians. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, 2010.\n[19] Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. CoRR, abs/1210.7559, 2012.\n[20] Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade. A method of moments for mixture models and hidden markov models. In COLT, 2012.\n[21] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In ITCS, 2013.\n[22] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixture models. J. Comput. Syst. Sci., 68(4), 2004.\n[23] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. In Symposium on the Theory of Computing (STOC), 2014.\n[24] M. G. Kendall. Biometrika, 30(1/2), 1938.\n[25] Aditya Bhaskara, Moses Charikar, and Aravindan Vijayaraghavan. Uniqueness of tensor decompositions with applications to polynomial identifiability. CoRR, abs/1304.8087, 2013.\n[26] Naveen Goyal, Santosh Vempala, and Ying Xiao. Fourier pca. In Symposium on the Theory of Computing (STOC), 2014.\n[27] R.P. Stanley. Enumerative Combinatorics. Number v. 1 in Cambridge studies in advanced mathematics. Cambridge University Press, 2002.\n", "award": [], "sourceid": 1363, "authors": [{"given_name": "Pranjal", "family_name": "Awasthi", "institution": "Princeton University"}, {"given_name": "Avrim", "family_name": "Blum", "institution": "CMU"}, {"given_name": "Or", "family_name": "Sheffet", "institution": "Carnegie Mellon University"}, {"given_name": "Aravindan", "family_name": "Vijayaraghavan", "institution": "Carnegie Mellon University"}]}