{"title": "Recursive Inversion Models for Permutations", "book": "Advances in Neural Information Processing Systems", "page_first": 631, "page_last": 639, "abstract": "We develop a new exponential family probabilistic model for permutations that can capture hierarchical structure, and that has the well known Mallows and generalized Mallows models as subclasses. We describe how one can do parameter estimation and propose an approach to structure search for this class of models. We provide experimental evidence that this added flexibility both improves predictive performance and enables a deeper understanding of collections of permutations.", "full_text": "Recursive Inversion Models for Permutations\n\nChristopher Meek\nMicrosoft Research\n\nRedmond, Washington 98052\nmeek@microsoft.com\n\nMarina Meil\u02d8a\n\nUniversity of Washington\nSeattle, Washington 98195\n\nmmp@stat.washington.edu\n\nAbstract\n\nWe develop a new exponential family probabilistic model for permutations that\ncan capture hierarchical structure and that has the Mallows and generalized Mal-\nlows models as subclasses. We describe how to do parameter estimation and pro-\npose an approach to structure search for this class of models. We provide experi-\nmental evidence that this added \ufb02exibility both improves predictive performance\nand enables a deeper understanding of collections of permutations.\n\n1\n\nIntroduction\n\nAmong the many probabilistic models over permutations, models based on penalizing inversions\nwith respect to a reference permutation have proved particularly elegant, intuitive, and useful. Typi-\ncally these generative models \u201cconstruct\u201d a permutation in stages by inserting one item at each stage.\nAn example of such models are the Generalized Mallows Models (GMMs) of Fligner and Verducci\n(1986). 
In this paper, we propose a superclass of the GMM, which we call the recursive inversion model (RIM), which allows more flexibility than the original GMM while preserving its elegant and useful properties of compact parametrization, tractable normalization constant, and interpretability of parameters. Essentially, while the GMM constructs a permutation sequentially by a stochastic insertion sort process, the RIM constructs one by a stochastic merge sort. In this sense, the RIM is a compactly parametrized Riffle Independence (RI) model (Huang & Guestrin, 2012) defined in terms of inversions rather than independence.\n\n2 Recursive Inversion Models\n\nWe are interested in probabilistic models of permutations of a set of elements E = {e1, ..., en}. We use \u03c0 \u2208 SE to denote a permutation (a total ordering) of the elements in E, and use ei <\u03c0 ej to denote that ei precedes ej in \u03c0. We define an n \u00d7 n (lower diagonal) discrepancy matrix Dij that captures the discrepancies between two permutations:\n\nDij(\u03c0, \u03c00) = 1 if i <\u03c0 j \u2227 j <\u03c00 i, and 0 otherwise.   (1)\n\nWe call the first argument of Dij(\u00b7, \u00b7) the test permutation (typically \u03c0) and the second argument the reference permutation (typically \u03c00).\nTwo classic models for permutations are the Mallows and the generalized Mallows models. The Mallows model is defined in terms of the inversion distance d(\u03c0, \u03c00) = \u2211ij Dij(\u03c0, \u03c00), which is the total number of inversions between \u03c0 and \u03c00 (Mallows, 1957). The Mallows model is then P (\u03c0|\u03c00, \u03b8) = (1/Z(\u03b8)) exp(\u2212\u03b8 d(\u03c0, \u03c00)), \u03b8 \u2208 R. Note that the normalization constant does not depend on \u03c00 but only on the concentration parameter \u03b8. 
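As a concrete illustration of definition (1) and the Mallows statistic (a sketch of ours, not the authors' code; the function names are hypothetical), the discrepancy matrix and inversion distance can be computed directly from two permutations:

```python
from itertools import combinations

def discrepancy_matrix(test, ref):
    """Lower-diagonal 0/1 matrix over reference positions: entry (i, j)
    is 1 iff the pair (ref[j], ref[i]) with j < i is inverted in `test`."""
    pos_test = {e: k for k, e in enumerate(test)}
    pos_ref = {e: k for k, e in enumerate(ref)}
    n = len(ref)
    D = [[0] * n for _ in range(n)]
    for a, b in combinations(ref, 2):   # a precedes b in the reference...
        if pos_test[a] > pos_test[b]:   # ...but follows b in the test
            D[pos_ref[b]][pos_ref[a]] = 1
    return D

def inversion_distance(test, ref):
    """d(test, ref): the total number of inversions (the Mallows statistic)."""
    return sum(map(sum, discrepancy_matrix(test, ref)))
```

For example, the test permutation (d, a, b, c) against reference (a, b, c, d) has three inversions, one for each pair that d jumps over.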
The Generalized Mallows model (GMM) of Fligner and Verducci (1986) extends the Mallows model by introducing a parameter for each of the elements in E and decomposes the inversion distance into a per-element distance1. In particular, we define vj(\u03c0, \u03c00), the number of inversions for element j in \u03c0 with respect to \u03c00, as vj(\u03c0, \u03c00) = \u2211i>\u03c00 j Dij(\u03c0, \u03c00). In this case, the GMM is defined as P (\u03c0|\u03c00, \u03b8) = (1/Z(\u03b8)) exp(\u2212\u2211e\u2208E \u03b8eve), \u03b8 \u2208 Rn. The GMM can be thought of as a stagewise model in which each of the elements in E is inserted according to the reference permutation \u03c00 into a list, where the parameter \u03b8e controls how likely the insertion of element e is to yield an inversion with respect to the reference permutation. For both of these models the normalization constant can be computed in closed form.\nOur RIMs generalize the GMM by replacing the sequence of single-element insertions with a sequence of recursive merges of subsequences where the relative order within the subsequences is preserved. For example, the sequence [a, b, c, d, e] can be obtained by merging the two subsequences [a, b, c] and [d, e] with zero inversions, and the sequence [a, d, b, e, c] can be obtained from these subsequences with 3 inversions. The RIM generates a permutation recursively by merging subsequences defined by a binary recursive decomposition of the elements in E, where the number of inversions is controlled by a separate parameter associated with each merge operation.\nMore formally, a RIM \u03c4 (\u03b8) for a set of elements E = {e1, . . . , en} has a structure \u03c4 that represents a recursive decomposition of the set E and a set of parameters \u03b8 \u2208 Rn\u22121. We represent a RIM as a binary tree with n = |E| leaves, each associated with a distinct element of E. 
We denote the set of internal vertices of the binary tree by I and represent each internal vertex as a triple i = (\u03b8i, iL, iR), where iL (iR) is the left (right) subtree and \u03b8i controls the number of inversions when merging the subsequences generated from each of the subtrees. Traversing the tree \u03c4 in preorder, with the left child preceding the right child, induces a permutation on E called the reference permutation of the RIM, which we denote as \u03c0\u03c4 .\nThe RIM is defined in terms of the vertex discrepancy, the number of inversions at (internal) vertex i = (\u03b8i, iL, iR) of \u03c4 (\u03b8) for test permutation \u03c0: vi(\u03c0, \u03c0\u03c4 ) = \u2211l\u2208Li \u2211r\u2208Ri Dlr(\u03c0, \u03c0\u03c4 ), where Li (Ri) is the subset of elements of E that appear as leaves of iL (iR), the left (right) subtree of internal vertex i. Note that the sum of the vertex discrepancies over the internal vertices is the inversion distance between \u03c0 and the reference permutation \u03c0\u03c4 . Finally, the likelihood of a permutation \u03c0 with respect to RIM \u03c4 (\u03b8) is as follows:\n\nP (\u03c0|\u03c4 ) \u221d \u220fi\u2208I exp(\u2212\u03b8ivi(\u03c0, \u03c0\u03c4 ))   (2)\n\nExample: For elements E = {a, b, c, d}, Figure 1 shows a RIM \u03c4 for preferences over four types of fruit. The reference permutation for this model is \u03c0\u03c4 = (a, b, c, d) and the modal permutation is (c, d, a, b) due to the sign of the root vertex. For test permutation \u03c0 = (d, a, b, c), we have that vroot(\u03c0, \u03c0\u03c4 ) = 2, vleft = 0, and vright = 1. 
Note that the model captures strong preferences between the pairs (a, b) and (c, d) and weak preferences between (c, a), (d, a), (c, b) and (d, b). This is an example of a set of preferences that cannot be captured in a GMM, as choosing a strong preference between the pairs (a, b) and (c, d) induces a strong preference between either (a, d) or (c, b), which differs in both strength and order from the example.\nNaive computation of the partition function Z(\u03c4 (\u03b8)) for a recursive inversion model would require a sum with n! summands (all permutations). We can, however, use the recursive structure of \u03c4 (\u03b8) to compute it as follows:\n\nFigure 1: An example of a RIM for fruit preferences among (a)pple, (b)anana, (c)herry, and (d)urian. The parameter for internal vertices indicates the preference between items in the left and right subtree, with 0 indicating no preference and a negative number indicating the right items are more preferable than the left items.\n\n1Note that a GMM can be parameterized in terms of n \u2212 1 parameters due to the fact that vn = 0.\n\nProposition 1\n\nZ(\u03c4 (\u03b8)) = \u220fi\u2208I G(|Li|, |Ri|; exp(\u2212\u03b8i))   (3)\n\nwhere\n\nG(n, m; q) = (q)_{n+m} / ((q)_n (q)_m) \u2261 Z_{n,m}(q).   (4)\n\nIn the above, G(n, m; q) is the Gaussian polynomial (Andrews, 1985) and (q)_n = \u220f_{i=1}^{n} (1 \u2212 q^i). The Gaussian polynomial is not defined for q = 1, so we extend the definition so that G(n, m; 1) = C(n + m, m), the binomial coefficient, which corresponds to the limit of the Gaussian polynomial as q approaches 1 (and \u03b8 approaches 0).\nNote that when all \u03b8i \u2265 0 the reference permutation \u03c0\u03c4 is also a modal permutation and that this modal permutation is unique when all \u03b8i > 0. 
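Proposition 1 can be turned into a few lines of code. The sketch below is ours, not the paper's: it assumes trees are encoded as nested (theta, left, right) tuples with leaf labels at the bottom, and extends the Gaussian polynomial at q = 1 to the binomial coefficient as in the text:

```python
from math import comb, exp, isclose

def gaussian_poly(n, m, q):
    """G(n, m; q) = (q)_{n+m} / ((q)_n (q)_m); q -> 1 limit is C(n+m, n)."""
    if isclose(q, 1.0):
        return float(comb(n + m, n))
    def qpoch(k):                       # (q)_k = prod_{i=1..k} (1 - q**i)
        p = 1.0
        for i in range(1, k + 1):
            p *= 1.0 - q ** i
        return p
    return qpoch(n + m) / (qpoch(n) * qpoch(m))

def rim_partition(tree):
    """Z(tau(theta)) via Proposition 1; returns (Z, number of leaves)."""
    if not isinstance(tree, tuple):     # a leaf contributes Z = 1
        return 1.0, 1
    theta, left, right = tree
    z_l, n_l = rim_partition(left)
    z_r, n_r = rim_partition(right)
    return z_l * z_r * gaussian_poly(n_l, n_r, exp(-theta)), n_l + n_r
```

As a sanity check, when all \u03b8i = 0 the model is uniform over permutations, so Z must equal n!.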
Also note that a GMM can be represented by using a chain-like tree structure in which each element in the reference permutation is split from the remaining elements one at a time.\n\n3 Estimating Recursive Inversion Models\n\nIn this section, we present a Maximum Likelihood (ML) approach to parameter and structure estimation from observed data D = {\u03c01, \u03c02, . . . \u03c0N} of permutations over E.\nParameter estimation is straightforward. Given a structure \u03c4 , we see from (2) that the likelihood factors according to the structure. In particular, a RIM is a product of exponential family models, one for each internal node i \u2208 I. Consequently, the (negative) log-likelihood given D decomposes into a sum\n\n\u2212 ln P (D|\u03c4 (\u03b8)) = \u2211i\u2208I [\u03b8i \u00afVi + ln Z_{|Li|,|Ri|}(e^{\u2212\u03b8i})]   (5)\n\nwhere the bracketed term is denoted score(i, \u03b8i) and \u00afVi = (1/|D|) \u2211\u03c0\u2208D vi(\u03c0, \u03c0\u03c4 ) is the sufficient statistic for node i from the data. This is a convex function of the parameters \u03b8i, and hence the ML estimate can be obtained by numerically solving a set of univariate minimization problems. In the remainder of the paper we use D to denote the sum of the discrepancy matrices for all of the observed data D with respect to the identity permutation. Note that this matrix provides a basis for efficiently computing the sufficient statistics of any RIM.\nIn the remainder of this section, we consider the problem of estimating the structure of a RIM from observed data, beginning with a brief exploration of the degree to which the structure of a RIM can be identified.\n\n3.1 Identifiability\n\nFirst, we consider whether the structure of a RIM can be identified from data. From the previous section, we know that the parameters are identifiable given the structure. 
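The univariate minimization of the per-node objective (5) above can be carried out by any convex solver; the paper does not specify one, so the sketch below (our own, with hypothetical names) uses ternary search, which is valid because score(i, \u03b8) is convex in \u03b8:

```python
from math import comb, exp, log, isclose

def score(theta, v_bar, L, R):
    """Per-node negative log-likelihood: theta * V-bar + ln Z_{L,R}(e^-theta).
    Z is the Gaussian polynomial of Proposition 1 (binomial at q = 1)."""
    q = exp(-theta)
    if isclose(q, 1.0):
        z = float(comb(L + R, L))
    else:
        def qpoch(k):                   # (q)_k = prod_{i=1..k} (1 - q**i)
            p = 1.0
            for i in range(1, k + 1):
                p *= 1.0 - q ** i
            return p
        z = qpoch(L + R) / (qpoch(L) * qpoch(R))
    return theta * v_bar + log(z)

def fit_theta(v_bar, L, R, lo=-10.0, hi=10.0, iters=100):
    """Ternary search on the convex objective; an illustrative solver,
    not the paper's implementation."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if score(m1, v_bar, L, R) < score(m2, v_bar, L, R):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)
```

For a single pair (L = R = 1), Z = 1 + e^{\u2212\u03b8}, and the minimizer has the closed form \u03b8 = ln((1 \u2212 \u00afV)/\u00afV), which the search recovers.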
However, the structure of a RIM can only be identified under suitable assumptions.\nThe first type of non-identifiability occurs when some \u03b8i parameters are zero. In this case, the permutation \u03c0\u03c4 is not identifiable, because switching the left and right child of a node i with \u03b8i = 0 will not change the distribution represented by the RIM. In fact, as shown by the next proposition, the left and right children can be swapped without changing the distribution if the sign of the parameter is changed.\n\nProposition 2 Let \u03c4 (\u03b8) be a RIM over E, D a matrix of sufficient statistics, and i any internal node of \u03c4 , with parameter \u03b8i and iL, iR its left and right children. Denote by \u03c4\u2032(\u03b8\u2032) the RIM obtained from \u03c4 (\u03b8) by switching iL, iR and setting \u03b8\u2032i = \u2212\u03b8i. Then P (\u03c0|\u03c4 (\u03b8)) = P (\u03c0|\u03c4\u2032(\u03b8\u2032)) for all permutations \u03c0 of E.\n\nThis proposition demonstrates that the structure of a RIM cannot be identified in general and that there is an equivalence class of alternative structures among which we cannot distinguish. We eliminate this particular type of non-identifiability by considering RIMs that are in canonical form. Proposition 2 provides a way to put any \u03c4 (\u03b8) in canonical form.\n\nAlgorithm 1 Algorithm CANONICALPERMUTATION\n\nInput any \u03c4 (\u03b8)\nfor each internal node i with parameter \u03b8i do\n  if \u03b8i < 0 then\n    \u03b8i \u2190 \u2212\u03b8i; switch left child with right child\n  end if\nend for\n\nProposition 3 For any matrix of sufficient statistics D, and any RIM \u03c4 (\u03b8), Algorithm CANONICALPERMUTATION does not change the log-likelihood.\n\nThe proof of correctness follows from repeated application of Proposition 2. 
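Algorithm 1 is a one-pass tree rewrite. In the nested-tuple encoding used in our earlier sketches (an assumed encoding, not the paper's), it reads:

```python
def canonicalize(tree):
    """CANONICALPERMUTATION: flip any node with a negative parameter,
    swapping its children and negating theta; by Proposition 2 this leaves
    the distribution unchanged. Trees are (theta, left, right) tuples."""
    if not isinstance(tree, tuple):
        return tree
    theta, left, right = tree
    left, right = canonicalize(left), canonicalize(right)
    return (-theta, right, left) if theta < 0 else (theta, left, right)

def reference_order(tree):
    """Preorder traversal of the leaves yields the reference permutation."""
    if not isinstance(tree, tuple):
        return [tree]
    return reference_order(tree[1]) + reference_order(tree[2])
```

Applied to the Figure 1 model (root parameter \u22120.1 over subtrees (a, b) and (c, d)), canonicalization flips the root, and the canonical reference order becomes (c, d, a, b), the modal permutation noted in the example.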
Moreover, if \u03b8i \u2260 0 for all i before applying CANONICALPERMUTATION, then the output of the algorithm will have all \u03b8i > 0.\nA further non-identifiability arises when parameters of the generating model are equal. It is easy to see that if all the parameters \u03b8i are equal to the same value \u03b8, then the likelihood of a permutation \u03c0 would be P (\u03c0|\u03c4, (\u03b8, . . . \u03b8)) \u221d exp(\u2212\u03b8d(\u03c0, \u03c0\u03c4 )), which is the likelihood corresponding to the Mallows model. In this case \u03c0\u03c4 is identifiable, but the internal structure is not. Similarly, if all the parameters \u03b8i are equal in a subtree of \u03c4 , then the structure in that subtree is not identifiable.\nWe say that a RIM \u03c4 (\u03b8) is locally identifiable iff \u03b8i \u2260 0 for all i \u2208 I and |\u03b8i| \u2260 |\u03b8i\u2032| whenever i is a child of i\u2032. We say that a RIM \u03c4 (\u03b8) is identifiable if there is a unique canonical RIM that represents the same distribution. The following proposition captures the degree to which one can identify the structure of a RIM.\n\nProposition 4 A RIM \u03c4 (\u03b8) is identifiable iff it is locally identifiable.\n\n3.2 ML estimation of \u03c4 for fixed \u03c0\u03c4 is tractable\n\nWe first consider ML estimation when we fix \u03c0\u03c4 , the reference permutation over the leaves in E. For the remainder of this section we assume that the optimal value of \u02c6\u03b8i for any internal node i is available (e.g., via the convex optimization problem described in the previous section). 
Hence, what remains to be estimated is the internal tree structure.\n\nProposition 5 For any set E, permutation \u03c0\u03c4 over E, and observed data D, the Maximum Likelihood RIM structure inducing this \u03c0\u03c4 can be computed in polynomial time by the Dynamic Programming algorithm STRUCTBYDP.\n\nProof sketch Note that there is a one-to-one correspondence between tree structures representing alternative binary recursive partitionings over a fixed permutation of E and alternative ways in which one can parenthesize the permutation of E. The negative log-likelihood decomposes according to the structure of the model, with the cost of a subtree rooted at i depending only on the structure of this subtree. Furthermore, this cost can be decomposed recursively into a sum of score(i, \u02c6\u03b8i) and the costs of iL, iR, the subtrees of i. The recursion is identical to the recursion of the \u201coptimal matrix chain multiplication\u201d problem, or to the \u201cinside\u201d part of the Inside-Outside algorithm in string parsing by SCFGs (Earley, 1970).\nWithout loss of generality, we consider that \u03c0\u03c4 is the identity, \u03c0\u03c4 = (e1, . . . en). For any subsequence ej, . . . , em of length l = m \u2212 j + 1, we define the variables cost(j, m), \u03b8(j, m), Z(j, m) that will store, respectively, the negative log-likelihood, the parameter at the root, and the Z for the root node of the optimal tree over the subsequence ej, . . . , em. If all the values of cost(j, m) are known for m \u2212 j + 1 < l, then the values of cost(j, j + l \u2212 1), \u03b8(j, j + l \u2212 1), Z(j, j + l \u2212 1) are obtained recursively from the existing values. We also maintain pointers back(j, m) that indicate which subtrees were used in obtaining cost(j, m). 
When cost(1, n) and the corresponding \u03b8 and Z are obtained, the optimal structure and its parameters have been found, and they can be read recursively by following the pointers back(j, m). Note that in the innermost loop, the quantities score(j, m), \u03b8(j, m), \u00afV are recalculated for each k.\nWe call the algorithm implementing this optimization STRUCTBYDP.\n\nAlgorithm 2 Algorithm STRUCTBYDP\n1: Input sample discrepancy matrix D computed from the observed data\n2: for m = 1 : n do\n3:   cost(m, m) \u2190 0\n4: end for\n5: for l \u2190 2 . . . n do\n6:   for j \u2190 1 : n \u2212 l + 1 do\n7:     m \u2190 j + l \u2212 1\n8:     cost(j, m) \u2190 \u221e\n9:     for k \u2190 j : m \u2212 1 do\n10:      calculate \u00afV = \u2211_{j\u2032=j}^{k} \u2211_{m\u2032=k+1}^{m} Dm\u2032j\u2032\n11:      L = k \u2212 j + 1, R = m \u2212 k\n12:      estimate \u03b8jm from L, R, \u00afV\n13:      calculate score(j, m) by (5)\n14:      s \u2190 cost(j, k) + cost(k + 1, m) + score(j, m)\n15:      if s < cost(j, m) then\n16:        cost(j, m) \u2190 s, back(j, m) \u2190 k\n17:        store \u03b8(j, m), ZLR(j, m)\n18:      end if\n19:    end for\n20:  end for\n21: end for\n\nAlgorithm 3 Algorithm SASEARCH\n\nInput set E, discrepancy matrix D computed from observed data, inverse temperature \u03b2\nInitialize Estimate GMM \u03c40 by BRANCH&BOUND, \u03c4 best = \u03c40\nfor t = 1, 2, . . . 
tmax do\n  while accept = FALSE do\n    sample \u03c0 \u223c P (\u03c0|\u03c4t\u22121)\n    \u03c4\u2032 \u2190 STRUCTBYDP(\u03c0, D)\n    \u03c4\u2032 \u2190 CANONICALPERMUTATION(\u03c4\u2032)\n    \u03c0\u2032 \u2190 reference order of \u03c4\u2032\n    \u03c4\u2032 \u2190 STRUCTBYDP(\u03c0\u2032, D)\n    accept = TRUE, u \u223c uniform[0, 1)\n    if e^{\u2212\u03b2(ln P (D|\u03c4t\u22121)\u2212ln P (D|\u03c4\u2032))} < u then\n      accept \u2190 FALSE\n    end if\n  end while\n  \u03c4t \u2190 \u03c4\u2032 (store accepted new model)\n  if P (D|\u03c4t) > P (D|\u03c4 best) then\n    \u03c4 best \u2190 \u03c4t\n  end if\nend for\nOutput \u03c4 best\n\nTo evaluate the running time of the STRUCTBYDP algorithm, we consider the inner loop over k for a given l. This loop computes \u00afV , \u02c6\u03b8, Z for each L, R split of l, with L + R = l. At first sight, this would take time cubic in l, since \u00afV is a summation over LR terms. However, one can notice that in the calculation of all \u00afV values over this submatrix of size l \u00d7 l, for L = 1, 2, . . . l \u2212 1, each of the Drl elements is added once to the sum, is kept in the sum for a number of steps, then is removed. Therefore, the total number of additions and subtractions is no more than twice l(l \u2212 1)/2, the number of submatrix elements. Estimating \u03b8 and the score involves computing Z by (3) (for the score) and its gradient (for the \u03b8 estimation). These take min(L, R) < l operations per iteration. If we consider the number of iterations to convergence a constant, then the inner loop over k takes O(l^2) operations. Since there are n \u2212 l subsequences of length l, it is easy now to see that the running time of the whole STRUCTBYDP algorithm is of the order n^4.\n\n3.3 A local search algorithm\n\nNext we develop a local search algorithm for the structure when a reference permutation is not provided. 
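The interval recursion behind STRUCTBYDP can be sketched compactly. The version below is illustrative (0-indexed, our own naming): `score(j, k, m)` is a caller-supplied stand-in for the fitted node score of merging span j..k with span k+1..m, which in the paper is computed from \u00afV and the estimated \u03b8:

```python
def struct_by_dp(n, score):
    """Interval DP in the spirit of STRUCTBYDP. Returns (cost, tree),
    where a tree is a leaf index or a tuple (split k, left, right)."""
    cost, back = {}, {}
    for j in range(n):
        cost[(j, j)] = 0.0              # single elements cost nothing
    for length in range(2, n + 1):      # grow spans from short to long
        for j in range(n - length + 1):
            m = j + length - 1
            cost[(j, m)] = float('inf')
            for k in range(j, m):       # try every split point
                s = cost[(j, k)] + cost[(k + 1, m)] + score(j, k, m)
                if s < cost[(j, m)]:
                    cost[(j, m)], back[(j, m)] = s, k

    def build(j, m):                    # read the tree off the back pointers
        if j == m:
            return j
        k = back[(j, m)]
        return (k, build(j, k), build(k + 1, m))

    return cost[(0, n - 1)], build(0, n - 1)
```

With a constant per-node score, any binary tree over n leaves has n \u2212 1 internal nodes, so the optimal cost is simply n \u2212 1; a data-driven score instead steers the split choices.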
In part, this approach can be motivated by previous work on structure estimation for the Mallows model, where the structure is a permutation. For these problems, researchers have found that an approach in which one greedily improves the log-likelihood by transposing adjacent elements, coupled with a good initialization, is a very effective approximate optimization method (Schalekamp & van Zuylen, 2009; Ali & Meila, 2011).\nWe take a similar approach and treat the problem as a search for good reference permutations, leveraging the STRUCTBYDP algorithm to find the structure given a reference permutation. At a high level, we initialize \u03c0\u03c4 = \u03c00 by estimating a GMM from the data D and then improve \u03c0\u03c4 by \u201clocal changes\u201d starting from \u03c00.\nWe rely on estimation of a GMM for initialization but, unfortunately, the ML estimation of a Mallows model, as well as that of a GMM, is NP-hard (Bartholdi et al., 1989). For the initialization, we can use any of the fast heuristic methods for estimating a Mallows model, or a more computationally expensive search algorithm. The latter approach, if the search space is small enough, can find a provably optimal permutation but, in most cases, it will return a suboptimal result.\nFor the local search, we make two changes with respect to the previous works, and we add a local optimization step specific to the class of Recursive Inversion models. First, we replace the greedy search with a simulated annealing search. Thus, we will generate proposal permutations \u03c0\u2032 near the current \u03c0. Second, the proposal permutations \u03c0\u2032 are not restricted to pairwise transpositions. Instead, we sample a permutation \u03c0\u2032 from the current RIM \u03c4t. 
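The proposal step "sample \u03c0 \u223c P (\u03c0|\u03c4t\u22121)" can be realized as a stochastic merge. The sketch below (our tree encoding as nested (theta, left, right) tuples, not the paper's code) uses the q-binomial recurrence G(n, m; q) = G(n\u22121, m; q) + q^n G(n, m\u22121; q): emitting a right element while n left elements remain creates n inversions, hence carries weight q^n:

```python
import random
from math import exp

def qbinom(n, m, q):
    """Gaussian polynomial via G(n,m;q) = G(n-1,m;q) + q**n * G(n,m-1;q);
    fine for small blocks (memoize for larger ones)."""
    if n == 0 or m == 0:
        return 1.0
    return qbinom(n - 1, m, q) + q ** n * qbinom(n, m - 1, q)

def sample_rim(tree, rng=random):
    """Draw one permutation from a RIM by recursive stochastic merging."""
    if not isinstance(tree, tuple):
        return [tree]
    theta, left, right = tree
    q = exp(-theta)
    a, b = sample_rim(left, rng), sample_rim(right, rng)
    out = []
    while a and b:
        # weight of all completions if the next element comes from a vs b
        w_left = qbinom(len(a) - 1, len(b), q)
        w_right = q ** len(a) * qbinom(len(a), len(b) - 1, q)
        if rng.random() < w_left / (w_left + w_right):
            out.append(a.pop(0))
        else:
            out.append(b.pop(0))
    return out + a + b
```

With a very large positive \u03b8 the merge is essentially deterministic and returns the reference order; a large negative \u03b8 reverses the two blocks, consistent with Proposition 2.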
The reason is that if some of the pairs e \u227a\u03c0\u03c4 e\u2032 are only weakly ordered by \u03c4t (which would happen if this ordering of e, e\u2032 is not well supported by the data), then the sampling process will be likely to create inversions between these pairs. Conversely, if \u03c4t puts a very high confidence on e \u227a e\u2032, then it is probable that this ordering is well supported by the data, and reversing it will be improbable in the proposed \u03c4 .\nFor each accepted proposal permutation \u03c0, we estimate the optimal structure \u03c4 given this \u03c0 and the optimal parameters \u02c6\u03b8 given the structure \u03c4 . Before sampling a new permutation from the RIM \u03c4 (\u02c6\u03b8), we apply CANONICALPERMUTATION, which does not change the log-likelihood, to convert \u03c4 (\u02c6\u03b8) into a canonical model, and perform another structure optimization step STRUCTBYDP. This has the chance of once again increasing the log-likelihood, and experimentally we find that it often does increase the log-likelihood significantly. We then use the estimated structure and associated parameters to sample a new permutation. These steps are implemented by algorithm SASEARCH.\n\n4 Related work\n\nIn addition to the Mallows and GMM models, our RIM model is related to the work of Mannila & Meek (2000). To understand the connection between this work and our RIM model, consider a restricted RIM model in which parameter values can be either 0 or \u221e. Such a model provides a uniform distribution over permutations consistent with a series-parallel partial order defined in terms of the binary recursive partition, where a parameter whose value is 0 corresponds to a parallel combination and a parameter value of \u221e corresponds to a series combination. 
The work of Mannila & Meek (2000) considers the problem of learning the structures and estimating the parameters of mixtures of these series-parallel RIM models using a local greedy search over recursive partitions of elements.\nAnother close connection exists between RIM models and the riffle independence (RI) models proposed by Huang et al. (2009); Huang & Guestrin (2012); Huang et al. (2012). Both approaches use a recursive partitioning of the set of elements to define a distribution over permutations. Unlike the RIM model, the RI model is not defined in terms of inversions but rather in terms of independence between the merging processes.\n\nFigure 2: Log-likelihood scores for the models alph, HG, and GMM as differences from the log-likelihood of SASEARCH output, on held-out sets from the Meath elections data (left) and Sushi data (middle). Train/test set split was 90/2400, respectively 300/4700, with 50 random replications. Negative scores indicate that a model has lower likelihood than the model obtained by SASEARCH. The far outlier(s) in Meath represent one run where SA scored poorly on the test set. Right: Most common structure and typical parameters learned for the sushi data. Interior nodes contain the associated parameter value, with higher values and darker green indicating a stronger ordering between the items in the left and right subtrees. The leaves are the different types of sushi.\n\nThe RI model requires exponentially more parameters than the RIM model due to the fact that the model defines a general distribution over mergings, which grows exponentially in the cardinality of the left and right sets of elements. In addition, the RI models do not have the same ease of interpretation as the RIM model. 
For instance, one cannot easily extract a reference permutation or modal permutation from a given RI model, and the comparison of alternative RI models, even when the two RI models have the same structure, is limited to the comparison of rank marginals and Fourier coefficients.\nIt is worth noting that a wide range of approaches using multiple reference permutations has been proposed. One benefit of such approaches is that they enable the model to capture multi-modal distributions over permutations. Examples include the mixture modeling approaches of Mannila & Meek (2000) discussed above and the work of Lebanon & Lafferty (2002) and Klementiev et al. (2008), where the model is a weighted product of a set of Mallows models, each with its own reference order. It is natural to consider both mixtures and products of RIM models.\n\n5 Experiments\n\nWe performed experiments on synthetic data and real-world data sets. In our synthetic experiments we found that our approach was typically able to identify both the structure and parameters of the generative model. More specifically, we ran extensive experiments with n = 16 and n = 33, choosing the model structures to have varying degrees of balance, and choosing the parameters randomly with exp(\u2212\u03b8i) between 0.4 and 0.9. We then used these RIMs to generate datasets containing varying numbers of permutations to investigate whether the true model could be recovered. We found that all models were recoverable with high probability when using between 200-1000 SASEARCH iterations. We did find that the identification of the correct tree structure in its entirety typically required a large sample size. We note that failures to identify the correct structure were typically due to the fact that alternative structures had higher likelihood than the generating structure in a particular sample rather than a failure of the search algorithm. 
While our experiments had at most n = 33, this was not due to the running time of the algorithms. For instance, STRUCTBYDP ran in a few seconds for domains with 33 items. For the smaller domains and for the real-world data below, the whole search with hundreds of accepted proposals typically ran in less than three minutes. In particular, this search was faster than the BRANCH&BOUND search for GMM models.\nIn our experiments on real-world data sets we examine two datasets. The first data set is an Irish House of Parliament election dataset from the Meath constituency in Ireland. The parliament uses the single transferable vote election system, in which voters rank candidates. There were 14 candidates in the 2002 election, running for five seats. Candidates are associated with the two major rival political parties, as well as a number of smaller parties. We use the roughly 2500 fully ranked ballots from the election. See Gormley & Murphy (2007) for more details about the dataset. The second dataset consists of 5,000 permutations of 10 different types of sushi, where the permutation captures preferences about sushi (Kamishima, 2003). The different types of sushi considered are: anago (sea eel), ebi (shrimp), ika (squid), kappa-maki (cucumber roll), maguro (tuna), sake (salmon), tamago (egg), tekka-maki (tuna roll), toro (fatty tuna), uni (sea urchin).\nWe compared a set of alternative recursive inversion models and approaches for identifying their structure. Our baseline approach, denoted alph, is one where the reference permutation is alphabetical and fixed, and we estimate the optimal structure given that order by STRUCTBYDP. Our second approach, GMM, is to use the BRANCH&BOUND algorithm of Mandhani & Meila (2009)2 to estimate a generalized Mallows Model. 
A third approach, HG, is to fit the optimal RIM parametrization to the hierarchical tree structure identified by Huang & Guestrin (2012) on the same data.3 Finally, we search over both structures and orderings with SASEARCH, with 150 (100) iterations for Meath (sushi) at temperature 0.02.\nThe quantitative results are shown in Figure 2. We plot the difference in test log-likelihood for each model as compared with SASEARCH. We see that on the Meath data SASEARCH outperforms alph in 94% of the runs, HG in 75%, and GMM in 98%; on the Sushi data, SASEARCH is always superior to alph and GMM, and has higher likelihood than HG in 75% of runs. On the training sets, SASEARCH always had the best fit (not shown).\nWe also investigated the structure and parameters of the learned models. For the Meath data we found that there was significant variation in the learned structure across runs. Despite the variation there were a number of substructures common to the learned models. Similar to the findings of Huang & Guestrin (2012) on the structure of a learned riffle independence model, we found that candidates from the same party were typically separated from candidates of other parties as a group. In addition, within these political clusters we found systematic preference orderings among the candidates. Thus, many substructures in our trees were also found in the HG tree. In addition, again as found by Huang & Guestrin (2012), we found that a single candidate in an extreme political party is typically split near the top of the hierarchy, with a \u03b8 \u2248 0, indicating that this candidate can be inserted anywhere in a ranking. We suspect that the inability of a GMM to capture such dependencies leads to its poor empirical performance relative to HG and the full search, which can capture such dependencies. 
We note that alph is allowed to have \u03b8i < 0, and therefore the alphabetic reference permutation does not represent a major handicap.\nFor the sushi data roughly 90% of the runs had the structure shown in Figure 2, with the other variants being quite similar. The structure found is interesting in a number of different ways. First, the model captures a strong preference between different varieties of tuna (toro, maguro and tekka), which corresponds with the typical price of these varieties. Second, the model captures a preference against tamago and kappa as compared with several other types of sushi; both of these varieties are distinct in that they are not varieties of fish but rather egg and cucumber, respectively. Finally, uni (sea urchin), which many people describe as being quite distinct in flavor, is ranked independently of preferences between other sushi and, additionally, there is no consensus on its rank.\n\n2www.stat.washington.edu/mmp/intransitive.html\n3We would have liked to make a direct comparison with the algorithm of Huang & Guestrin (2012), but the code was not available. Due to this, we aim only at comparing the quality of the HG structure, a structure found to model these data well albeit with a different estimation algorithm, with the structures found by SASEARCH.\n\nReferences\n\nAli, Alnur and Meila, Marina. Experiments with Kemeny ranking: What works when? Mathematical Social Sciences, Special Issue on Computational Social Choice, pp. (in press), 2011.\nAndrews, G. E. The Theory of Partitions. Cambridge University Press, 1985.\nBartholdi, J., Tovey, C. A., and Trick, M. Voting schemes for which it can be difficult to tell who won. Social Choice and Welfare, 6(2):157\u2013165, 1989.\nEarley, Jay. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94\u2013102, 1970.\nFligner, M. A. and Verducci, J. S. 
Distance based ranking models. Journal of the Royal Statistical Society B, 48:359\u2013369, 1986.\nGormley, I. C. and Murphy, T. B. A latent space model for rank data. In Proceedings of the 24th Annual International Conference on Machine Learning, pp. 90\u2013102, New York, 2007. ACM.\nHuang, Jonathan and Guestrin, Carlos. Uncovering the riffled independence structure of ranked data. Electronic Journal of Statistics, 6:199\u2013230, 2012.\nHuang, Jonathan, Guestrin, Carlos, and Guibas, Leonidas. Fourier theoretic probabilistic inference over permutations. Journal of Machine Learning Research, 10:997\u20131070, May 2009.\nHuang, Jonathan, Kapoor, Ashish, and Guestrin, Carlos. Riffled independence for efficient inference with partial rankings. Journal of Artificial Intelligence Research, 44:491\u2013532, 2012.\nKamishima, T. Nantonac collaborative filtering: recommendation based on order responses. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 583\u2013588, New York, 2003. ACM.\nKlementiev, Alexandre, Roth, Dan, and Small, Kevin. Unsupervised rank aggregation with distance-based models. In Proceedings of the 25th International Conference on Machine Learning, pp. 472\u2013479, New York, NY, USA, 2008. ACM.\nLebanon, Guy and Lafferty, John. Cranking: Combining rankings using conditional probability models on permutations. In Proceedings of the 19th International Conference on Machine Learning, pp. 363\u2013370, 2002.\nMallows, C. L. Non-null ranking models. Biometrika, 44:114\u2013130, 1957.\nMandhani, Bhushan and Meila, Marina. Better search for learning exponential models of rankings. In van Dyk, David and Welling, Max (eds.), Artificial Intelligence and Statistics AISTATS, number 12, 2009.\nMannila, Heikki and Meek, Christopher. Global partial orders from sequential data. 
In Proceedings of the Sixth Annual Conference on Knowledge Discovery and Data Mining (KDD), pp. 161\u2013168, 2000.\nSchalekamp, Frans and van Zuylen, Anke. Rank aggregation: Together we\u2019re strong. In Finocchi, Irene and Hershberger, John (eds.), Proceedings of the Workshop on Algorithm Engineering and Experiments, ALENEX 2009, New York, New York, USA, January 3, 2009, pp. 38\u201351. SIAM, 2009.\n", "award": [], "sourceid": 441, "authors": [{"given_name": "Christopher", "family_name": "Meek", "institution": "Microsoft Research"}, {"given_name": "Marina", "family_name": "Meila", "institution": "University of Washington"}]}