{"title": "A Structured Prediction Approach for Label Ranking", "book": "Advances in Neural Information Processing Systems", "page_first": 8994, "page_last": 9004, "abstract": "We propose to solve a label ranking problem as a structured output regression task. In this view, we adopt a least square surrogate loss\napproach that solves a supervised learning problem in two steps:\na regression step in a well-chosen feature space and a pre-image (or decoding) step. We use specific feature maps/embeddings for ranking data, which convert any ranking/permutation into a vector representation. These embeddings are all well-tailored for our approach, either by resulting in consistent estimators, or by solving trivially the pre-image problem which is often the bottleneck in structured prediction. Their extension to the case of incomplete or partial rankings is also discussed. Finally, we provide empirical results on synthetic and real-world datasets showing the relevance of our method.", "full_text": "A Structured Prediction Approach for Label Ranking\n\nAnna Korba, Alexandre Garcia, Florence d\u2019Alch\u00e9-Buc\n\nfirstname.lastname@telecom-paristech.fr\n\nLTCI, T\u00e9l\u00e9com ParisTech\nUniversit\u00e9 Paris-Saclay\n\nParis, France\n\nAbstract\n\nWe propose to solve a label ranking problem as a structured output regression task.\nIn this view, we adopt a least square surrogate loss approach that solves a supervised\nlearning problem in two steps: a regression step in a well-chosen feature space\nand a pre-image (or decoding) step. We use speci\ufb01c feature maps/embeddings for\nranking data, which convert any ranking/permutation into a vector representation.\nThese embeddings are all well-tailored for our approach, either by resulting in\nconsistent estimators, or by solving trivially the pre-image problem which is often\nthe bottleneck in structured prediction. Their extension to the case of incomplete or\npartial rankings is also discussed. 
Finally, we provide empirical results on synthetic\nand real-world datasets showing the relevance of our method.\n\n1\n\nIntroduction\n\nLabel ranking is a prediction task which aims at mapping input instances to a (total) order over a\ngiven set of labels indexed by {1, . . . , K}. This problem is motivated by applications where the\noutput re\ufb02ects some preferences, or order of relevance, among a set of objects. Hence there is an\nincreasing number of practical applications of this problem in the machine learning litterature. In\npattern recognition for instance (Geng and Luo, 2014), label ranking can be used to predict the\ndifferent objects which are the more likely to appear in an image among a prede\ufb01ned set. Similarly, in\nsentiment analysis, (Wang et al., 2011) where the prediction of the emotions expressed in a document\nis cast as a label ranking problem over a set of possible affective expressions. In ad targeting, the\nprediction of preferences of a web user over ad categories (Djuric et al., 2014) can be also formalized\nas a label ranking problem, and the prediction as a ranking guarantees that each user is quali\ufb01ed into\nseveral categories, eliminating overexposure. Another application is metalearning, where the goal\nis to rank a set of algorithms according to their suitability based on the characteristics of a target\ndataset and learning problem (see Brazdil et al. (2003); Aiguzhinov et al. (2010)). Interestingly,\nthe label ranking problem can also be seen as an extension of several supervised tasks, such as\nmulticlass classi\ufb01cation or multi-label ranking (see Dekel et al. (2004); F\u00fcrnkranz and H\u00fcllermeier\n(2003)). Indeed for these tasks, a prediction can be obtained by postprocessing the output of a label\nranking model in a suitable way. 
However, label ranking differs from other ranking problems, such as\nin information retrieval or recommender systems, where the goal is (generally) to predict a target\nvariable under the form of a rating or a relevance score (Cao et al., 2007).\nMore formally, the goal of label ranking is to map a vector x lying in some feature space X to a\nranking y lying in the space of rankings Y. A ranking is an ordered list of items of the set {1, . . . , K}.\nThese relations linking the components of the y objects induce a structure on the output space\nY. The label ranking task thus naturally enters the framework of structured output prediction for\nwhich an abundant litterature is available (Nowozin and Lampert, 2011). In this paper, we adopt\nthe Surrogate Least Square Loss approach introduced in the context of output kernels (Cortes et al.,\n2005; Kadri et al., 2013; Brouard et al., 2016) and recently theoretically studied by Ciliberto et al.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f(2016) and Osokin et al. (2017) using Calibration theory (Steinwart and Christmann, 2008). This\napproach divides the learning task in two steps: the \ufb01rst one is a vector regression step in a Hilbert\nspace where the outputs objects are represented through an embedding, and the second one solves a\npre-image problem to retrieve an output object in the Y space. In this framework, the algorithmic\ncomplexity of the learning and prediction tasks as well as the generalization properties of the resulting\npredictor crucially rely on some properties of the embedding. 
In this work we study and discuss some\nembeddings dedicated to ranking data.\nOur contribution is three folds: (1) we cast the label ranking problem into the structured prediction\nframework and propose embeddings dedicated to ranking representation, (2) for each embedding we\npropose a solution to the pre-image problem and study its algorithmic complexity and (3) we provide\ntheoretical and empirical evidence for the relevance of our method.\nThe paper is organized as follows. In section 2, de\ufb01nitions and notations of objects considered through\nthe paper are introduced, and section 3 is devoted to the statistical setting of the learning problem.\nsection 4 describes at length the embeddings we propose and section 5 details the theoretical and\ncomputational advantages of our approach. Finally section 6 contains empirical results on benchmark\ndatasets.\n\n2 Preliminaries\n\n2.1 Mathematical background and notations\n\nConsider a set of items indexed by {1, . . . , K}, that we will denote(cid:74)K(cid:75). Rankings, i.e. ordered lists\nof items of(cid:74)K(cid:75), can be complete (i.e, involving all the items) or incomplete and for both cases, they\nties ranking of the items in(cid:74)K(cid:75). It can be seen as a permutation, i.e a bijection \u03c3 :(cid:74)K(cid:75) \u2192(cid:74)K(cid:75),\n\ncan be without-ties (total order) or with-ties (weak order). A full ranking is a complete, and without-\n\nmapping each item i to its rank \u03c3(i). The rank of item i is thus \u03c3(i) and the item ranked at position\nj is \u03c3\u22121(j). We say that i is preferred over j (denoted by i (cid:31) j) according to \u03c3 if and only if i is\nranked lower than j: \u03c3(i) < \u03c3(j). The set of all permutations over K items is the symmetric group\nwhich we denote by SK. A partial ranking is a complete ranking including ties, and is also referred\nas a weak order or bucket order in the litterature (see Kenkre et al. (2011)). 
This includes in particular\nthe top-k rankings, that is to say partial rankings dividing items in two groups, the \ufb01rst one being the\nk \u2264 K most relevant items and the second one including all the rest. These top-k rankings are given\na lot of attention because of their relevance for modern applications, especially search engines or\nrecommendation systems (see Ailon (2010)). An incomplete ranking is a strict order involving only a\nsmall subset of items, and includes as a particular case pairwise comparisons, another kind of ranking\nwhich is very relevant in large-scale settings when the number of items to be ranked is very large.\nWe now introduce the main notations used through the paper. For any function f, Im(f ) denotes\nthe image of f, and f\u22121 its inverse. The indicator function of any event E is denoted by I{E}. We\nwill denote by sign the function such that for any x \u2208 R, sign(x) = I{x > 0} \u2212 I{x < 0}. The\nnotations (cid:107).(cid:107) and |.| denote respectively the usual l2 and l1 norm in an Euclidean space. Finally, for\n\nany integers a \u2264 b,(cid:74)a, b(cid:75) denotes the set {a, a + 1, . . . , b}, and for any \ufb01nite set C, #C denotes its\n\ncardinality.\n\n2.2 Related work\n\nAn overview of label ranking algorithms can be found in Vembu and G\u00e4rtner (2010), Zhou et al.\n(2014)), but we recall here the main contributions. One of the \ufb01rst proposed approaches, called\npairwise classi\ufb01cation (see F\u00fcrnkranz and H\u00fcllermeier (2003)) transforms the label ranking problem\ninto K(K \u2212 1)/2 binary classi\ufb01cation problems. For each possible pair of labels 1 \u2264 i < j \u2264 K,\nthe authors learn a model mij that decides for any given example whether i (cid:31) j or j (cid:31) i holds. The\nmodel is trained with all examples for which either i (cid:31) j or j (cid:31) i is known (all examples for which\nnothing is known about this pair are ignored). 
At prediction time, an example is submitted to all\nK(K \u2212 1)/2 classi\ufb01ers, and each prediction is interpreted as a vote for a label: if the classi\ufb01er mij\npredicts i (cid:31) j, this counts as a vote for label i. The labels are then ranked according to the number\nof votes. Another approach (see Dekel et al. (2004)) consists in learning for each label a linear\nutility function from which the ranking is deduced. Then, a large part of the dedicated literature was\ndevoted to adapting classical partitioning methods such as k-nearest neighbors (see Zhang and Zhou\n(2007), Chiang et al. (2012)) or tree-based methods, in a parametric (Cheng et al. (2010), Cheng et al.\n\n2\n\n\f(2009), Aledo et al. (2017)) or a non-parametric way (see Cheng and H\u00fcllermeier (2013), Yu et al.\n(2010), Zhou and Qiu (2016), Cl\u00e9men\u00e7on et al. (2017), S\u00e1 et al. (2017)). Finally, some approaches\nare rule-based (see Gurrieri et al. (2012), de S\u00e1 et al. (2018)). We will compare our numerical results\nwith the best performances attained by these methods on a set of benchmark datasets of the label\nranking problem in section 6.\n\n3 Structured prediction for label ranking\n\nthat we set to be SK the space of full rankings over the set of items(cid:74)K(cid:75). The quality of a prediction\n\n3.1 Learning problem\nOur goal is to learn a function s : X \u2192 Y between a feature space X and a structured output space Y,\ns(x) is measured using a loss function \u2206 : SK \u00d7 SK \u2192 R, where \u2206(s(x), \u03c3) is the cost suffered\nby predicting s(x) for the true output \u03c3. We suppose that the input/output pairs (x, \u03c3) come from\nsome \ufb01xed distribution P on X \u00d7 SK. 
The label ranking problem is then defined as:

minimize_{s : X → S_K} E(s),  with  E(s) = ∫_{X × S_K} ∆(s(x), σ) dP(x, σ).   (1)

In this paper, we propose to study how to solve this problem and its empirical counterpart for a family of loss functions based on some ranking embedding φ : S_K → F that maps the permutations σ ∈ S_K into a Hilbert space F:

∆(σ, σ') = ‖φ(σ) − φ(σ')‖²_F.   (2)

This loss presents two main advantages: first, there exist popular losses for ranking data that can take this form within a finite-dimensional Hilbert space F; second, this choice benefits from the theoretical results on Surrogate Least Square problems for structured prediction using the Calibration Theory of Ciliberto et al. (2016), and from the work of Brouard et al. (2016) on Structured Output Prediction within vector-valued Reproducing Kernel Hilbert Spaces. These works approach Structured Output Prediction along a common angle by introducing a surrogate problem involving a function g : X → F (with values in F) and a surrogate loss L(g(x), σ) to be minimized instead of Eq. (1). The surrogate loss is said to be calibrated if a minimizer of the surrogate loss is always optimal for the true loss (Calauzenes et al., 2012). In the context of true risk minimization, the surrogate problem for our case writes as:

minimize_{g : X → F} R(g),  with  R(g) = ∫_{X × S_K} L(g(x), φ(σ)) dP(x, σ),   (3)

with the following surrogate loss:

L(g(x), φ(σ)) = ‖g(x) − φ(σ)‖²_F.   (4)

The problem of Eq. (3) is in general easier to optimize since g takes values in F instead of the set of structured objects Y, here S_K. The solution of (3), denoted g*, can be written for any x ∈ X as: g*(x) = E[φ(σ)|x].
Eventually, a candidate pre-image s(x) for g*(x) can then be obtained by solving:

s(x) = argmin_{σ ∈ S_K} L(g*(x), φ(σ)).   (5)

In the context of Empirical Risk Minimization, a training sample S = {(x_i, σ_i), i = 1, ..., N} of N i.i.d. copies of the random variable (x, σ) is available. The Surrogate Least Square approach for Label Ranking Prediction decomposes into two steps:

• Step 1: minimize a regularized empirical risk to provide an estimator of the minimizer of the regression problem in Eq. (3):

minimize_{g ∈ H} R_S(g),  with  R_S(g) = (1/N) Σ_{i=1}^{N} L(g(x_i), φ(σ_i)) + Ω(g),   (6)

with an appropriate choice of hypothesis space H and complexity term Ω(g). We denote by ĝ a solution of (6).

• Step 2: solve, for any x in X, the pre-image problem that provides a prediction in the original space S_K:

ŝ(x) = argmin_{σ ∈ S_K} ‖φ(σ) − ĝ(x)‖²_F.   (7)

The pre-image operation can be written as ŝ(x) = d ∘ ĝ(x) with d the decoding function:

d(h) = argmin_{σ ∈ S_K} ‖φ(σ) − h‖²_F  for all h ∈ F,   (8)

applied on ĝ for any x ∈ X.

This paper studies how to leverage the choice of the embedding φ to obtain a good compromise between computational complexity and theoretical guarantees. Typically, the pre-image problem on the discrete set S_K (of cardinality K!) can be eased for appropriate choices of φ, as we show in section 4, leading to efficient solutions.
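As an illustration, the two steps above can be sketched as follows. This is our own minimal sketch, not the paper's implementation: it uses the permutation-matrix (Hamming) embedding of section 4.2, a k-nearest-neighbour average as the Step 1 estimate of g*(x) = E[φ(σ)|x], and a brute-force pre-image over S_K for Step 2 (only viable for small K; section 4 discusses efficient alternatives). All function names and the toy data are ours.

```python
import itertools
import numpy as np

def phi_hamming(sigma, K):
    """Permutation-matrix embedding: phi(sigma)[i, j] = 1 iff sigma(i) = j+1."""
    M = np.zeros((K, K))
    for i, rank in enumerate(sigma):
        M[i, rank - 1] = 1.0
    return M.ravel()

def knn_regress(x, X_train, Phi_train, k=3):
    """Step 1: estimate g*(x) = E[phi(sigma) | x] by averaging the k nearest neighbours."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]
    return Phi_train[idx].mean(axis=0)

def decode_brute_force(g_hat, K):
    """Step 2: pre-image (Eq. (7)) by exhaustive search over S_K (small K only)."""
    best, best_val = None, np.inf
    for sigma in itertools.permutations(range(1, K + 1)):
        v = np.sum((phi_hamming(sigma, K) - g_hat) ** 2)
        if v < best_val:
            best, best_val = sigma, v
    return best

# toy data: the single feature decides which of two rankings is observed
K = 3
X_train = np.array([[0.0], [0.1], [1.0], [1.1]])
rankings = [(1, 2, 3), (1, 2, 3), (3, 2, 1), (3, 2, 1)]
Phi_train = np.stack([phi_hamming(s, K) for s in rankings])

g_hat = knn_regress(np.array([0.05]), X_train, Phi_train, k=2)
print(decode_brute_force(g_hat, K))  # -> (1, 2, 3)
```

Swapping the embedding or the regressor only changes `phi_hamming` and `knn_regress`; the two-step structure is unchanged.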
In the same time, one would like to benefit from theoretical guarantees and control the excess risk of the proposed predictor ŝ.

In the following subsection we exhibit popular losses for ranking data that we will use for the label ranking problem.

3.2 Losses for ranking

We now present losses ∆ on S_K that we will consider for the label ranking task. A natural loss for full rankings, i.e. permutations in S_K, is a distance between permutations. Several distances on S_K are widely used in the literature (Deza and Deza, 2009), one of the most popular being the Kendall's τ distance, which counts the number of pairwise disagreements between two permutations σ, σ' ∈ S_K:

∆_τ(σ, σ') = Σ_{i<j} I[(σ(i) − σ(j))(σ'(i) − σ'(j)) < 0].   (9)

The maximal Kendall's τ distance is thus K(K−1)/2, the total number of pairs. Another well-spread distance between permutations is the Hamming distance, which counts the number of entries on which two permutations σ, σ' ∈ S_K disagree:

∆_H(σ, σ') = Σ_{i=1}^{K} I[σ(i) ≠ σ'(i)].   (10)

4.1 The Kemeny embedding

The Kemeny embedding maps a permutation σ to the vector of its pairwise comparisons:

φ_τ(σ) = (sign(σ(i) − σ(j)))_{1≤i<j≤K} ∈ R^{K(K−1)/2},   (11)

with entries denoted (φ_σ)_{i,j} for i < j. Thus to encode the transitivity constraint we introduce φ'_σ defined by (φ'_σ)_{i,j} = (φ_σ)_{i,j} if 1 ≤ i < j ≤ K and (φ'_σ)_{i,j} = −(φ_σ)_{j,i} otherwise, and write the ILP problem as follows:

φ̂_σ = argmin_{φ'_σ} Σ_{1≤i,j≤K} ĝ(x)_{i,j} (φ'_σ)_{i,j},   s.c.
  (φ'_σ)_{i,j} ∈ {−1, 1}  ∀ i, j,
  (φ'_σ)_{i,j} + (φ'_σ)_{j,i} = 0  ∀ i, j,
  −1 ≤ (φ'_σ)_{i,j} + (φ'_σ)_{j,k} + (φ'_σ)_{k,i} ≤ 1  ∀ i, j, k s.t. i ≠ j ≠ k.   (12)

Such a problem is NP-hard. In previous works (see Calauzenes et al.
(2012); Ramaswamy et al. (2013)), the complexity of designing calibrated surrogate losses for the Kendall's τ distance had already been investigated. In particular, Calauzenes et al. (2012) proved that there exists no convex K-dimensional calibrated surrogate loss for Kendall's τ distance. As a consequence, optimizing this type of loss has an inherent computational cost. However, in practice, branch-and-bound based ILP solvers find the solution of (12) in a reasonable time for a reduced number of labels K. We discuss the computational implications of choosing the Kemeny embedding in section 5.2. We now turn to the study of an embedding devoted to build a surrogate loss for the Hamming distance.

4.2 The Hamming embedding

Another well-spread embedding for permutations, that we will call the Hamming embedding, consists in mapping σ to its permutation matrix φ_H(σ):

φ_H : S_K → R^{K×K},  σ ↦ (I{σ(i) = j})_{1≤i,j≤K},

where we have embedded the set of permutation matrices Im(φ_H) ⊊ {0,1}^{K×K} into the Hilbert space (R^{K×K}, ⟨.,.⟩) with ⟨.,.⟩ the Frobenius inner product. This embedding shares similar properties with the Kemeny embedding: first, it is also of constant (Frobenius) norm, since ∀σ ∈ S_K, ‖φ_H(σ)‖ = √K. Then, the squared euclidean distance between the mappings of two permutations σ, σ' ∈ S_K recovers their Hamming distance (proving that φ_H is also injective): ‖φ_H(σ) − φ_H(σ')‖² = ∆_H(σ, σ').
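The two rank distances and the two embeddings seen so far can be checked numerically. The sketch below is our own illustration with hypothetical function names: with the unnormalized encodings used here (±1 pairwise signs, one-hot matrix rows), each disagreeing pair contributes (1 − (−1))² = 4 and each disagreeing position flips two matrix entries, so the squared embedding distances come out proportional to the corresponding distances (constants 4 and 2). Such constants leave the argmin of the pre-image problem unchanged.

```python
from itertools import combinations

def kendall_tau(s, t):
    """Eq. (9): number of pairs on which the two permutations disagree."""
    return sum(1 for i, j in combinations(range(len(s)), 2)
               if (s[i] - s[j]) * (t[i] - t[j]) < 0)

def hamming(s, t):
    """Eq. (10): number of positions on which the two permutations disagree."""
    return sum(si != ti for si, ti in zip(s, t))

def phi_kemeny(s):
    """Vector of pairwise comparison signs, one +/-1 entry per pair i < j."""
    return [1 if s[i] > s[j] else -1 for i, j in combinations(range(len(s)), 2)]

def phi_hamming(s):
    """Flattened permutation matrix (one-hot row per item)."""
    K = len(s)
    return [1 if s[i] == j + 1 else 0 for i in range(K) for j in range(K)]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

s, t = (2, 1, 4, 3), (1, 2, 3, 4)
print(kendall_tau(s, t), hamming(s, t))  # -> 2 4
```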
Once again, the pre-image problem consists in solving the linear program:

ŝ(x) = argmin_{σ ∈ S_K} −⟨φ_H(σ), ĝ(x)⟩,   (13)

which is, as for the Kemeny embedding previously, divided in a minimization step, i.e. find φ̂_σ = argmin_{φ_σ ∈ Im(φ_H)} −⟨φ_σ, ĝ(x)⟩, and an inversion step, i.e. compute σ = φ_H^{-1}(φ̂_σ). The inversion step is of complexity O(K²) since it involves scrolling through all the rows (items i) of the matrix φ̂_σ and all the columns (to find their positions σ(i)). The minimization step itself writes as the following problem:

φ̂_σ = argmax_{φ_σ} Σ_{1≤i,j≤K} ĝ(x)_{i,j} (φ_σ)_{i,j},   s.c.
  (φ_σ)_{i,j} ∈ {0, 1}  ∀ i, j,
  Σ_i (φ_σ)_{i,j} = Σ_j (φ_σ)_{i,j} = 1  ∀ i, j,   (14)

which can be solved with the Hungarian algorithm (see Kuhn (1955)) in O(K³) time. Now we turn to the study of an embedding which presents efficient algorithmic properties.

¹The Copeland method first assigns a score s_i to item i as s_i = Σ_{j≠i} I{σ(i) < σ(j)}, and then ranks the items by decreasing score.

4.3 Lehmer code

A permutation σ = (σ(1), ..., σ(K)) ∈ S_K may be uniquely represented via its Lehmer code (also called the inversion vector), i.e. a word of the form c_σ ∈ C_K := {0} × ⟦0,1⟧ × ⟦0,2⟧ × ··· × ⟦0, K−1⟧, where for j = 1, ..., K:

c_σ(j) = #{i ∈ ⟦K⟧ : i < j, σ(i) > σ(j)}.   (15)

The coordinate c_σ(j) is thus the number of elements i with index smaller than j whose rank σ(i) is larger than σ(j) in the permutation σ.
By default, c_σ(1) = 0 and is typically omitted. For instance, we have:

e     1 2 3 4 5 6 7 8 9
σ     2 1 4 5 7 3 6 9 8
c_σ   0 1 0 0 0 3 1 0 1

It is well known that the Lehmer code is bijective, and that the encoding and decoding algorithms have linear complexity O(K) (see Mareš and Straka (2007), Myrvold and Ruskey (2001)). This embedding has been recently used for ranking aggregation of full or partial rankings (see Li et al. (2017)). Our idea is thus to consider the following Lehmer mapping for label ranking:

φ_L : S_K → R^K,  σ ↦ (c_σ(i))_{i=1,...,K},

which maps any permutation σ ∈ S_K into the space C_K (that we have embedded into the Hilbert space (R^K, ⟨.,.⟩)). The loss function in the case of the Lehmer embedding is thus the following:

∆_L(σ, σ') = ‖φ_L(σ) − φ_L(σ')‖²,   (16)

which does not correspond to a known distance over permutations (Deza and Deza, 2009). Notice that |φ_L(σ)| = d_τ(σ, e) where e is the identity permutation, a quantity which is also called the number of inversions of σ. Therefore, in contrast to the previous mappings, the norm ‖φ_L(σ)‖ is not constant over σ ∈ S_K. Hence it is not possible to write the loss ∆_L(σ, σ') as −⟨φ_L(σ), φ_L(σ')⟩.² Moreover, this mapping is not distance preserving, and it can be proven that (1/(K−1)) ∆_τ(σ, σ') ≤ |φ_L(σ) − φ_L(σ')| ≤ ∆_τ(σ, σ') (see Wang et al. (2015)). However, the Lehmer embedding still enjoys great advantages. Firstly, its coordinates are decoupled, which will enable a trivial solving of the inverse image step (7).
Indeed we can write explicitly its solution as:

ŝ(x) = φ_L^{-1} ∘ d_L ∘ ĝ(x),  with  d_L : R^K → C_K,  (h_i)_{i=1,...,K} ↦ (argmin_{j ∈ ⟦0, i−1⟧} |h_i − j|)_{i=1,...,K},   (17)

where d = φ_L^{-1} ∘ d_L is the decoding function defined in (8). Then, there may be repetitions in the coordinates of the Lehmer embedding, allowing for a compact representation of the vectors.

²The scalar product ⟨φ_L(σ), φ_L(σ')⟩ of the embeddings of two permutations is not maximized for σ = σ'.

4.4 Extension to partial and incomplete rankings

In many real-world applications, one does not observe full rankings but only partial or incomplete rankings (see the definitions in section 2.1). We now discuss to what extent the embeddings we propose for permutations can be adapted to this kind of rankings as input data. Firstly, the Kemeny embedding can be naturally extended to partial and incomplete rankings since it encodes relative information about the positions of the items. Indeed, we propose to map any partial ranking σ̃ to the vector:

φ(σ̃) = (sign(σ̃(i) − σ̃(j)))_{1≤i<j≤K}.   (18)

The generalized representation c', defined in (19) below, takes into account ties. Clearly, c'_σ̃(j) ≥ c_σ̃(j) for all j ∈ ⟦K⟧. Given a partial ranking σ̃, it is possible to break its ties to convert it into a permutation σ as follows: for i, j ∈ ⟦K⟧², if σ̃(i) = σ̃(j) then σ(i) < σ(j) iff i < j. The entries j = 1, ..., K of the Lehmer codes of σ̃ (see (19)) and σ (see (15)) then verify the relations in (20), where IN_j = #{i ≤ j : σ̃(i) = σ̃(j)}.
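The Lehmer encoding (15), its tie-aware generalization for partial rankings, and the coordinate-wise decoding (17) can be sketched as follows. This is our own illustration: the naive encode and decode below run in O(K²), whereas linear-time versions exist (see Myrvold and Ruskey (2001)); positions are 0-indexed in the code.

```python
def lehmer_encode(sigma):
    """Eq. (15): c(j) = #{i < j : sigma(i) > sigma(j)} (0-indexed positions)."""
    return [sum(sigma[i] > sigma[j] for i in range(j)) for j in range(len(sigma))]

def lehmer_encode_ties(sigma):
    """Generalized code c' for partial rankings: counts i < j with sigma(i) >= sigma(j)."""
    return [sum(sigma[i] >= sigma[j] for i in range(j)) for j in range(len(sigma))]

def lehmer_decode(c):
    """Invert the code: scanning positions right to left, sigma(j) is the
    (j - c(j))-th smallest (0-based) among the values not yet placed."""
    K = len(c)
    avail = list(range(1, K + 1))
    sigma = [0] * K
    for j in range(K - 1, -1, -1):
        sigma[j] = avail.pop(j - c[j])
    return tuple(sigma)

def lehmer_round(h):
    """Eq. (17): coordinate-wise pre-image, clipping h_j to the valid range [0, j]."""
    return [min(max(int(round(hj)), 0), j) for j, hj in enumerate(h)]

sigma = (2, 1, 4, 5, 7, 3, 6, 9, 8)          # the worked example of section 4.3
print(lehmer_encode(sigma))                  # -> [0, 1, 0, 0, 0, 3, 1, 0, 1]
```

Because the coordinates are decoupled, noisy regression outputs are decoded independently per coordinate, which is what makes the pre-image step trivial for this embedding.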
For any partial ranking σ̃, the generalized Lehmer code is given by:

c'_σ̃(j) = #{i ∈ ⟦K⟧ : i < j, σ̃(i) ≥ σ̃(j)},   (19)

and the two codes are related through:

c_σ̃(j) = c_σ(j),   c'_σ̃(j) = c_σ(j) + IN_j − 1.   (20)

An example illustrating the extension of the Lehmer code to partial rankings is given in the Supplementary. However, computing each coordinate of the Lehmer code c_σ(j) for any j ∈ ⟦K⟧ requires to sum over the ⟦K⟧ items. As an incomplete ranking does not involve the whole set of items, it is also tricky to extend the Lehmer code to map incomplete rankings. Taking as input partial or incomplete rankings only modifies Step 1 of our method since it corresponds to the mapping step of the training data, and in Step 2 we still predict a full ranking. Extending our method to the task of predicting as output a partial or incomplete ranking raises several mathematical questions that we did not develop at length here because of space limitations. For instance, to predict partial rankings, a naive approach would consist in predicting a full ranking and then converting it to a partial ranking according to some threshold (i.e., keep the top-k items of the full ranking). A more formal extension of our method to make it able to predict directly partial rankings as outputs would require to optimize a metric tailored for this data and which could be written as in Eq. (2). A possibility for future work could be to consider the extension of the Kendall's τ distance with penalty parameter p for partial rankings proposed in Fagin et al. (2004).

5 Computational and theoretical analysis

5.1 Theoretical guarantees

In this section, we give some statistical guarantees for the estimators obtained by following the steps described in section 3. To this end, we build upon recent results in the framework of Surrogate Least Square by Ciliberto et al. (2016).
Consider one of the embeddings φ on permutations presented in the previous section, which defines a loss ∆ as in Eq. (2). Let c_φ = max_{σ ∈ S_K} ‖φ(σ)‖. We will denote by s* a minimizer of the true risk (1), g* a minimizer of the surrogate risk (3), and d a decoding function as in (8).³ Given an estimator ĝ of g* from Step 1, i.e. a minimizer of the empirical surrogate risk (6), we can then consider in Step 2 an estimator ŝ = d ∘ ĝ. The following theorem reveals how the performance of the estimator ŝ we propose can be related to a solution s* of (1) for the considered embeddings.

³Note that d = φ_L^{-1} ∘ d_L for φ_L, and is obtained as the composition of two steps for φ_τ and φ_H: solving an optimization problem and computing the inverse of the embedding.

Table 1: Embeddings and regressors complexities.

Embedding   Step 1 (a)   Step 2 (b)
φ_τ         O(K²N)       NP-hard
φ_H         O(KN)        O(K³N)
φ_L         O(KN)        O(KN)

Regressor   Step 1 (b)   Step 2 (a)
kNN         O(1)         O(Nm)
Ridge       O(N³)        O(Nm)

Theorem 1 The excess risks of the proposed predictors are linked to the excess surrogate risks as:

(i) For the loss (2) defined by the Kemeny and Hamming embeddings φ_τ and φ_H respectively:

E(d ∘ ĝ) − E(s*) ≤ c_φ √(R(ĝ) − R(g*)),

with c_{φ_τ} = √(K(K−1)/2) and c_{φ_H} = √K.

(ii) For the loss (2) defined by the Lehmer embedding φ_L:

E(d ∘ ĝ) − E(s*) ≤ √(K(K−1)/2) √(R(ĝ) − R(g*)) + E(d ∘ g*) − E(s*) + O(K√K).
Assertion (i) is a direct application of Theorem 2 in\nCiliberto et al. (2016). In particular, it comes from a preliminary consistency result which shows that\nE(d \u25e6 g\u2217) = E(s\u2217) for both embeddings. Concerning the Lehmer embedding, it is not possible to\napply their consistency results immediately; however a large part of the arguments of their proof is\nused to bound the estimation error for the surrogate risk, and we remain with an approximation error\nE(d \u25e6 g\u2217) \u2212 E(s\u2217) + O(K\nK) resulting in Assertion (ii). In Remark 2 in the Supplementary, we\ngive several insights about this approximation error. Firstly we show that it can be upper bounded\nby 2\nK). Then, we explain how this term results from using \u03c6L in\nthe learning procedure. The Lehmer embedding thus have weaker statistical guarantees, but has the\nadvantage of being more computationnally ef\ufb01cient, as we explain in the next subsection.\n\nNotice that for Step 1, one can choose a consistent regressor with vector values(cid:98)g, i.e such that\nR((cid:98)g) \u2192 R(g\u2217) when the number of training points tends to in\ufb01nity. Examples of such methods that\nwe use in our experiments to learn(cid:98)g, are the k-nearest neighbors (kNN) or kernel ridge regression\nsurrogate risk R((cid:98)g) \u2212 R(g\u2217) implies the control of E((cid:98)s) \u2212 E(s\u2217) where(cid:98)s = d \u25e6(cid:98)g by Theorem 1.\n\n(Micchelli and Pontil, 2005) methods whose consistency have been proved (see Chapter 5 in Devroye\net al. (2013) and Caponnetto and De Vito (2007)). In this case the control of the excess of the\n\n\u221a\n\nRemark 1 We clarify that the consistency results of Theorem 1 are established for the task of\npredicting full rankings which is adressed in this paper. In the case of predicting partial or incomplete\nrankings, these results are not guaranteed to hold. 
Providing theoretical guarantees for this task is\nleft for future work.\n\n5.2 Algorithmic complexity\n\nWe now discuss the algorithmic complexity of our approach. We recall that K is the number of\nitems/labels whereas N is the number of samples in the dataset. For a given embedding \u03c6, the\ntotal complexity of our approach for learning decomposes as follows. Step 1 in Section 3 can be\ndecomposed in two steps: a preprocessing step (Step 1 (a)) consisting in mapping the training sample\n{(xi, \u03c3i), i = 1, . . . , N} to {(xi, \u03c6(\u03c3i)), i = 1, . . . , N}, and a second step (Step 1 (b)) that consists\n\nin computing the estimator(cid:98)g of the Least squares surrogate empirical minimization (6). Then, at\nmapping new inputs to a Hilbert space using(cid:98)g (Step 2 (a)), and then solving the preimage problem (7)\n\nprediction time, Step 2 Section 3 can also be decomposed in two steps: a \ufb01rst one consisting in\n\n(Step 2 (b)). The complexity of a predictor corresponds to the worst complexity across all steps. The\ncomplexities resulting from the choice of an embedding and a regressor are summarized Table 1,\nwhere we denoted by m the dimension of the ranking embedded representations. The Lehmer\nembedding with kNN regressor thus provides the fastest theoretical complexity of O(KN ) at the\ncost of weaker theoretical guarantees. The fastest methods previously proposed in the litterature\ntypically involved a sorting procedure at prediction Cheng et al. (2010) leading to a O(N Klog(K))\ncomplexity. In the experimental section we compare our approach with the former (denoted as Cheng\n\n8\n\n\fPL), but also with the label wise decomposition approach in Cheng and H\u00fcllermeier (2013) (Cheng\nLWD) involving a kNN regression followed by a projection on SK computed in O(K 3N ), and the\nmore recent Random Forest Label Ranking (Zhou RF) Zhou and Qiu (2016). 
In their analysis, if d_X is the size of the input features and D_max the maximum depth of a tree, then RF have a complexity in O(D_max d_X K² N²).

6 Numerical Experiments

Finally we evaluate the performance of our approach on standard benchmarks. We present the results obtained with two regressors: Kernel Ridge regression (Ridge) and k-Nearest Neighbors (kNN). Both regressors were trained with the three embeddings presented in section 4. We adopt the same setting as Cheng et al. (2010) and report the results of our predictors in terms of the mean Kendall's τ coefficient:

k_τ = (C − D) / (K(K − 1)/2),   (21)

where C is the number of concordant pairs between two rankings and D the number of discordant pairs, computed from five repetitions of a ten-fold cross-validation (c.v.). Note that k_τ is an affine transformation of the Kendall's τ distance ∆_τ mapping onto the [−1, 1] interval. We also report the standard deviation of the resulting scores as in Cheng and Hüllermeier (2013). The parameters of our regressors were tuned in a five-fold inner c.v. for each training set.
We report our parameter grids in the supplementary materials.

Table 2: Mean Kendall's τ coefficient on benchmark datasets

Method          authorship    glass         iris          vehicle       vowel         wine
kNN Hamming     0.01±0.02     0.08±0.04     -0.15±0.13    -0.21±0.04    0.24±0.04     -0.36±0.04
kNN Kemeny      0.94±0.02     0.85±0.06     0.95±0.05     0.85±0.03     0.85±0.02     0.94±0.05
kNN Lehmer      0.93±0.02     0.85±0.05     0.95±0.04     0.84±0.03     0.78±0.03     0.94±0.06
ridge Hamming   -0.00±0.02    0.08±0.05     -0.10±0.13    -0.21±0.03    0.26±0.04     -0.36±0.03
ridge Lehmer    0.92±0.02     0.83±0.05     0.97±0.03     0.85±0.02     0.86±0.01     0.84±0.08
ridge Kemeny    0.94±0.02     0.86±0.06     0.97±0.05     0.89±0.03     0.92±0.01     0.94±0.05
Cheng PL        0.94±0.02     0.84±0.07     0.96±0.04     0.86±0.03     0.85±0.02     0.95±0.05
Cheng LWD       0.93±0.02     0.84±0.08     0.96±0.04     0.85±0.03     0.88±0.02     0.94±0.05
Zhou RF         0.91          0.89          0.97          0.86          0.87          0.95

The Kemeny and Lehmer embedding based approaches are competitive with the state-of-the-art methods on these benchmark datasets. The Hamming based methods give poor results in terms of k_τ but become the best choice when measuring the mean Hamming distance between predictions and ground truth (see Table 3 in the Supplementary). In contrast, the fact that the Lehmer embedding performs well for the optimization of the Kendall's τ distance highlights its practical relevance for label ranking. The Supplementary presents additional results (on additional datasets, and in terms of Hamming distance) showing that our method remains competitive with the state of the art.
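The Lehmer embedding discussed above admits a particularly simple encode/decode pair, which is what makes its pre-image step cheap (Section 5.2). The following is a naive O(K^2) sketch under one common convention for Lehmer codes; the function names are ours, the paper's released code may use a different convention, and linear-time variants are given in Myrvold and Ruskey (2001) and Mareš and Straka (2007):

```python
def lehmer_encode(sigma):
    """Naive O(K^2) Lehmer code: entry i counts the elements placed
    after position i that are smaller than sigma[i]."""
    K = len(sigma)
    return [sum(sigma[j] < sigma[i] for j in range(i + 1, K))
            for i in range(K)]

def lehmer_decode(code):
    """Invert the code in O(K^2): the pre-image is a direct read-off,
    with no combinatorial optimization needed."""
    items = list(range(len(code)))  # stays sorted; we only pop from it
    # the c-th smallest remaining item goes to the current position
    return [items.pop(c) for c in code]

sigma = [2, 0, 3, 1]
code = lehmer_encode(sigma)       # -> [2, 0, 1, 0]
assert lehmer_decode(code) == sigma
```

Rounding a regressed vector coordinate-wise to a valid code and decoding it is what makes prediction with this embedding fast, in contrast to the Kemeny embedding whose pre-image step requires solving an assignment problem.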
The code to reproduce our results is available at: https://github.com/akorba/Structured_Approach_Label_Ranking/

7 Conclusion

This paper introduces a novel framework for label ranking, based on the theory of surrogate least squares problems for structured prediction. The structured prediction approach we propose comes with theoretical guarantees and efficient algorithms, and its performance has been demonstrated on real-world datasets. Going forward, extensions of our methodology to the prediction of partial and incomplete rankings are to be investigated. In particular, the framework of prediction with abstention should be of interest.

References

Aiguzhinov, A., Soares, C., and Serra, A. P. (2010). A similarity-based adaptation of naive Bayes for label ranking: Application to the metalearning problem of algorithm recommendation. In International Conference on Discovery Science, pages 16–26. Springer.

Ailon, N. (2010). Aggregation of partial rankings, p-ratings and top-m lists. Algorithmica, 57(2):284–300.

Aledo, J. A., Gámez, J. A., and Molina, D. (2017). Tackling the supervised label ranking problem by bagging weak learners. Information Fusion, 35:38–50.

Brazdil, P. B., Soares, C., and Da Costa, J. P. (2003). Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3):251–277.

Brouard, C., Szafranski, M., and d'Alché-Buc, F. (2016). Input output kernel regression: supervised and semi-supervised structured output prediction with operator-valued kernels. Journal of Machine Learning Research, 17(176):1–48.

Calauzenes, C., Usunier, N., and Gallinari, P. (2012). On the (non-)existence of convex, calibrated surrogate losses for ranking. In Advances in Neural Information Processing Systems, pages 197–205.

Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., and Li, H. (2007). Learning to rank: from pairwise approach to listwise approach.
In Proceedings of the 24th Annual International Conference on Machine Learning (ICML-07), pages 129–136. ACM.

Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368.

Cheng, W., Hühn, J., and Hüllermeier, E. (2009). Decision tree and instance-based learning for label ranking. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML-09), pages 161–168. ACM.

Cheng, W. and Hüllermeier, E. (2013). A nearest neighbor approach to label ranking based on generalized labelwise loss minimization.

Cheng, W., Hüllermeier, E., and Dembczynski, K. J. (2010). Label ranking methods based on the Plackett-Luce model. In Proceedings of the 27th Annual International Conference on Machine Learning (ICML-10), pages 215–222.

Chiang, T.-H., Lo, H.-Y., and Lin, S.-D. (2012). A ranking-based kNN approach for multi-label classification. In Asian Conference on Machine Learning, pages 81–96.

Ciliberto, C., Rosasco, L., and Rudi, A. (2016). A consistent regularization approach for structured prediction. In Advances in Neural Information Processing Systems, pages 4412–4420.

Clémençon, S., Korba, A., and Sibony, E. (2017). Ranking median regression: Learning to order through local consensus. arXiv preprint arXiv:1711.00070.

Cortes, C., Mohri, M., and Weston, J. (2005). A general regression technique for learning transductions. In Proceedings of the 22nd Annual International Conference on Machine Learning (ICML-05), pages 153–160.

de Sá, C. R., Azevedo, P., Soares, C., Jorge, A. M., and Knobbe, A. (2018). Preference rules for label ranking: Mining patterns in multi-target relations. Information Fusion, 40:112–125.

Dekel, O., Singer, Y., and Manning, C. D. (2004). Log-linear models for label ranking.
In Advances in Neural Information Processing Systems, pages 497–504.

Devroye, L., Györfi, L., and Lugosi, G. (2013). A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media.

Deza, M. and Deza, E. (2009). Encyclopedia of Distances. Springer.

Djuric, N., Grbovic, M., Radosavljevic, V., Bhamidipati, N., and Vucetic, S. (2014). Non-linear label ranking for large-scale prediction of long-term user interests. In AAAI, pages 1788–1794.

Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., and Vee, E. (2004). Comparing and aggregating rankings with ties. In Proceedings of the Twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 47–58. ACM.

Fathony, R., Behpour, S., Zhang, X., and Ziebart, B. (2018). Efficient and consistent adversarial bipartite matching. In International Conference on Machine Learning, pages 1456–1465.

Fürnkranz, J. and Hüllermeier, E. (2003). Pairwise preference learning and ranking. In European Conference on Machine Learning, pages 145–156. Springer.

Geng, X. and Luo, L. (2014). Multilabel ranking with inconsistent rankers. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 3742–3747. IEEE.

Gurrieri, M., Siebert, X., Fortemps, P., Greco, S., and Słowiński, R. (2012). Label ranking: A new rule-based label ranking method. In International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, pages 613–623. Springer.

Jiao, Y., Korba, A., and Sibony, E. (2016). Controlling the distance to a Kemeny consensus without computing it. In Proceedings of the 33rd Annual International Conference on Machine Learning (ICML-16), pages 2971–2980.

Kadri, H., Ghavamzadeh, M., and Preux, P. (2013).
A generalized kernel approach to structured output learning. In Proceedings of the 30th Annual International Conference on Machine Learning (ICML-13), pages 471–479.

Kamishima, T., Kazawa, H., and Akaho, S. (2010). A survey and empirical comparison of object ranking methods. In Preference Learning, pages 181–201. Springer.

Kenkre, S., Khan, A., and Pandit, V. (2011). On discovering bucket orders from preference data. In Proceedings of the 2011 SIAM International Conference on Data Mining, pages 872–883. SIAM.

Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1-2):83–97.

Li, P., Mazumdar, A., and Milenkovic, O. (2017). Efficient rank aggregation via Lehmer codes. arXiv preprint arXiv:1701.09083.

Mareš, M. and Straka, M. (2007). Linear-time ranking of permutations. In European Symposium on Algorithms, pages 187–193. Springer.

Merlin, V. R. and Saari, D. G. (1997). Copeland method II: Manipulation, monotonicity, and paradoxes. Journal of Economic Theory, 72(1):148–172.

Micchelli, C. A. and Pontil, M. (2005). Learning the kernel function via regularization. Journal of Machine Learning Research, 6(Jul):1099–1125.

Myrvold, W. and Ruskey, F. (2001). Ranking and unranking permutations in linear time. Information Processing Letters, 79(6):281–284.

Nowozin, S. and Lampert, C. H. (2011). Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3–4):185–365.

Osokin, A., Bach, F. R., and Lacoste-Julien, S. (2017). On structured prediction theory with calibrated convex surrogate losses. In Advances in Neural Information Processing Systems (NIPS) 2017, pages 301–312.

Ramaswamy, H. G., Agarwal, S., and Tewari, A. (2013). Convex calibrated surrogates for low-rank loss matrices with applications to subset ranking losses.
In Advances in Neural Information Processing Systems, pages 1475–1483.

Sá, C. R., Soares, C. M., Knobbe, A., and Cortez, P. (2017). Label ranking forests.

Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer.

Vembu, S. and Gärtner, T. (2010). Label ranking algorithms: A survey. In Preference Learning, pages 45–64. Springer.

Wang, D., Mazumdar, A., and Wornell, G. W. (2015). Compression in the space of permutations. IEEE Transactions on Information Theory, 61(12):6417–6431.

Wang, Q., Wu, O., Hu, W., Yang, J., and Li, W. (2011). Ranking social emotions by learning listwise preference. In Pattern Recognition (ACPR), 2011 First Asian Conference on, pages 164–168. IEEE.

Yu, P. L. H., Wan, W. M., and Lee, P. H. (2010). Preference Learning, chapter Decision tree modelling for ranking data, pages 83–106. Springer, New York.

Zhang, M.-L. and Zhou, Z.-H. (2007). ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048.

Zhou, Y., Liu, Y., Yang, J., He, X., and Liu, L. (2014). A taxonomy of label ranking algorithms. JCP, 9(3):557–565.

Zhou, Y. and Qiu, G. (2016). Random forest for label ranking. arXiv preprint arXiv:1608.07710.