{"title": "Ranking Data with Continuous Labels through Oriented Recursive Partitions", "book": "Advances in Neural Information Processing Systems", "page_first": 4600, "page_last": 4608, "abstract": "We formulate a supervised learning problem, referred to as continuous ranking, where a continuous real-valued label Y is assigned to an observable r.v. X taking its values in a feature space X and the goal is to order all possible observations x in X by means of a scoring function s : X \u2192 R so that s(X) and Y tend to increase or decrease together with highest probability. This problem generalizes bi/multi-partite ranking to a certain extent and the task of finding optimal scoring functions s(x) can be naturally cast as optimization of a dedicated functional cri- terion, called the IROC curve here, or as maximization of the Kendall \u03c4 related to the pair (s(X), Y ). From the theoretical side, we describe the optimal elements of this problem and provide statistical guarantees for empirical Kendall \u03c4 maximiza- tion under appropriate conditions for the class of scoring function candidates. We also propose a recursive statistical learning algorithm tailored to empirical IROC curve optimization and producing a piecewise constant scoring function that is fully described by an oriented binary tree. Preliminary numerical experiments highlight the difference in nature between regression and continuous ranking and provide strong empirical evidence of the performance of empirical optimizers of the criteria proposed.", "full_text": "Ranking Data with Continuous Labels\nthrough Oriented Recursive Partitions\n\nStephan Cl\u00b4emenc\u00b8on\n\nMastane Achab\nLTCI, T\u00b4el\u00b4ecom ParisTech, Universit\u00b4e Paris-Saclay\n\n75013 Paris, France\n\nfirst.last@telecom-paristech.fr\n\nAbstract\n\nWe formulate a supervised learning problem, referred to as continuous ranking,\nwhere a continuous real-valued label Y is assigned to an observable r.v. 
X taking its values in a feature space X and the goal is to order all possible observations x in X by means of a scoring function s : X → R so that s(X) and Y tend to increase or decrease together with highest probability. This problem generalizes bi/multi-partite ranking to a certain extent and the task of finding optimal scoring functions s(x) can be naturally cast as optimization of a dedicated functional criterion, called the IROC curve here, or as maximization of the Kendall τ related to the pair (s(X), Y). From the theoretical side, we describe the optimal elements of this problem and provide statistical guarantees for empirical Kendall τ maximization under appropriate conditions for the class of scoring function candidates. We also propose a recursive statistical learning algorithm tailored to empirical IROC curve optimization and producing a piecewise constant scoring function that is fully described by an oriented binary tree. Preliminary numerical experiments highlight the difference in nature between regression and continuous ranking and provide strong empirical evidence of the performance of empirical optimizers of the criteria proposed.\n\n1 Introduction\n\nThe predictive learning problem considered in this paper can be easily stated in an informal fashion, as follows. Given a collection of objects of arbitrary cardinality, N ≥ 1 say, respectively described by characteristics x1, . . . , xN in a feature space X, the goal is to learn how to order them by increasing order of magnitude of a certain unknown continuous variable y. To fix ideas, the attribute y can represent the 'size' of the object and be difficult to measure, as for the physical measurement of microscopic bodies in chemistry and biology or the cash flow of companies in quantitative finance; the features x may then correspond to indirect measurements.
The most convenient way to define a preorder on a feature space X is to transport the natural order on the real line onto it by means of a (measurable) scoring function s : X → R: an object with characteristics x is then said to be 'larger' ('strictly larger', respectively) than an object described by x′ according to the scoring rule s when s(x′) ≤ s(x) (when s(x′) < s(x)). Statistical learning boils down here to building a scoring function s(x), based on a training data set Dn = {(X1, Y1), . . . , (Xn, Yn)} of objects for which the values of all variables (direct and indirect measurements) have been jointly observed, such that s(X) and Y tend to increase or decrease together with highest probability or, in other words, such that the ordering of new objects induced by s(x) matches that defined by their true measures as well as possible. This problem, which shall be referred to as continuous ranking throughout the article, can be viewed as an extension of bipartite ranking, where the output variable Y is assumed to be binary and the objective can be naturally formulated as a functional M-estimation problem by means of the concept of ROC curve, see [7].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nRefer also to [4], [11], [1] for approaches based on the optimization of summary performance measures such as the AUC criterion in the binary context. Generalization to the situation where the random label is ordinal and may take a finite number K ≥ 3 of values is referred to as multipartite ranking and has been recently investigated in [16] (see also e.g.
[14]), where distributional conditions guaranteeing that the ROC surface and the VUS criterion can be used to determine optimal scoring functions are exhibited in particular.\n\nIt is the major purpose of this paper to formulate the continuous ranking problem in a quantitative manner and to explore the connection between the latter and bi/multi-partite ranking. Intuitively, optimal scoring rules would also be optimal for any bipartite subproblem defined by thresholding the continuous variable Y with cut-off t > 0, separating the observations X such that Y < t from those such that Y > t. Viewing continuous ranking in this way, as a continuum of nested bipartite ranking problems, we provide here sufficient conditions for the existence of such (optimal) scoring rules and we introduce a concept of integrated ROC curve (IROC curve in abbreviated form) that may serve as a natural performance measure for continuous ranking, as well as the related notion of integrated AUC criterion, a summary scalar criterion, akin to Kendall tau. Generalization properties of empirical Kendall tau maximizers are discussed in the Supplementary Material. The paper also introduces a novel recursive algorithm that solves a discretized version of the empirical integrated ROC curve optimization problem, producing a scoring function that can be computed by means of a hierarchical combination of binary classification rules. Numerical experiments providing strong empirical evidence of the relevance of the approach promoted in this paper are also presented.\n\nThe paper is structured as follows. The probabilistic framework we consider is described and key concepts of bi/multi-partite ranking are briefly recalled in section 2. Conditions under which optimal solutions of the problem of ranking data with continuous labels exist are next investigated in section 3, while section 4 introduces a dedicated quantitative (functional) performance measure, the IROC curve.
The algorithmic approach we propose in order to learn scoring functions with nearly optimal IROC curves is presented at length in section 5. Numerical results are displayed in section 6. Some technical proofs are deferred to the Supplementary Material.\n\n2 Notation and Preliminaries\n\nThroughout the paper, the indicator function of any event E is denoted by I{E}. The pseudo-inverse of any cdf F(t) on R is denoted by F⁻¹(u) = inf{s ∈ R : F(s) ≥ u}, while U([0, 1]) denotes the uniform distribution on the unit interval [0, 1].\n\n2.1 The probabilistic framework\n\nGiven a continuous real-valued r.v. Y representing an attribute of an object, its 'size' say, and a random vector X taking its values in a (typically high-dimensional Euclidean) feature space X modelling other observable characteristics of the object (e.g. 'indirect measurements' of the size of the object), hopefully useful for predicting Y, the statistical learning problem considered here is to learn, from n ≥ 1 independent training observations Dn = {(X1, Y1), . . . , (Xn, Yn)}, drawn as the pair (X, Y), a measurable mapping s : X → R, that shall be referred to as a scoring function throughout the paper, so that the variables s(X) and Y tend to increase or decrease together: ideally, the larger the score s(X), the higher the size Y. For simplicity, we assume throughout the article that X = Rd with d ≥ 1 and that the support of Y's distribution is compact, equal to [0, 1] say. For any q ≥ 1, we denote by λq the Lebesgue measure on Rq equipped with its Borelian σ-algebra and suppose that the joint distribution FX,Y(dxdy) of the pair (X, Y) has a density fX,Y(x, y) w.r.t. the tensor product measure λd ⊗ λ1.
We also introduce the marginal distributions FY(dy) = fY(y)λ1(dy) and FX(dx) = fX(x)λd(dx), where fY(y) = ∫_{x∈X} fX,Y(x, y)λd(dx) and fX(x) = ∫_{y∈[0,1]} fX,Y(x, y)λ1(dy), as well as the conditional densities fX|Y=y(x) = fX,Y(x, y)/fY(y) and fY|X=x(y) = fX,Y(x, y)/fX(x). Observe incidentally that the probabilistic framework of the continuous ranking problem is quite similar to that of distribution-free regression. However, as shall be seen in the subsequent analysis, even if the regression function m(x) = E[Y | X = x] can be optimal under appropriate conditions, just like for regression, measuring ranking performance involves criteria that are of a different nature than the expected least square error, and plug-in rules may not be relevant for the goal pursued here, as depicted by Fig. 2 in the Supplementary Material.\n\nScoring functions. The set of all scoring functions is denoted by S here. Any scoring function s ∈ S defines a total preorder on the space X: ∀(x, x′) ∈ X², x ⪯s x′ ⇔ s(x) ≤ s(x′). We also set x ≺s x′ when s(x) < s(x′) and x =s x′ when s(x) = s(x′) for (x, x′) ∈ X².\n\n2.2 Bi/multi-partite ranking\n\nSuppose that Z is a binary label, taking its values in {−1, +1} say, assigned to the r.v. X. In bipartite ranking, the goal is to pick s in S so that, ideally, the larger s(X), the greater the probability that Z is equal to +1. In other words, the objective is to learn s(x) such that the r.v. s(X) given Z = +1 is as stochastically larger as possible than the r.v. s(X) given Z = −1: the difference between Ḡs(t) = P{s(X) ≥ t | Z = +1} and H̄s(t) = P{s(X) ≥ t | Z = −1} should thus be maximal for all t ∈ R. This can be naturally quantified by means of the notion of the ROC curve of a candidate s ∈ S, i.e.
the parametrized curve t ∈ R ↦ (H̄s(t), Ḡs(t)), which can be viewed as the graph of a mapping ROCs : α ∈ (0, 1) ↦ ROCs(α), connecting possible discontinuity points by linear segments (so that ROCs(α) = Ḡs ◦ Hs⁻¹(1 − α) when Hs has no flat part at Hs⁻¹(1 − α), where Hs = 1 − H̄s). A basic Neyman–Pearson theory argument shows that the optimal elements s∗(x) related to this natural (functional) bipartite ranking criterion (i.e. scoring functions whose ROC curve dominates any other ROC curve everywhere on (0, 1)) are transforms (T ◦ η)(x) of the posterior probability η(x) = P{Z = +1 | X = x}, where T : SUPP(η(X)) → R is any strictly increasing borelian mapping. Optimization of the curve in sup norm has been considered in [7] or in [8] for instance. However, given its functional nature, in practice the ROC curve of any s ∈ S is often summarized by the area under it, a performance measure that can be interpreted in a probabilistic manner, as the theoretical rate of concording pairs\n\nAUC(s) = P{s(X) < s(X′) | Z = −1, Z′ = +1} + (1/2) P{s(X) = s(X′) | Z = −1, Z′ = +1},   (1)\n\nwhere (X′, Z′) denotes an independent copy of (X, Z). A variety of algorithms aiming at maximizing the AUC criterion or surrogate pairwise criteria have been proposed and studied in the literature, among which [11], [15] or [3], whereas generalization properties of empirical AUC maximizers have been studied in [5], [1] and [12]. An analysis of the relationship between the AUC and the error rate is given in [9].\n\nExtension to the situation where the label takes at least three ordinal values (i.e. multipartite ranking) has also been investigated, see e.g. [14] or [6].
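Returning to the AUC criterion (1): it admits a simple empirical counterpart, obtained by averaging concordance indicators over all negative/positive pairs, with ties counting for one half. The following sketch (the function and toy data are ours, not part of the paper) illustrates the computation:

```python
import numpy as np

def empirical_auc(scores, labels):
    # Empirical rate of concording pairs, ties counted for 1/2, as in Eq. (1):
    # compares every negative instance with every positive instance.
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    neg, pos = scores[labels == -1], scores[labels == +1]
    diff = pos[None, :] - neg[:, None]   # one entry per negative/positive pair
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# a scorer that separates the two classes perfectly reaches AUC = 1
print(empirical_auc([0.1, 0.2, 0.8, 0.9], [-1, -1, +1, +1]))  # -> 1.0
```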
In [16], it is shown that, in contrast to the bipartite setup, the existence of optimal solutions cannot be guaranteed in general, and conditions on (X, Y)'s distribution ensuring that optimal solutions do exist and that extensions of bipartite ranking criteria such as the ROC manifold and the volume under it can be used for learning optimal scoring rules have been exhibited. An analogous analysis in the context of continuous ranking is carried out in the next section.\n\n3 Optimal elements in ranking data with continuous labels\n\nIn this section, a natural definition of the set of optimal elements for continuous ranking is first proposed. Existence and characterization of such optimal scoring functions are next discussed.\n\n3.1 Optimal scoring rules for continuous ranking\n\nConsidering a threshold value y ∈ [0, 1], a considerably weakened (and discretized) version of the problem stated informally above would consist in finding s so that the r.v. s(X) given Y > y is as stochastically larger than s(X) given Y < y as possible. This subproblem coincides with the bipartite ranking problem related to the pair (X, Zy), where Zy = 2I{Y > y} − 1. As briefly recalled in subsection 2.2, the optimal set S∗y is composed of the scoring functions that induce the same ordering as\n\nηy(X) = P{Y > y | X} = 1 − (1 − py)/(1 − py + pyΦy(X)),\n\nwhere py = 1 − FY(y) = P{Y > y} and Φy(X) = (dFX|Y>y/dFX|Y<y)(X). The set of optimal scoring rules for continuous ranking is then naturally defined as S∗ = ∩_{y∈(0,1)} S∗y (Definition 1); observe that these nested bipartite subproblems, and hence S∗, are invariant by strictly increasing transforms H of the label (since Y > y ⇔ H(Y) > H(y)).\n\n3.2 Existence and characterization of optimal scoring rules\n\nWe now investigate conditions guaranteeing the existence of optimal scoring functions for the continuous ranking problem.\n\nProposition 1. The following assertions are equivalent.\n\n1.
For all 0 < y < y′ < 1, for all (x, x′) ∈ X²: Φy(x) < Φy(x′) ⇒ Φy′(x) ≤ Φy′(x′).\n\n2. There exists an optimal scoring rule s∗ (i.e. S∗ ≠ ∅).\n\n3. The regression function m(x) = E[Y | X = x] is an optimal scoring rule.\n\n4. The collection of probability distributions FX|Y=y(dx) = fX|Y=y(x)λd(dx), y ∈ (0, 1), satisfies the monotone likelihood ratio property: there exist s∗ ∈ S and, for all 0 < y < y′ < 1, an increasing function φy,y′ : R → R+ such that: ∀x ∈ Rd, fX|Y=y′(x)/fX|Y=y(x) = φy,y′(s∗(x)).\n\nRefer to the Appendix section for the technical proof. Truth be told, checking Assertion 1 is a very challenging statistical task. However, through important examples, we now describe (not uncommon) situations where the conditions stated in Proposition 1 are fulfilled.\n\nExample 2. We give a few important examples of probabilistic models fulfilling the properties listed in Proposition 1.\n\n• Regression model. Suppose that Y = m(X) + ε, where m : X → R is a borelian function and ε is a centered r.v. independent from X. One may easily check that m ∈ S∗.\n\n• Exponential families. Suppose that fX|Y=y(x) = exp(κ(y)T(x) − ψ(y))f(x) for all x ∈ Rd, where f : Rd → R+ is borelian, κ : [0, 1] → R is a borelian strictly increasing function and T : Rd → R is a borelian mapping such that ψ(y) = log ∫_{x∈Rd} exp(κ(y)T(x))f(x)dx < +∞.\n\nWe point out that, although the regression function m(x) is an optimal scoring function when S∗ ≠ ∅, the continuous ranking problem does not coincide with distribution-free regression (notice incidentally that, in this case, any strictly increasing transform of m(x) belongs to S∗ as well).
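The closing observation above — any strictly increasing transform of m(x) is also an optimal scoring rule — can be checked numerically: pairwise concordance criteria such as the Kendall τ are invariant under increasing transforms of the score, while the least squares error is not. A small simulation sketch (the setup and names are ours, not part of the paper):

```python
import numpy as np
from itertools import combinations

def concordance(s, y):
    # empirical rate of concordant pairs, ties counted for 1/2
    pairs = list(combinations(range(len(s)), 2))
    c = sum(((s[i] - s[j]) * (y[i] - y[j]) > 0) + 0.5 * (s[i] == s[j])
            for i, j in pairs)
    return c / len(pairs)

rng = np.random.default_rng(0)
x = rng.uniform(size=100)
y = x ** 2                   # noise-free regression model: Y = m(X)
s1 = x ** 2                  # the regression function itself
s2 = np.exp(3.0 * s1)        # a strictly increasing transform of m
print(concordance(s1, y), concordance(s2, y))  # -> 1.0 1.0 (same ranking)
print(np.mean((s1 - y) ** 2) < np.mean((s2 - y) ** 2))  # -> True (MSE differs)
```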
As depicted by Fig. 2, the least squares criterion is not relevant to evaluate continuous ranking performance and naive plug-in strategies should be avoided, see Remark 3 below. Dedicated performance criteria are proposed in the next section.\n\n4 Performance measures for continuous ranking\n\nWe now investigate quantitative criteria for assessing the performance in the continuous ranking problem, which practical machine-learning algorithms may rely on. We place ourselves in the situation where the set S∗ is not empty, see Proposition 1 above.\n\nA functional performance measure. It follows from the view developed in the previous section that, for any (s, s∗) ∈ S × S∗ and for all y ∈ (0, 1), we have:\n\n∀α ∈ (0, 1), ROCs,y(α) ≤ ROCs∗,y(α) = ROC∗y(α),   (3)\n\ndenoting by ROCs,y the ROC curve of any s ∈ S related to the bipartite ranking subproblem (X, Zy) and by ROC∗y the corresponding optimal ROC curve, i.e. the ROC curve of strictly increasing transforms of ηy(x). Based on this observation, it is natural to design a dedicated performance measure by aggregating these 'sub-criteria'. Integrating over y w.r.t. a σ-finite measure µ with support equal to [0, 1], this leads to the following definition: IROCµ,s(α) = ∫ ROCs,y(α)µ(dy). The functional criterion thus defined inherits properties from the ROCs,y's (e.g. monotonicity, concavity). In addition, the curve IROCµ,s∗ with s∗ ∈ S∗ dominates everywhere on (0, 1) any other curve IROCµ,s for s ∈ S. However, except in pathological situations (e.g. when s(x) is constant), the curve IROCµ,s is not invariant when replacing Y's distribution by that of a strictly increasing transform H(Y). In order to guarantee that this desirable property is fulfilled (see Remark 1), one should integrate w.r.t.
Y's distribution (which boils down to replacing Y by the uniformly distributed r.v. FY(Y)).\n\nDefinition 2. (INTEGRATED ROC/AUC CRITERIA) The integrated ROC curve of any scoring rule s ∈ S is defined as: ∀α ∈ (0, 1),\n\nIROCs(α) = ∫_{y=0}^{1} ROCs,y(α) FY(dy) = E[ROCs,Y(α)].   (4)\n\nThe integrated AUC criterion is defined as the area under the integrated ROC curve: ∀s ∈ S,\n\nIAUC(s) = ∫_{α=0}^{1} IROCs(α) dα.   (5)\n\nThe following result reveals the relevance of the functional/summary criteria defined above for the continuous ranking problem. Additional properties of IROC curves are listed in the Supplementary Material.\n\nTheorem 1. Let s∗ ∈ S. The following assertions are equivalent.\n\n1. The assertions of Proposition 1 are fulfilled and s∗ is an optimal scoring function in the sense given by Definition 1.\n\n2. For all α ∈ (0, 1), IROCs∗(α) = E[ROC∗Y(α)].\n\n3. We have IAUC(s∗) = E[AUC∗Y], where AUC∗y = ∫_{α=0}^{1} ROC∗y(α)dα for all y ∈ (0, 1).\n\nIf S∗ ≠ ∅, then we have: ∀s ∈ S, for any α ∈ (0, 1),\n\nIROCs(α) ≤ IROC∗(α) def= E[ROC∗Y(α)],\nIAUC(s) ≤ IAUC∗ def= E[AUC∗Y].\n\nIn addition, for any borelian and strictly increasing mapping H : (0, 1) → (0, 1), replacing Y by H(Y) leaves the curves IROCs, s ∈ S, unchanged.\n\nEquipped with the notion defined above, a scoring rule s1 is said to be more accurate than another one s2 if IROCs2(α) ≤ IROCs1(α) for all α ∈ (0, 1). The IROC curve criterion thus provides a partial preorder on S.
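For intuition, the integrated criteria of Definition 2 can be estimated by averaging empirical ROC curves of the bipartite subproblems (X, Zy) over a grid of label thresholds. A rough sketch under these choices (the discretization grid, quantile-based thresholds and all names are ours, not part of the paper):

```python
import numpy as np

def empirical_roc(scores, z, alphas):
    # empirical ROC of the bipartite problem with labels z in {-1, +1},
    # evaluated at false positive rates alphas
    neg, pos = scores[z == -1], scores[z == +1]
    thresholds = np.quantile(neg, 1.0 - alphas)  # FPR alpha <-> (1-alpha)-quantile of negatives
    return np.array([(pos >= t).mean() for t in thresholds])

def empirical_iroc(scores, y, alphas, n_thresholds=9):
    # average the ROC curves of the subproblems (X, Z_y) over a grid of
    # label thresholds y (here: empirical quantiles of Y), as in Definition 2
    qs = np.quantile(y, np.linspace(0.1, 0.9, n_thresholds))
    curves = [empirical_roc(scores, np.where(y > t, 1, -1), alphas) for t in qs]
    return np.mean(curves, axis=0)

rng = np.random.default_rng(1)
x = rng.uniform(size=500)
y = x + 0.1 * rng.normal(size=500)  # the score x ranks y well, up to noise
alphas = np.linspace(0.01, 0.99, 50)
iroc = empirical_iroc(x, y, alphas)
iauc = iroc.mean()                  # Riemann approximation of the area (5)
print(0.8 < iauc <= 1.0)            # -> True for this well-ranked toy example
```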
Observe also that, by virtue of Fubini's theorem, we have IAUC(s) = ∫ AUCy(s)FY(dy) for all s ∈ S, denoting by AUCy(s) the AUC of s related to the bipartite ranking subproblem (X, Zy). Just like the AUC for bipartite ranking, the scalar IAUC criterion defines a full preorder on S for continuous ranking. Based on a training dataset Dn of independent copies of (X, Y), statistical versions of the IROC/IAUC criteria can be straightforwardly computed by replacing the distributions FY, FX|Y>t and FX|Y<t by their empirical counterparts. A summary numerical criterion can alternatively be obtained through the (theoretical) Kendall τ related to the pair (s(X), Y):\n\nd(s) = P{(s(X) − s(X′)) · (Y − Y′) > 0} + (1/2) P{s(X) = s(X′)} = P{s(X) < s(X′) | Y < Y′} + (1/2) P{X =s X′},   (7)\n\nwhere (X′, Y′) denotes an independent copy of (X, Y), observing that P{Y < Y′} = 1/2. The empirical counterpart of (7) based on the sample Dn is given by\n\nd̂n(s) = (2/(n(n − 1))) Σ_{i<j} [ I{(s(Xi) − s(Xj)) · (Yi − Yj) > 0} + (1/2) I{s(Xi) = s(Xj)} ].   (8)\n\nIn the bipartite setup, the analogous quantity is P{(s(X) − s(X′)) · (Z − Z′) > 0}, denoting by (X′, Z′) an independent copy of (X, Z).\n\nRemark 3. (CONNECTION TO DISTRIBUTION-FREE REGRESSION) Consider the nonparametric regression model Y = m(X) + ε, where ε is a centered r.v. independent from X. In this case, it is well known that the regression function m(X) = E[Y | X] is the (unique) solution of the expected least squares minimization. However, although m ∈ S∗, the least squares criterion is far from appropriate to evaluate ranking performance, as depicted by Fig. 2. Observe additionally that, in contrast to the criteria introduced above, an increasing transformation of the output variable Y may have a strong impact on the least squares minimizer: except for linear transforms, E[H(Y) | X] is not an increasing transform of m(X).\n\nRemark 4. (ON DISCRETIZATION) Bi/multi-partite algorithms are not directly applicable to the continuous ranking problem.
Indeed, a discretization of the interval [0, 1] would first be required, but this would raise a difficult question outside our scope: how should this discretization be chosen based on the training data? We believe that this approach is less efficient than ours, which relies on problem-specific criteria, namely IROC and IAUC.\n\nFigure 1: A scoring function described by an oriented binary subtree T. For any element x ∈ X, one may compute the quantity sT(x) very fast in a top-down fashion by means of the heap structure: starting from the initial value 2^J at the root node, at each internal node Cj,k, the score remains unchanged if x moves down to the left sibling, whereas one subtracts 2^{J−(j+1)} from it if x moves down to the right.\n\n5 Continuous Ranking through Oriented Recursive Partitioning\n\nIt is the purpose of this section to introduce the algorithm CRANK, a specific tree-structured learning algorithm for continuous ranking.\n\n5.1 Ranking trees and Oriented Recursive Partitions\n\nDecision trees undeniably figure among the most popular techniques, in supervised and unsupervised settings, refer to [2] or [13] for instance. This is essentially due to the visual model summary they provide, in the form of a binary tree graphic that permits describing predictions by means of a hierarchical combination of elementary rules of the type ”X(j) ≤ κ” or ”X(j) > κ”, comparing the value taken by a (quantitative) component of the input vector X (the split variable) to a certain threshold (the split value). In contrast to local learning problems such as classification or regression, predictive rules for a global problem such as ranking cannot be described by a (tree-structured) partition of the feature space: cells (corresponding to the terminal leaves of the binary decision tree) must be ordered so as to define a scoring function.
This leads to the definition of ranking trees as binary trees equipped with a ”left-to-right” orientation, defining a tree-structured collection of scoring functions, as depicted by Fig. 1. Binary ranking trees have been considered in the context of bipartite ranking in [7] or in [3], and in [16] in the context of multipartite ranking. The root node of a ranking tree TJ of depth J ≥ 0 represents the whole feature space X: C0,0 = X, while each internal node (j, k) with j < J and k ∈ {0, . . . , 2^j − 1} corresponds to a subset Cj,k ⊂ X, whose left and right siblings respectively correspond to disjoint subsets Cj+1,2k and Cj+1,2k+1 such that Cj,k = Cj+1,2k ∪ Cj+1,2k+1. Equipped with the left-to-right orientation, any subtree T ⊂ TJ defines a preorder on X, elements lying in the same terminal cell of T being equally ranked. The scoring function related to the oriented tree T can be written as:\n\nsT(x) = Σ_{Cj,k: terminal leaf of T} 2^J (1 − k/2^j) · I{x ∈ Cj,k}.   (9)\n\n5.2 The CRANK algorithm\n\nBased on Proposition 2, as mentioned in the Supplementary Material, one can try to build from the training dataset Dn a ranking tree by recursive empirical Kendall τ maximization. We propose below an alternative tree-structured recursive algorithm, relying on a (dyadic) discretization of the 'size' variable Y. At each iteration, the local sample (i.e. the data lying in the cell described by the current node) is split into two halves (the highest/smallest halves, depending on Y) and the algorithm calls a binary classification algorithm A to learn how to divide the node into right/left children.
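The recursive construction just described can be sketched in a much simplified form, using a one-dimensional decision stump as the base classifier A and assigning each terminal cell the midpoint of its score range to encode the orientation. This simplified stand-in for CRANK, and every name in it, are ours, not the paper's:

```python
import numpy as np

def grow(x, y, depth, lo=0.0, hi=1.0):
    # Recursive oriented partitioning sketch: split the cell at the median
    # of y, fit a stump as base classifier A, and hand the child capturing
    # the larger labels the upper half of the score range.
    cell = {'lo': lo, 'hi': hi}
    if depth == 0 or len(np.unique(x)) < 2:
        cell['score'] = (lo + hi) / 2.0   # one score per terminal cell
        return cell
    z = y > np.median(y)                  # labels of the bipartite subproblem
    cuts = np.unique(x)[:-1]
    c = float(cuts[np.argmin([np.mean((x > t) != z) for t in cuts])])
    mid = (lo + hi) / 2.0
    # orient: the side holding mostly large y gets the upper score range
    upper_right = z[x > c].mean() >= z[x <= c].mean()
    cell['cut'] = c
    cell['left'] = grow(x[x <= c], y[x <= c], depth - 1,
                        *((lo, mid) if upper_right else (mid, hi)))
    cell['right'] = grow(x[x > c], y[x > c], depth - 1,
                         *((mid, hi) if upper_right else (lo, mid)))
    return cell

def tree_score(cell, v):
    while 'cut' in cell:
        cell = cell['left'] if v <= cell['cut'] else cell['right']
    return cell['score']

# noise-free monotone toy data: the induced ordering must match that of y
xs = np.linspace(0.0, 1.0, 64)
tree = grow(xs, xs ** 2, depth=3)
scores = np.array([tree_score(tree, v) for v in xs])
print(bool(np.all(np.diff(scores) >= 0)))  # -> True
```

The real algorithm delegates each split to an arbitrary classifier A and scores cells via the left-to-right orientation of Eq. (9); the midpoint encoding above induces the same ordering up to an increasing rescaling.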
The theoretical analysis of this algorithm and its connection with approximation of IROC∗ are difficult questions that will be addressed in future work. Indeed, we found out that the IROC cannot be represented as a parametric curve, contrary to the ROC, which renders proofs much more difficult than in the bipartite case.\n\nTHE CRANK ALGORITHM\n\n1. Input. Training data Dn, depth J ≥ 1, binary classification algorithm A.\n\n2. Initialization. Set C0,0 = X.\n\n3. Iterations. For j = 0, . . . , J − 1 and k = 0, . . . , 2^j − 1,\n\n(a) Compute a median yj,k of the dataset {Yi : 1 ≤ i ≤ n, Xi ∈ Cj,k} and assign the binary label Zi = 2I{Yi > yj,k} − 1 to any data point i lying in Cj,k, i.e. such that Xi ∈ Cj,k.\n\n(b) Solve the binary classification problem related to the input space Cj,k and the training set {(Xi, Zi) : 1 ≤ i ≤ n, Xi ∈ Cj,k}, producing a classifier gj,k : Cj,k → {−1, +1}.\n\n(c) Set Cj+1,2k = {x ∈ Cj,k : gj,k(x) = +1} = Cj,k \\ Cj+1,2k+1.\n\n4. Output. Ranking tree TJ = {Cj,k : 0 ≤ j ≤ J, 0 ≤ k < 2^j}.\n\nOf course, the depth J should be chosen such that 2^J ≤ n. One may also consider continuing to split the nodes until the number of data points within a cell has reached a minimum specified in advance. In addition, it is well known that recursive partitioning methods fragment the data and the instability of splits increases with the depth. For this reason, a ranking subtree must be selected. The growing procedure above should classically be followed by a pruning stage, where children of the same parent are progressively merged until the root T0 is reached, and a subtree among the sequence T0 ⊂ . . . ⊂ TJ with nearly maximal IAUC should be chosen using cross-validation. Issues related to the implementation of the CRANK algorithm and variants (e.g.
exploiting randomization/aggregation) will be investigated in a forthcoming paper.\n\n6 Numerical Experiments\n\nIn order to illustrate the idea conveyed by Fig. 2 that the least squares criterion is not appropriate for the continuous ranking problem, we compared CRANK with CART on a toy example. Recall that the latter is a regression decision tree algorithm which minimizes the MSE (Mean Squared Error). We also ran an alternative version of CRANK which maximizes the empirical Kendall τ instead of the empirical IAUC: this method is referred to as KENDALL from now on. The experimental setting is composed of a unidimensional feature space X = [0, 1] (for visualization reasons) and a simple regression model without any noise: Y = m(X). Intuitively, a least squares strategy can miss slight oscillations of the regression function, which are critical in ranking when they occur in high-probability regions, as they affect the order over the feature space. The results are presented in Table 1. See the Supplementary Material for further details.\n\n          IAUC   Kendall τ   MSE\nCRANK     0.95   0.92        0.10\nKENDALL   0.94   0.93        0.10\nCART      0.61   0.58        7.4 × 10⁻⁴\n\nTable 1: IAUC, Kendall τ and MSE empirical measures\n\n7 Conclusion\n\nThis paper considers the problem of learning how to order objects by increasing 'size', modeled as a continuous r.v. Y, based on indirect measurements X. We provided a rigorous mathematical formulation of this problem, which finds many applications (e.g. quality control, chemistry) and is referred to as continuous ranking. In particular, necessary and sufficient conditions on (X, Y)'s distribution for the existence of optimal solutions are exhibited, and appropriate criteria have been proposed for evaluating the performance of scoring rules in these situations.
In contrast to distribution-free regression, where the goal is to recover the local values taken by the regression function, continuous ranking aims at reproducing the preorder it defines on the feature space as accurately as possible. The numerical results obtained via the algorithmic approaches we proposed for optimizing the aforementioned criteria highlight the difference in nature between these two statistical learning tasks.\n\nAcknowledgments\n\nThis work was supported by the industrial chair Machine Learning for Big Data from Télécom ParisTech and by a public grant (Investissement d'avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH).\n\nReferences\n\n[1] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization bounds for the area under the ROC curve. J. Mach. Learn. Res., 6:393–425, 2005.\n\n[2] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.\n\n[3] S. Clémençon, M. Depecker, and N. Vayatis. Ranking Forests. J. Mach. Learn. Res., 14:39–73, 2013.\n\n[4] S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and scoring using empirical risk minimization. In Proceedings of COLT 2005, volume 3559, pages 1–15. Springer, 2005.\n\n[5] S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and empirical risk minimization of U-statistics. The Annals of Statistics, 36:844–874, 2008.\n\n[6] S. Clémençon and S. Robbiano. The TreeRank Tournament algorithm for multipartite ranking. Journal of Nonparametric Statistics, 25(1):107–126, 2014.\n\n[7] S. Clémençon and N. Vayatis. Tree-based ranking methods. IEEE Transactions on Information Theory, 55(9):4316–4336, 2009.\n\n[8] S. Clémençon and N. Vayatis. The RankOver algorithm: overlaid classification rules for optimal ranking.
Constructive Approximation, 32:619–648, 2010.\n\n[9] C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, pages 313–320, 2004.\n\n[10] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.\n\n[11] Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.\n\n[12] A. K. Menon and R. C. Williamson. Bipartite ranking: a risk-theoretic perspective. Journal of Machine Learning Research, 17(195):1–102, 2016.\n\n[13] J. R. Quinlan. Induction of Decision Trees. Machine Learning, 1(1):1–81, 1986.\n\n[14] S. Rajaram and S. Agarwal. Generalization bounds for k-partite ranking. In NIPS 2005 Workshop on Learning to Rank, 2005.\n\n[15] A. Rakotomamonjy. Optimizing Area Under ROC Curve with SVMs. In Proceedings of the First Workshop on ROC Analysis in AI, 2004.\n\n[16] S. Clémençon, S. Robbiano, and N. Vayatis. Ranking data with ordinal labels: optimality and pairwise aggregation. Machine Learning, 91(1):67–104, 2013.\n", "award": [], "sourceid": 2407, "authors": [{"given_name": "Stéphan", "family_name": "Clémençon", "institution": "Telecom ParisTech"}, {"given_name": "Mastane", "family_name": "Achab", "institution": "Télécom ParisTech"}]}