{"title": "Log-Linear Models for Label Ranking", "book": "Advances in Neural Information Processing Systems", "page_first": 497, "page_last": 504, "abstract": "", "full_text": "Log-Linear Models for Label Ranking\n\nOfer Dekel\n\nComputer Science & Eng.\n\nHebrew University\n\nChristopher D. Manning\nComputer Science Dept.\n\nStanford University\n\nYoram Singer\n\nComputer Science & Eng.\n\nHebrew University\n\noferd@cs.huji.ac.il\n\nmanning@cs.stanford.edu\n\nsinger@cs.huji.ac.il\n\nAbstract\n\nLabel ranking is the task of inferring a total order over a prede\ufb01ned set of\nlabels for each given instance. We present a general framework for batch\nlearning of label ranking functions from supervised data. We assume that\neach instance in the training data is associated with a list of preferences\nover the label-set, however we do not assume that this list is either com-\nplete or consistent. This enables us to accommodate a variety of ranking\nproblems. In contrast to the general form of the supervision, our goal is\nto learn a ranking function that induces a total order over the entire set\nof labels. Special cases of our setting are multilabel categorization and\nhierarchical classi\ufb01cation. We present a general boosting-based learning\nalgorithm for the label ranking problem and prove a lower bound on the\nprogress of each boosting iteration. The applicability of our approach is\ndemonstrated with a set of experiments on a large-scale text corpus.\n\n1\n\nIntroduction\n\nThis paper discusses supervised learning of label rankings \u2013 the task of associating in-\nstances with a total order over a prede\ufb01ned set of labels. The ordering should be performed\nin accordance with some notion of relevance of the labels. That is, a label deemed relevant\nto an instance should be ranked higher than a label which is considered less relevant. 
With each training instance we receive supervision given as a set of preferences over the labels. Concretely, the supervision we receive with each instance is given in the form of a preference graph: a simple directed graph whose vertices are the labels. A directed edge from a label y to another label y′ denotes that, according to the supervision, y is more relevant to the instance than y′. We do not impose any further constraints on the structure of the preference graph.

The approach we employ distills and generalizes several learning settings. The simplest setting is multiclass categorization, in which each instance is associated with a single label out of k possible labels. Such a setting was discussed for instance in [10], where a boosting algorithm called AdaBoost.MR (MR stands for Multiclass Ranking) for solving this problem was described and analyzed. Using the graph representation for multiclass problems, the preference graph induced by the supervision has k vertices and k − 1 edges. A directed edge points from the (single) relevant label to each of the k − 1 irrelevant labels (Fig. 1a). An interesting and practical generalization of multiclass problems is multilabel problems [10, 6, 4], in which a set of relevant labels (rather than a single label) is associated with each instance. In this case the supervision is represented by a directed bipartite graph, where the relevant labels constitute one side of the graph and the irrelevant labels the other side, and there is a directed edge from each relevant label to each irrelevant label (Fig. 1b). Similar settings are also encountered in information retrieval and language processing tasks. In these settings the set of labels contains linguistic structures such as tags and parses [1, 12], and the goal is to produce a total order over, for instance, candidate parses. The supervision might consist of information that distinguishes three goodness levels (Fig. 1c); for instance, the Penn Treebank [13] has notations to mark not only the most likely correct parse, implicitly opposed to incorrect parses, but also other possibly correct parses involving different phrasal attachments (additional information that almost all previous work in parsing has ignored). Additionally, one can more fully rank the quality of the many candidate parses generated for a sentence based on how many constituents or dependencies each shares with the correct parse – much more directly and effectively approaching the metrics on which parser quality is usually assessed. For concreteness, we use the term label ranking for all of these problems.

[Figure 1: The supervision provided to the algorithm associates every training instance with a preference graph. Different graph topologies define different learning problems. Examples that fit naturally in our generalized setting: (a) multiclass single-label categorization where 1 is the correct label. (b) multiclass multilabel categorization where {1, 2} is the set of correct labels. (c) a multi-layer graph that encodes three levels of label "goodness", useful for instance in hierarchical multiclass settings. (d) a general (possibly cyclic) preference graph with no predefined structure.]

Our learning framework decomposes each preference graph into subgraphs, where the graph decomposition procedure may take a general form and can change as a function of the instances. Ranking algorithms, especially in multilabel categorization problems, often reduce the ranking task into multiple binary decision problems by enumerating over all pairs of labels [7, 6, 4].
Such a reduction can easily be accommodated within our framework by decomposing the preference graph into elementary subgraphs, each consisting of a single edge. Another approach is to compare a highly preferred label (such as the correct or best parse of a sentence) with less preferred labels. Such approaches can be analyzed within our framework by defining a graph decomposition procedure that generates a subgraph for each relevant label and the neighboring labels it is preferred over. Returning to multilabel settings, this decomposition amounts to a loss that counts the number of relevant labels which are wrongly ranked below irrelevant ones.

The algorithmic core of this paper is based on boosting-style algorithms for exponential models [2, 8]. Specifically, the boosting-style updates we employ build upon the construction used in [2] for solving multiclass problems. Our framework employing graph decomposition can also be used in other settings such as element ranking via projections [3, 11]. Furthermore, settings in which a semi-metric is defined over the label-set can also be reduced to the problem of label ranking, such as the parse ordering case mentioned above or when the labels are arranged in a hierarchical structure. We employ such a reduction in the category ranking experiments described in Sec. 4.

The paper is organized as follows: a formal description of our setting is given in Sec. 2. In Sec. 3 we present an algorithm for learning label ranking functions. We demonstrate the merits of our approach on the task of category-ranking in Sec. 4 and conclude in Sec. 5.

2 Problem Setting

Let X be an instance domain and let Y be a set of labels, possibly of infinite cardinality. A label ranking for an instance x ∈ X is a total order over Y, where y ≻ y′ implies that y is preferred over y′ as a label for x. A label ranking function f : X × Y → R induces a label ranking for x ∈ X by y ≻ y′ ⟺ f(x, y) > f(x, y′). Overloading our notation, we denote the label ranking induced by f for x by f(x).

We assume that we are provided with a set of base label-ranking functions, h_1, …, h_n, and aim to learn a linear combination of the form f(x, y) = Σ_{j=1}^n λ_j h_j(x, y). We are also provided with a training set S = {(x_i, G_i)}_{i=1}^m, where every example is comprised of an instance x_i ∈ X and a preference graph G_i. As defined in the previous section, a preference graph is a directed graph G = (V, E), for which the set of vertices V is defined to be the set of labels Y and E is some finite set of directed edges. Every edge e ∈ E in a directed graph is associated with an initial vertex, init(e) ∈ V, and a terminal vertex, term(e) ∈ V. The existence of a directed edge between two labels in a preference graph indicates that init(e) is preferred over term(e) and should be ranked higher. We require preference graphs to be simple, namely to have no more than a single edge between any pair of vertices and to not contain any self-loops. However, we impose no additional constraints on the supervision; namely, the set of edges in a preference graph may be sparse and may even include cycles. This form of supervision was chosen for its generality and flexibility. If Y is very large (possibly infinite), it would be unreasonable to require that the training data contain a complete total order over Y for every instance.

Informally, our goal is for the label ranking induced by f to be as consistent as possible with all of the preference graphs given in S. We say that f(x_i) disagrees with a preference graph G_i = (V_i, E_i) if there exists an edge e ∈ E_i for which f(x_i, init(e)) ≤ f(x_i, term(e)). Formally, we define a function δ that indicates when such a disagreement occurs:

  δ(f(x), G) = 1 if ∃e ∈ E s.t. f(x, init(e)) ≤ f(x, term(e)), and 0 otherwise.

A simple measure of empirical ranking accuracy immediately follows from the definition of δ: we define the 0−1 error attained by a ranking function f on a training set S to be the number of training examples for which f(x_i) disagrees with G_i, namely,

  ε_{0−1}(f, S) = Σ_{i=1}^m δ(f(x_i), G_i).

The 0−1 error may be natural for certain ranking problems; however, in general it is a rather crude measure of ranking inaccuracy, as it is invariant to the exact number of edges in G_i with which f(x_i) disagrees. Many ranking problems require a more refined notion of ranking accuracy. Thus, we define the disagreement error attained by f(x_i) with respect to G_i to be the fraction of edges in E_i with which f(x_i) disagrees. The disagreement error attained on the entire training set is the sum of disagreement errors over all training examples. Formally, we define the disagreement error attained on S as

  ε_dis(f, S) = Σ_{i=1}^m |{e ∈ E_i s.t. f(x_i, init(e)) ≤ f(x_i, term(e))}| / |E_i|.

Both the 0−1 error and the disagreement error are reasonable measures of ranking inaccuracy. It turns out that both are instances of a more general notion of ranking error, of which additional meaningful instances exist. The definition of this generalized error is slightly more involved but enables us to present a unified account of different measures of error.

The missing ingredient needed to define the generalized error is a graph decomposition procedure A that we assume is given together with the training data.
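As a concrete sketch (our own illustration, not code from the paper), the disagreement indicator δ and the two error measures above can be written directly from their definitions, with a preference graph represented as a list of directed (init, term) edges:

```python
# Illustrative sketch of the disagreement indicator delta, the 0-1 error,
# and the disagreement error defined above. A preference graph is a list of
# directed edges (init, term), meaning init should be ranked strictly above
# term; f(x, y) is any real-valued scoring function.

def delta(f, x, edges):
    """1 if f disagrees with some edge of the preference graph, else 0."""
    return int(any(f(x, init) <= f(x, term) for init, term in edges))

def zero_one_error(f, data):
    """eps_{0-1}: number of examples whose preference graph is violated."""
    return sum(delta(f, x, edges) for x, edges in data)

def disagreement_error(f, data):
    """eps_dis: sum over examples of the fraction of violated edges."""
    return sum(
        sum(f(x, i) <= f(x, t) for i, t in edges) / len(edges)
        for x, edges in data
    )

# Toy example (hypothetical scores): label "a" outranks "b", which outranks "c".
scores = {"a": 3.0, "b": 2.0, "c": 1.0}
f = lambda x, y: scores[y]
data = [(None, [("a", "b"), ("c", "b")])]  # the edge (c, b) is violated
print(zero_one_error(f, data), disagreement_error(f, data))  # prints: 1 0.5
```

Note how one violated edge out of two already makes the 0−1 error count the whole example, while the disagreement error charges only the violated fraction.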
A takes as its input a preference graph G_i and returns a set of s_i subgraphs of G_i, denoted {G_{i,1}, …, G_{i,s_i}}, where G_{i,k} = (V_i, E_{i,k}). Each subgraph G_{i,k} is itself a preference graph and therefore δ(f(x_i), G_{i,k}) is well defined. We now define the generalized error attained by f(x_i) with respect to G_i as the fraction of subgraphs in A(G_i) with which f(x_i) disagrees. The generalized error attained on S is the sum of generalized errors over all training instances. Formally, the generalized ranking error is defined as

  ε_gen(f, S, A) = Σ_{i=1}^m (1/s_i) Σ_{k=1}^{s_i} δ(f(x_i), G_{i,k}), where {G_{i,1}, …, G_{i,s_i}} = A(G_i).   (1)

[Figure 2: Applying different graph decomposition procedures induces different error functions: A1 induces ε_dis (here ε_dis = 3/8), A2 induces ε_Dom (ε_Dom = 2/4), and A3 induces ε_dom (ε_dom = 3/5). The errors above are with respect to the order 1 ≻ 2 ≻ 3 ≻ 4 ≻ 5. Dashed edges without arrowheads disagree with this total order, and the errors are the fraction of subgraphs that contain disagreeing edges.]

Previously used losses for label ranking are special cases of the generalized error and are derived by choosing an appropriate decomposition procedure A. For instance, when A is defined to be the identity transformation on graphs (A(G) = {G}), the generalized ranking error reduces to the 0−1 error.
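The generalized error of Eq. (1) can be sketched in the same style (again our own illustration; `decompose` plays the role of the procedure A, mapping an edge list to a list of edge-list subgraphs):

```python
# Sketch of the generalized ranking error of Eq. (1). `decompose` plays the
# role of the procedure A: it maps a preference graph (an edge list) to a
# list of subgraphs (edge lists). All names here are our own illustration.

def disagrees(f, x, edges):
    return any(f(x, i) <= f(x, t) for i, t in edges)

def generalized_error(f, data, decompose):
    total = 0.0
    for x, edges in data:
        subgraphs = decompose(edges)
        total += sum(disagrees(f, x, sg) for sg in subgraphs) / len(subgraphs)
    return total

# A(G) = {G} recovers the 0-1 error; one subgraph per edge recovers eps_dis.
identity = lambda edges: [edges]
single_edges = lambda edges: [[e] for e in edges]

scores = {"a": 3.0, "b": 2.0, "c": 1.0}
f = lambda x, y: scores[y]
data = [(None, [("a", "b"), ("c", "b")])]
print(generalized_error(f, data, identity))      # prints: 1.0
print(generalized_error(f, data, single_edges))  # prints: 0.5
```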
Alternatively, for a graph G with s edges, we can define A to return s different subgraphs of G, each consisting of a single edge from G (Fig. 2, top), and the generalized ranking error reduces to the disagreement error.

An additional meaningful measure of error is the domination error. A vertex is said to dominate the set of neighboring vertices that are connected to its outgoing edges. We would like every vertex in the preference graph to be ranked above all of its dominated neighbors. The domination error attained by f(x_i) with respect to G_i is the fraction of vertices with outgoing edges which are not ranked above all of their dominated neighbors. Formally, let A be the procedure that takes a preference graph G = (V, E) and returns a subgraph for each vertex with outgoing edges, each such subgraph consisting of a dominating vertex, its dominated neighbors, and the edges between them (Fig. 2, middle). Now define ε_Dom(f, S) = ε_gen(f, S, A). Minimizing the domination error is useful for solving multilabel classification problems. In these problems Y is of finite cardinality and every instance x_i is associated with a set of correct labels Y_i ⊆ Y. In order to reduce this problem to a ranking problem, we construct preference graphs G_i = (Y, E_i), where E_i contains edges from every vertex in Y_i to every vertex in Y \ Y_i. In this case, the domination loss simply counts the number of labels in Y_i that are not ranked above all of the labels in Y \ Y_i.

A final interesting measure of error is the dominated error, denoted ε_dom. The dominated error is proportional to the number of labels with incoming edges that are not ranked below all of the labels that dominate them. Its graph decomposition procedure is depicted at the bottom of Fig. 2. Additional instances of the generalized ranking error exist, and can be tailored to fit most ranking problems.
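The domination and dominated decompositions can be sketched as follows (our own illustration; plugged into a generalized-error computation, they induce ε_Dom and ε_dom respectively):

```python
from collections import defaultdict

# Sketch of the two decomposition procedures described above (function names
# are our own). Each maps an edge list to a list of subgraphs: one subgraph
# per dominating vertex for eps_Dom, one per dominated vertex for eps_dom.

def decompose_domination(edges):
    """One subgraph per vertex with outgoing edges (induces eps_Dom)."""
    out = defaultdict(list)
    for init, term in edges:
        out[init].append((init, term))
    return list(out.values())

def decompose_dominated(edges):
    """One subgraph per vertex with incoming edges (induces eps_dom)."""
    inc = defaultdict(list)
    for init, term in edges:
        inc[term].append((init, term))
    return list(inc.values())

# Multilabel example: relevant labels {1, 2}, irrelevant labels {3, 4}.
edges = [(1, 3), (1, 4), (2, 3), (2, 4)]
print([sorted(sg) for sg in decompose_domination(edges)])
# prints: [[(1, 3), (1, 4)], [(2, 3), (2, 4)]]
print([sorted(sg) for sg in decompose_dominated(edges)])
# prints: [[(1, 3), (2, 3)], [(1, 4), (2, 4)]]
```

On this multilabel graph the domination decomposition yields one subgraph per relevant label, so a violated subgraph corresponds exactly to a relevant label not ranked above every irrelevant one, as stated above.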
In the next section we set aside the specifics of the decomposition procedure and derive a minimization procedure for the generalized error.

INPUT: training data S = {(x_i, G_i)}_{i=1}^m s.t. x_i ∈ X and G_i is a preference graph,
  a decomposition procedure A, and a set of base ranking functions {h_1, …, h_n}.

DEFINE:
  π_{i,e,j} = h_j(x_i, term(e)) − h_j(x_i, init(e))   [1 ≤ i ≤ m, e ∈ E_i, 1 ≤ j ≤ n]
  ρ = max_{i,e} Σ_j |π_{i,e,j}|

INITIALIZE: λ_1 = (0, 0, …, 0)

ITERATE: for t = 1, 2, …
  q_{t,i,e} = Σ_{k : e ∈ E_{i,k}} exp(λ_t · π_{i,e}) / ( s_i (1 + Σ_{e′ ∈ E_{i,k}} exp(λ_t · π_{i,e′})) )   [1 ≤ i ≤ m, e ∈ E_i]
  W⁺_{t,j} = Σ_{i,e : π_{i,e,j} > 0} q_{t,i,e} π_{i,e,j}      W⁻_{t,j} = Σ_{i,e : π_{i,e,j} < 0} −q_{t,i,e} π_{i,e,j}   [1 ≤ j ≤ n]
  Λ_{t,j} = (1/2) ln( W⁺_{t,j} / W⁻_{t,j} )   [1 ≤ j ≤ n]
  λ_{t+1} = λ_t − Λ_t / ρ

Figure 3: A boosting-based algorithm for generalized label ranking.

3 Minimizing the Generalized Ranking Error

Our goal is to minimize the generalized error for a given training set S and graph decomposition procedure A. This task generalizes standard classification problems which are known to be NP-complete. Hence we do not attempt to minimize the error directly but rather minimize a smooth, strictly convex upper bound on ε_gen.
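A direct transcription of one iteration of the update in Fig. 3 might look as follows. This is a sketch under our own data layout, not the authors' code: `data` is a list of (x, subgraphs) pairs, where `subgraphs` is the output of the decomposition procedure A on that example's preference graph, each subgraph being a list of (init, term) edges, and `H` is the list of base ranking functions h_j(x, y).

```python
import math

# Sketch of one boosting iteration from Fig. 3 (our own transcription).

def pi(H, x, e):
    """pi_{i,e,j} = h_j(x, term(e)) - h_j(x, init(e)) for all j."""
    init, term = e
    return [h(x, term) - h(x, init) for h in H]

def boosting_step(lam, H, data):
    n = len(H)
    # rho = max over all edges of the L1 norm of pi_{i,e}
    rho = max(sum(abs(p) for p in pi(H, x, e))
              for x, subgraphs in data for sg in subgraphs for e in sg)
    Wp, Wm = [0.0] * n, [0.0] * n
    for x, subgraphs in data:
        s = len(subgraphs)
        q = {}  # q_{t,i,e}, accumulated over the subgraphs containing e
        for sg in subgraphs:
            expw = {e: math.exp(sum(l * p for l, p in zip(lam, pi(H, x, e))))
                    for e in sg}
            z = s * (1.0 + sum(expw.values()))
            for e in sg:
                q[e] = q.get(e, 0.0) + expw[e] / z
        for e, qe in q.items():
            for j, p in enumerate(pi(H, x, e)):
                if p > 0:
                    Wp[j] += qe * p
                elif p < 0:
                    Wm[j] -= qe * p
    # Lambda_{t,j} = (1/2) ln(W+ / W-), then lambda_{t+1} = lambda_t - Lambda_t / rho.
    # (The degenerate case W+ = 0 or W- = 0 is sidestepped here for brevity.)
    Lam = [0.5 * math.log(Wp[j] / Wm[j]) if Wp[j] > 0 and Wm[j] > 0 else 0.0
           for j in range(n)]
    return [l - L / rho for l, L in zip(lam, Lam)]

# Toy run: labels 0 and 1, one base function h(x, y) = y, and three examples
# whose single-edge subgraphs prefer label 1 twice and label 0 once.
H = [lambda x, y: float(y)]
data = [(None, [[(1, 0)]]), (None, [[(1, 0)]]), (None, [[(0, 1)]])]
lam = boosting_step([0.0], H, data)
print(lam)  # the weight becomes positive, favoring the majority preference
```

Starting from λ = 0, each edge gets weight q = 1/2, so W⁻ = 1.0 (the two edges preferring label 1) against W⁺ = 0.5, and the update moves λ to ½ ln 2 ≈ 0.347, satisfying the majority of the edges.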
The disagreement of f(x_i) and a preference graph G_{i,k} = (V_{i,k}, E_{i,k}) can be upper bounded by

  δ(f(x_i), G_{i,k}) ≤ log₂( 1 + Σ_{e ∈ E_{i,k}} exp( f(x_i, term(e)) − f(x_i, init(e)) ) ).

Denoting the right-hand side of the above by L(f(x_i), G_{i,k}), we define the loss attained by f on the entire training set S to be

  L(f, S, A) = Σ_{i=1}^m (1/s_i) Σ_{k=1}^{s_i} L(f(x_i), G_{i,k}), where {G_{i,1}, …, G_{i,s_i}} = A(G_i).

From the definition of the generalized error in Eq. (1), we conclude the upper bound ε_gen(f, S, A) ≤ L(f, S, A). A boosting-based algorithm that globally minimizes the loss is given in Fig. 3. On every iteration, a weight q_{t,i,e} is calculated for every edge in the training data, and the algorithm focuses on satisfying each edge in proportion to its weight. This set of weights plays the role of the distribution vector common in boosting algorithms for classification. The following theorem bounds the decrease in loss on every iteration of the algorithm by a non-negative auxiliary function.

Theorem 1 Let S = {(x_i, G_i)}_{i=1}^m be a training set such that every x_i ∈ X and every G_i is a preference graph. Let A be a graph decomposition procedure that defines for each preference graph G_i a set of subgraphs {G_{i,1}, …, G_{i,s_i}} = A(G_i). Denote by f_t the ranking function obtained at iteration t of the algorithm given in Fig. 3 (f_t = Σ_j λ_{t,j} h_j). Using the notation defined in Fig. 3, the decrease in loss on iteration t is bounded by

  L(f_t, S, A) − L(f_{t+1}, S, A) ≥ (1/ρ) Σ_{j=1}^n ( √W⁺_{t,j} − √W⁻_{t,j} )².

Proof Define Δ_{t,i,k} to be the difference between the loss attained by f_t and the loss attained by f_{t+1} on (x_i, G_{i,k}), that is, Δ_{t,i,k} = L(f_t(x_i), G_{i,k}) − L(f_{t+1}(x_i), G_{i,k}), and define φ_{t,i,k} = Σ_{e ∈ E_{i,k}} exp(λ_t · π_{i,e}). We can now rewrite L(f_t(x_i), G_{i,k}) as log(1 + φ_{t,i,k}). Using the inequality −log(1 − a) ≥ a (which holds when log(1 − a) is defined), we get

  Δ_{t,i,k} = log(1 + φ_{t,i,k}) − log(1 + φ_{t+1,i,k}) = −log( 1 − (φ_{t,i,k} − φ_{t+1,i,k}) / (1 + φ_{t,i,k}) )
    ≥ (φ_{t,i,k} − φ_{t+1,i,k}) / (1 + φ_{t,i,k})
    = Σ_{e ∈ E_{i,k}} ( exp(λ_t · π_{i,e}) − exp(λ_{t+1} · π_{i,e}) ) / ( 1 + Σ_{e′ ∈ E_{i,k}} exp(λ_t · π_{i,e′}) ).   (3)

The algorithm sets λ_{t+1} = λ_t − (1/ρ)Λ_t, and therefore exp(λ_{t+1} · π_{i,e}) in Eq. (3) can be replaced by exp(λ_t · π_{i,e}) exp(−(1/ρ)Λ_t · π_{i,e}), yielding:

  Δ_{t,i,k} ≥ Σ_{e ∈ E_{i,k}} [ exp(λ_t · π_{i,e}) / ( 1 + Σ_{e′ ∈ E_{i,k}} exp(λ_t · π_{i,e′}) ) ] ( 1 − exp( −(1/ρ)Λ_t · π_{i,e} ) ).

Summing both sides of the above over the subgraphs in A(G_i), and plugging in q_{t,i,e},

  (1/s_i) Σ_{k=1}^{s_i} Δ_{t,i,k} ≥ Σ_{e ∈ E_i} [ Σ_{k : e ∈ E_{i,k}} exp(λ_t · π_{i,e}) / ( s_i ( 1 + Σ_{e′ ∈ E_{i,k}} exp(λ_t · π_{i,e′}) ) ) ] ( 1 − exp( −(1/ρ)Λ_t · π_{i,e} ) )
    = Σ_{e ∈ E_i} q_{t,i,e} ( 1 − exp( −(1/ρ)Λ_t · π_{i,e} ) ).   (4)

We now rewrite −(1/ρ)Λ_t · π_{i,e} in a more convenient form:

  −(1/ρ)Λ_t · π_{i,e} = −(1/ρ) Σ_{j=1}^n Λ_{t,j} π_{i,e,j} = Σ_{j=1}^n ( |π_{i,e,j}|/ρ ) ( −sign(π_{i,e,j}) Λ_{t,j} ).   (5)

The rationale behind this rewriting is that we can now think of (|π_{i,e,1}|/ρ), …, (|π_{i,e,n}|/ρ) as coefficients in a subconvex combination of (−sign(π_{i,e,1})Λ_{t,1}), …, (−sign(π_{i,e,n})Λ_{t,n}), since |π_{i,e,j}|/ρ ≥ 0 for all j and, from the definition of ρ, Σ_j |π_{i,e,j}|/ρ ≤ 1. Plugging Eq. (5) into Eq. (4) and using the concavity of the function 1 − exp(·) in Eq. (4), we obtain

  (1/s_i) Σ_{k=1}^{s_i} Δ_{t,i,k} ≥ Σ_{e ∈ E_i} q_{t,i,e} ( 1 − exp( Σ_{j=1}^n (|π_{i,e,j}|/ρ)(−sign(π_{i,e,j})Λ_{t,j}) ) )
    ≥ Σ_{e ∈ E_i} q_{t,i,e} Σ_{j=1}^n (|π_{i,e,j}|/ρ) ( 1 − exp( −sign(π_{i,e,j})Λ_{t,j} ) ).

Finally, we sum both sides of the above over all of S and plug in W⁺, W⁻ and Λ to get

  L(f_t, S, A) − L(f_{t+1}, S, A) = Σ_{i=1}^m (1/s_i) Σ_{k=1}^{s_i} Δ_{t,i,k}
    ≥ (1/ρ) Σ_{i=1}^m Σ_{e ∈ E_i} Σ_{j=1}^n q_{t,i,e} |π_{i,e,j}| ( 1 − exp( −sign(π_{i,e,j})Λ_{t,j} ) )
    = (1/ρ) Σ_{j=1}^n [ W⁺_{t,j}( 1 − √(W⁻_{t,j}/W⁺_{t,j}) ) + W⁻_{t,j}( 1 − √(W⁺_{t,j}/W⁻_{t,j}) ) ]
    = (1/ρ) Σ_{j=1}^n ( √W⁺_{t,j} − √W⁻_{t,j} )².

Thm. 1 proves that the losses attained on each iteration form a monotonically non-increasing sequence of positive numbers, which must therefore converge. However, we are interested in proving a stronger claim, namely that the vector sequence (λ_t)_{t=1}^∞ converges to a globally optimal weight-vector λ*. Since the loss is a convex function, it suffices to show that the vector sequence converges to a stationary point of the loss. It is easily verified that the non-negative auxiliary function which bounds the decrease in loss equals zero only at stationary points of the loss. This fact implies that (λ_t)_{t=1}^∞ indeed converges to λ* if the set of all feasible values for λ is compact and the loss has a unique global minimum. Compactness of the feasible set and uniqueness of the optimum can be explicitly enforced by adding a form of natural regularization to the boosting algorithm. The specifics of this technique exceed the scope of this paper and are discussed in [5]. In all, the boosting algorithm of Fig.
3 converges to the globally optimal weight-vector λ*.

4 Experiments

To demonstrate our framework, we chose to learn a category ranking problem on a subset of the Reuters Corpus, Vol. 1 [14]. The full Reuters corpus is comprised of approximately 800,000 textual news articles, collected over a period of 12 months in 1996–1997. Most of the articles are labeled by one or more categories. For the purpose of these experiments, we limited ourselves to the subset of articles collected during January 1997: approximately 66,000 articles labeled by 103 different categories.

An interesting aspect of the Reuters corpus is that the categories are arranged in a hierarchy. The set of possible labels contains both general categories and more specific ones, where the specific categories refine the general categories. This concept is best explained with an example: three of the categories in the corpus are Economics, Government Finance and Government Borrowing. It would certainly be correct to categorize an article on government borrowing as either government finance or economics; however, these general categories are less specific and do not describe the article as well. Furthermore, misclassifying such an article as government revenue is by far better than misclassifying it as sports. In summary, the category hierarchy induces a preference over the set of labels. We exploit this property to generate supervision for the label ranking problem at hand.

            ε_{0−1}   ε_dis    ε_Dom   ε_dom
  0−1        0.63     0.068    0.42    0.12
  dis        0.73     0.063    0.51    0.14
  Dom        0.59     0.049    0.35    0.10
  dom        0.59     0.067    0.41    0.10

Figure 4: The test error averaged over 5-fold cross validation. The rows correspond to different optimization problems: minimizing ε_{0−1}, ε_dis, ε_Dom and ε_dom.
Errors are measured using all four error measures.

Formally, we view every category as a vertex in a rooted tree, where the tree root corresponds to a general abstract category that is relevant to all of the articles in the corpus, and every category is a specific instance of its parent in the tree. The labels associated with an article constitute a set of paths from the tree root to a set of leaves. The original corpus is somewhat inconsistent in that not all paths end in a leaf; some end in an inner vertex. To fix this inconsistency, we added a dummy child vertex to every inner vertex and diverted all paths that originally ended in this inner vertex to its new child. Our learning problem then becomes the problem of ranking leaves. The severity of wrongly categorizing an article by a leaf is proportional to the graph distance between this leaf and the closest correct leaf given in the corpus. The preference graph that encodes this preference is a multi-layer graph where the top layer contains all of the correct labels, the second layer contains all of their sibling vertices in the tree, and so on. Every vertex in the multi-layer preference graph has outgoing edges to all vertices in lower layers, but there are no edges between vertices in the same layer. For practical purposes, we conducted experiments using only 3-layer preference graphs, generated by collapsing all of the layers below the third into a single layer.

All of the experiments were carried out using 5-fold cross validation. The word counts for each article were used to construct base ranking functions in the following way: for every word w and every category y, let w(x_i) denote the number of appearances of w in the article x_i.
Then, define

  h_{w,y}(x_i, y_i) = log(w(x_i)) + 1 if w(x_i) > 0 and y_i = y, and 0 otherwise.   (6)

For each training set, we first applied a heuristic feature selection method common in boosting applications [10] to select some 3200 informative words. These words then define 103 · 3200 base ranking functions as shown in Eq. (6). Next, we ran our learning algorithm using each of the four graph decomposition procedures discussed above: zero-one, disagreement, domination and dominated. After learning each problem, we calculated all four error measures on the test data. The results are presented in Fig. 4. Two points are worth noting. First, these results are not comparable with previous results for multilabel problems using this corpus, since label ranking is a more difficult task. For instance, an average preference graph in the test data has 820 edges, and the error for such a graph equals zero only if every single edge agrees with the ranking function. Second, the experiments clearly indicate that the results obtained by minimizing the domination loss are better than those obtained with the other ranking losses, no matter which error measure is used for evaluation. In particular, employing the domination loss yields significantly better results than using the disagreement loss, which has been the commonly used decomposition method in categorization problems [7, 10, 6, 4].

5 Summary

We presented a general framework for label ranking problems by means of preference graphs and the graph decomposition procedure. This framework was shown to generalize other decision problems, most notably multilabel categorization. We then described and analyzed a boosting algorithm that works with any choice of graph decomposition. We are currently exporting the approach to learning in inner product spaces, where different graph decomposition procedures result in different bindings of slack variables.
Another interesting question is whether the graph decomposition approach can be combined with probabilistic models for orderings [9] to achieve algorithmic efficiency.

References

[1] M. Collins and N. Duffy. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In 40th Annual Meeting of the ACL, 2002.
[2] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 47(2/3):253–285, 2002.
[3] K. Crammer and Y. Singer. Pranking with ranking. NIPS 14, 2001.
[4] K. Crammer and Y. Singer. A new family of online algorithms for category ranking. Journal of Machine Learning Research, 3:1025–1058, 2003.
[5] O. Dekel, S. Shalev-Shwartz, and Y. Singer. Smooth epsilon-insensitive regression by loss symmetrization. COLT 16, 2003.
[6] A. Elisseeff and J. Weston. A kernel method for multi-labeled classification. NIPS 14, 2001.
[7] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In Machine Learning: Proc. of the Fifteenth International Conference, 1998.
[8] G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. NIPS 14, 2001.
[9] G. Lebanon and J. Lafferty. Conditional models on the ranking poset. NIPS 15, 2002.
[10] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3), 2000.
[11] A. Shashua and A. Levin. Ranking with large margin principle. NIPS 15, 2002.
[12] K. Toutanova and C. D. Manning. Feature selection for a rich HPSG grammar using decision trees. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL), 2002.
[13] The Penn Treebank Project. http://www.cis.upenn.edu/~treebank/.
[14] Reuters Corpus, Vol. 1.
http://about.reuters.com/researchandstandards/corpus/.
", "award": [], "sourceid": 2531, "authors": [{"given_name": "Ofer", "family_name": "Dekel", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}, {"given_name": "Christopher", "family_name": "Manning", "institution": null}]}