{"title": "Predtron: A Family of Online Algorithms for General Prediction Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1009, "page_last": 1017, "abstract": "Modern prediction problems arising in multilabel learning and learning to rank pose unique challenges to the classical theory of supervised learning. These problems have large prediction and label spaces of a combinatorial nature and involve sophisticated loss functions. We offer a general framework to derive mistake driven online algorithms and associated loss bounds. The key ingredients in our framework are a general loss function, a general vector space representation of predictions, and a notion of margin with respect to a general norm. Our general algorithm, Predtron, yields the perceptron algorithm and its variants when instantiated on classic problems such as binary classification, multiclass classification, ordinal regression, and multilabel classification. For multilabel ranking and subset ranking, we derive novel algorithms, notions of margins, and loss bounds. A simulation study confirms the behavior predicted by our bounds and demonstrates the flexibility of the design choices in our framework.", "full_text": "Predtron: A Family of Online Algorithms for General\n\nPrediction Problems\n\nPrateek Jain\n\nMicrosoft Research, INDIA\n\nprajain@microsoft.com\n\nNagarajan Natarajan\n\nUniversity of Texas at Austin, USA\n\nnaga86@cs.utexas.edu\n\nAmbuj Tewari\n\nUniversity of Michigan, Ann Arbor, USA\n\ntewaria@umich.edu\n\nAbstract\n\nModern prediction problems arising in multilabel learning and learning to rank\npose unique challenges to the classical theory of supervised learning. These prob-\nlems have large prediction and label spaces of a combinatorial nature and involve\nsophisticated loss functions. We offer a general framework to derive mistake\ndriven online algorithms and associated loss bounds. 
The key ingredients in our framework are a general loss function, a general vector space representation of predictions, and a notion of margin with respect to a general norm. Our general algorithm, Predtron, yields the perceptron algorithm and its variants when instantiated on classic problems such as binary classification, multiclass classification, ordinal regression, and multilabel classification. For multilabel ranking and subset ranking, we derive novel algorithms, notions of margins, and loss bounds. A simulation study confirms the behavior predicted by our bounds and demonstrates the flexibility of the design choices in our framework.\n\n1 Introduction\n\nClassical supervised learning problems, such as binary and multiclass classification, share a number of characteristics. The prediction space (the space in which the learner makes predictions) is often the same as the label space (the space from which the learner receives supervision). Because directly learning discrete-valued prediction functions is hard, one learns real-valued or vector-valued functions. These functions generate continuous predictions that are converted into discrete ones via simple mappings, e.g., via the 'sign' function (binary classification) or the 'argmax' function (multiclass classification). Also, the most commonly used loss function is simple, viz. the 0-1 loss. In contrast, modern prediction problems, such as multilabel learning, multilabel ranking, and subset ranking, do not share these characteristics. In order to handle these problems, we need a more general framework that offers more flexibility. First, it should allow for the possibility of having different label and prediction spaces. Second, it should allow practitioners to use creative, new ways to map continuous, vector-valued predictions to discrete ones. 
Third, it should permit the use of general loss functions.\n\nExtensions of the theory of classical supervised learning to modern prediction problems have begun. For example, the work on calibration dimension [1] can be viewed as extending one aspect of the theory, viz. that of calibrated surrogates and consistent algorithms based on convex optimization. This paper deals with the extension of another interesting part of classical supervised learning: mistake-driven algorithms such as perceptron (resp. winnow) and their analyses in terms of $\ell_2$ (resp. $\ell_1$) margins [2, Section 7.3].\n\nWe make a number of contributions. First, we provide a general framework (Section 2) whose ingredients include an arbitrary loss function and an arbitrary representation of discrete predictions in a continuous space. The framework is abstract enough to be of general applicability but it offers enough mathematical structure so that we can derive a general online algorithm, Predtron (Algorithm 1), along with an associated loss bound (Theorem 1) under an abstract margin condition (Section 2.2). Second, we show that our framework unifies several perceptron-like algorithms for classical problems such as binary classification, multiclass classification, ordinal regression, and multilabel classification (Section 3). Even for these classical problems, we get some new results, for example, when the loss function treats labels asymmetrically or when there exists a 'reject' option in classification. Third, we apply our framework to two modern prediction problems: subset ranking (Section 4) and multilabel ranking (Section 5). In both of these problems, the prediction space (rankings) is different from the supervision space (set of labels or vector of relevance scores). 
For these two problems, we propose interesting, novel notions of correct prediction with a margin, and derive mistake bounds under a loss derived from NDCG, a ranking measure that pays more attention to the performance at the top of a ranked list. Fourth, our techniques based on online convex optimization (OCO) can effortlessly incorporate notions of margins w.r.t. non-Euclidean norms, such as the $\ell_1$ norm, group norm, and trace norm. Such flexibility is important in modern prediction problems where the learned parameter can be a high dimensional vector or a large matrix with low group or trace norm. Finally, we test our theory in a simulation study (Section 6) dealing with the subset ranking problem, showing how our framework can be adapted to a specific prediction problem. We investigate different margin notions as we vary two key design choices in our abstract framework: the map used to convert continuous predictions into discrete ones, and the choice of the norm used in the definition of margin.\n\nRelated Work. Our general algorithm is related to the perceptron and online gradient descent algorithms used in structured prediction [3, 4]. But, to the best of our knowledge, our emphasis on keeping label and prediction spaces possibly distinct, our use of a general representation of predictions, and our investigation of generalized notions of margins are all novel. The use of simplex coding in multiclass problems [5] inspired the use of maximum similarity/minimum distance decoding to obtain discrete predictions from continuous ones. Our proofs use results about Online Gradient Descent and Online Mirror Descent from the Online Convex Optimization literature [6].\n\n2 Framework and Main Result\n\nThe key ingredients in classic supervised learning are an input space, an output space, and a loss function. In this paper, the input space $X \subseteq R^p$ will always be some subset of a finite dimensional Euclidean space. 
Our algorithms maintain prediction functions as a linear combination of the seen inputs. As a result, they easily kernelize, and the theory extends, in a straightforward way, to the case when the input space is a, possibly infinite dimensional, reproducing kernel Hilbert space (RKHS).\n\n2.1 Labels, Predictions, and Scores\n\nWe will distinguish between the label space and the prediction space. The former is the space where the training labels come from whereas the latter is the space where the learning algorithm has to make predictions in. Both spaces will be assumed to be finite. Therefore, without any loss of generality, we can identify the label space with $[\ell] = \{1, \ldots, \ell\}$ and the prediction space with $[k]$, where $\ell, k$ are positive, but perhaps very large, integers. A given loss function $L : [k] \times [\ell] \to R_+$ maps a prediction $\sigma \in [k]$ and a label $y \in [\ell]$ to a non-negative loss $L(\sigma, y)$. The loss $L$ can equivalently be thought of as a $k \times \ell$ matrix with loss values as entries. Define the set of correct predictions for a label $y$ as $\Sigma_y = \{\sigma \in [k] : L(\sigma, y) = 0\}$. We assume that, for every label $y$, the set $\Sigma_y$ is non-empty. That is, every column of the loss matrix has a zero entry. Also, let $c_L = \min_{L(\sigma, y) > 0} L(\sigma, y)$ and $C_L = \max_{\sigma, y} L(\sigma, y)$ be the minimum (non-zero) and maximum entries in the loss matrix.\n\nIn an online setting, the learner will see a stream of examples $(X_\tau, Y_\tau) \in X \times [\ell]$. The learner will predict scores using a linear predictor $W \in R^{d \times p}$. However, the predicted scores $W X_\tau$ will be in $R^d$, not in the prediction space $[k]$. So, we need a function $\mathrm{pred} : R^d \to [k]$ to convert scores into actual predictions. We will assume that there is a unique representation $\mathrm{rep}(\sigma) \in R^d$ of each prediction such that $\|\mathrm{rep}(\sigma)\|_2 = 1$ for all $\sigma$. 
Given this, a natural transformation of scores into predictions is given by the following maximum similarity decoding:\n\n$\mathrm{pred}(t) \in \mathrm{argmax}_{\sigma \in [k]} \langle \mathrm{rep}(\sigma), t \rangle$,   (1)\n\nwhere ties in the "argmax" can be broken arbitrarily. There are some nice consequences of the definition of pred above. First, because $\|\mathrm{rep}(\sigma)\|_2 = 1$, maximum similarity decoding is equivalent to nearest neighbor decoding: $\mathrm{pred}(t) \in \mathrm{argmin}_\sigma \|\mathrm{rep}(\sigma) - t\|_2$. Second, we have a homogeneity property: $\mathrm{pred}(ct) = \mathrm{pred}(t)$ if $c > 0$. Third, rep serves as an "inverse" of pred in the following sense. We have $\mathrm{pred}(\mathrm{rep}(\sigma)) = \sigma$ for all $\sigma$. Moreover, $\mathrm{rep}(\mathrm{pred}(t))$ is more similar to $t$ than the representation of any other prediction $\sigma$:\n\n$\forall t \in R^d, \sigma \in [k]: \langle \mathrm{rep}(\mathrm{pred}(t)), t \rangle \geq \langle \mathrm{rep}(\sigma), t \rangle$.\n\nIn view of these facts, we will use $\mathrm{pred}^{-1}(\sigma)$ and $\mathrm{rep}(\sigma)$ interchangeably. Using pred, the loss function $L$ can be extended to a function defined on $R^d \times [k]$ as:\n\n$L(t, y) = L(\mathrm{pred}(t), y)$.\n\nWith a little abuse of notation, we will continue to denote this new function also by $L$.\n\n2.2 Margins\n\nWe say that a score $t$ is compatible with a label $y$ if the set of $\sigma$'s that achieve the maximum in the definition (1) of pred is exactly $\Sigma_y$. That is, $\mathrm{argmax}_{\sigma \in [k]} \langle \mathrm{pred}^{-1}(\sigma), t \rangle = \Sigma_y$. Hence, for any $\sigma_y \in \Sigma_y$, $\sigma \notin \Sigma_y$, we have $\langle \mathrm{pred}^{-1}(\sigma_y), t \rangle > \langle \mathrm{pred}^{-1}(\sigma), t \rangle$. The notion of margin makes this requirement stronger. We say that a score $t$ has a margin $\gamma > 0$ on label $y$ iff $t$ is compatible with $y$ and\n\n$\forall \sigma_y \in \Sigma_y, \sigma \notin \Sigma_y: \langle \mathrm{pred}^{-1}(\sigma_y), t \rangle \geq \langle \mathrm{pred}^{-1}(\sigma), t \rangle + \gamma$.\n\nNote that margin scales with $t$: if $t$ has margin $\gamma$ on $y$ then $ct$ has margin $c\gamma$ on $y$ for any positive $c$. If we are using linear predictions $t = W X$, we say that $W$ has margin $\gamma$ on $(X, y)$ iff $t = W X$ has margin $\gamma$ on $y$. We say that $W$ has margin $\gamma$ on a dataset $(X_1, y_1), \ldots, (X_n, y_n)$ iff $W$ has margin $\gamma$ on $(X_\tau, y_\tau)$ for all $\tau \in [n]$. Finally, a dataset (X_1, y_1), . . .
, (X_n, y_n) is said to be linearly separable with margin $\gamma$ if there is a unit norm(^1) $W^\star$ such that $W^\star$ has margin $\gamma$ on $(X_1, y_1), \ldots, (X_n, y_n)$.\n\n2.3 Algorithm\n\nJust like the classic perceptron algorithm, our generalized perceptron algorithm (Algorithm 1) is mistake driven. That is, it only updates on a round when a mistake, i.e., a non-zero loss, is incurred. On a mistake round, it makes a rank-one update of the form $W_{\tau+1} = W_\tau - g_\tau \cdot X_\tau^\top$ where $g_\tau \in R^d$, $X_\tau \in R^p$. Therefore, $W_\tau$ always has a representation of the form $\sum_i g_i X_i^\top$. The prediction on a fresh input $X$ is given by $\sum_i g_i \langle X_i, X \rangle$, which means the algorithm, just like the original perceptron, can be kernelized.\n\nWe will give a loss bound for the algorithm using tools from Online Convex Optimization (OCO). Define the function $\phi : R^d \times [\ell] \to R$ as\n\n$\phi(t, y) = \max_{\sigma \in [k]} L(\sigma, y) - \langle \mathrm{pred}^{-1}(\sigma_y) - \mathrm{pred}^{-1}(\sigma), t \rangle$   (2)\n\nwhere $\sigma_y \in \Sigma_y$ is an arbitrary member of $\Sigma_y$. For any $y$, $\phi(\cdot, y)$ is a point-wise maximum of linear functions and hence convex. Also, $\phi$ is non-negative: choose $\sigma = \sigma_y$ to lower bound the maximum. The inner product part vanishes and the loss $L(\sigma_y, y)$ vanishes too because $\sigma_y \in \Sigma_y$. Given the definition of $\phi$, Algorithm 1 can be described succinctly as follows. At round $\tau$, if $L(W_\tau X_\tau, Y_\tau) > 0$, then $W_{\tau+1} = W_\tau - \eta \nabla_W \phi(W_\tau X_\tau, Y_\tau)$; otherwise $W_{\tau+1} = W_\tau$.\n\n(^1) Here, we mean that the Frobenius norm $\|W^\star\|_F$ equals 1. Of course, the notion of margin can be generalized to any norm including the entry-based $\ell_1$ norm $\|W\|_1$ and the spectrum-based $\ell_1$ norm $\|W\|_{S(1)}$ (also called the nuclear or trace norm). See Appendix B.2.\n\nAlgorithm 1 Predtron: Extension of the Perceptron Algorithm to General Prediction Problems\n1: $W_1 \leftarrow 0$\n2: for $\tau = 1, 2, \ldots$
 do\n3:   Receive $X_\tau \in R^p$\n4:   Predict $\sigma_\tau = \mathrm{pred}(W_\tau X_\tau) \in [k]$\n5:   Receive label $y_\tau \in [\ell]$\n6:   if $L(\sigma_\tau, y_\tau) > 0$ then\n7:     $(t, y) = (W_\tau X_\tau, y_\tau)$\n8:     $\tilde\sigma_\tau = \mathrm{argmax}_{\sigma \in [k]} L(\sigma, y) - \langle \mathrm{pred}^{-1}(\sigma_y) - \mathrm{pred}^{-1}(\sigma), t \rangle \in [k]$\n9:     $\nabla_\tau = (\mathrm{pred}^{-1}(\tilde\sigma_\tau) - \mathrm{pred}^{-1}(\sigma_y)) \cdot X_\tau^\top \in R^{d \times p}$\n10:    $W_{\tau+1} = W_\tau - \eta \nabla_\tau$\n11:  else\n12:    $W_{\tau+1} = W_\tau$\n13:  end if\n14: end for\n\nTheorem 1. Suppose the dataset $(X_1, y_1), \ldots, (X_n, y_n)$ is linearly separable with margin $\gamma$. Then the sequence $W_\tau$ generated by Algorithm 1 with $\eta = c_L/(4R^2)$ satisfies the loss bound\n\n$\sum_{\tau=1}^n L(W_\tau X_\tau, y_\tau) \leq \frac{4 R^2 C_L^2}{c_L \gamma^2}$\n\nwhere $\|X_\tau\|_2 \leq R$ for all $\tau$.\n\nNote that the bound above assumes perfect linear separability. However, just like the classic perceptron, the bound will degrade gracefully when the best linear predictor does not have enough margin on the dataset.\n\nThe Predtron algorithm has some interesting variants, two of which we consider in the appendix. A loss driven version, Predtron.LD, enjoys a loss bound that gets rid of the $C_L/c_L$ factor in the bound above. A version, Predtron.Link, that uses link functions to deal with margins defined with respect to non-Euclidean norms is also considered.\n\n3 Relationship to Existing Results\n\nIt is useful to discuss a few concrete applications of the abstract framework introduced in the last section. Several existing loss bounds can be readily derived by applying our bound for the generalized perceptron algorithm in Theorem 1. In some cases, our framework yields a different algorithm than existing counterparts, yet admitting identical loss bounds, up to constants.\n\nBinary Classification. We begin with the classical perceptron algorithm for binary classification (i.e., $\ell = 2$) [7]: $L_{0\text{-}1}(\sigma, y) = 1$ if $\sigma \neq y$ and 0 otherwise. 
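To make Algorithm 1 concrete, here is a minimal Python sketch of one round (an illustration under the notation above, not the authors' code; `reps` is an assumed $k \times d$ array holding $\mathrm{rep}(\sigma)$ row-wise, and `loss` the $k \times \ell$ loss matrix, so it only applies when $k$ is small enough to enumerate):

```python
import numpy as np

def predtron_round(W, x, y, reps, loss, eta):
    """One round of a Predtron-style update (sketch).

    W:    (d, p) weight matrix      reps: (k, d), row sigma = rep(sigma)
    x:    (p,) input                loss: (k, ell) matrix L(sigma, y)
    """
    t = W @ x                                      # score vector in R^d
    sigma = int(np.argmax(reps @ t))               # maximum similarity decoding
    if loss[sigma, y] > 0:                         # mistake driven: update only on loss
        sigma_y = int(np.argmax(loss[:, y] == 0))  # some correct prediction in Sigma_y
        # argmax of the surrogate: L(sigma, y) - <rep(sigma_y) - rep(sigma), t>
        sigma_t = int(np.argmax(loss[:, y] - (reps[sigma_y] - reps) @ t))
        grad = np.outer(reps[sigma_t] - reps[sigma_y], x)  # rank-one gradient
        W = W - eta * grad
    return W, sigma
```

On the binary instance just defined (`reps = [[+1], [-1]]`, 0-1 loss), a mistake yields the rank-one correction $W \leftarrow W + 2\eta\,\mathrm{rep}(\sigma_y) X^\top$, i.e., a rescaled perceptron update.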
Letting $\mathrm{rep}(\sigma)$ be $+1$ for the positive class and $-1$ for the negative class, with predictor vector $W_\tau \in R^{1 \times p}$, and thus $\mathrm{pred}(t) = \mathrm{sign}(t)$, Algorithm 1 reduces to the original perceptron algorithm; Theorem 1 yields an identical mistake bound on a linearly separable dataset with margin (if the classical margin is $\gamma$, ours works out to be $2\gamma$), i.e., $\sum_{\tau=1}^n L_{0\text{-}1}(W_\tau X_\tau, y_\tau) \leq R^2/\gamma^2$. We can also easily incorporate asymmetric losses. Let $L_\alpha(\sigma, y) = \alpha_y$ if $\sigma \neq y$ and 0 otherwise. We then have the following result.\n\nCorollary 2. Consider the perceptron with weighted loss $L_\alpha$. Assume $\alpha_1 \geq \alpha_2$ without loss of generality. Then the sequence $W_\tau$ generated by Algorithm 1 satisfies the weighted mistake bound\n\n$\sum_{\tau=1}^n L_\alpha(W_\tau X_\tau, y_\tau) \leq \frac{4 R^2 \alpha_1^2}{\alpha_2 \gamma^2}$.\n\nWe are not aware of such results for weighted loss. Previous work [8] studies perceptrons with uneven margins, and the loss bound there only implies a bound on the unweighted loss $\sum_{\tau=1}^n L_{0\text{-}1}(t_\tau, y_\tau)$. In a technical note, Rätsch and Kivinen [9] provide a mistake bound of the form (without proof) $\sum_{\tau=1}^n L_\alpha(W_\tau X_\tau, y_\tau) \leq \frac{R^2}{4\gamma^2}$, but for the specific choice of weights $\alpha_1 = a^2$ and $\alpha_2 = (1-a)^2$ for any $a \in [0, 1]$.\n\nAnother interesting extension is obtained by allowing the predictions to have a REJECT option. Define $L_{\mathrm{REJ}}(\mathrm{REJECT}, y) = \lambda_y$ and $L_{\mathrm{REJ}}(\sigma, y) = L_{0\text{-}1}(\sigma, y)$ otherwise. Assume $1 \geq \lambda_1 \geq \lambda_2 > 0$ without loss of generality. Choosing the standard basis vectors in $R^2$ to be $\mathrm{rep}(\sigma)$ for the positive and the negative classes, and $\mathrm{rep}(\mathrm{REJECT}) = \frac{1}{\sqrt{2}} \sum_{\sigma \in \{1, 2\}} \mathrm{rep}(\sigma)$, we obtain $\sum_{\tau=1}^n L_{\mathrm{REJ}}(W_\tau X_\tau, y_\tau) \leq \frac{4 R^2 \lambda_1^2}{\lambda_2 \gamma^2}$ (see Appendix C.1).\n\nMulticlass Classification. 
Each instance is assigned exactly one of $m$ classes (i.e., $\ell = m$). Extending binary classification, we choose the standard basis vectors in $R^m$ to be $\mathrm{rep}(\sigma)$ for the $m$ classes. The learner predicts a score $t \in R^m$ using the predictor $W \in R^{m \times p}$. So, $\mathrm{pred}(t) = \mathrm{argmax}_i t_i$. Let $w_j$ denote the $j$th row of $W$ (corresponding to label $j$). The definition of margin becomes\n\n$\langle w_y, X \rangle \geq \max_{j \neq y} \langle w_j, X \rangle + \gamma$,\n\nwhich is identical to the multiclass margin studied earlier [10]. For the multiclass 0-1 loss $L_{0\text{-}1}$, we recover their bound, up to constants(^2). Moreover, our surrogate for $L_{0\text{-}1}$,\n\n$\phi(t, y) = \max\left(0,\ 1 + \max_{\sigma \neq y} t_\sigma - t_y\right)$,\n\nmatches the multiclass extension of the hinge loss studied by [11]. Finally, note that it is straightforward to obtain loss bounds for the multiclass perceptron with a REJECT option by naturally extending the definitions of rep and $L_{\mathrm{REJ}}$ from the binary case.\n\nOrdinal Regression. The goal is to assign ordinal classes (such as ratings) to a set of objects $\{X_1, X_2, \ldots\}$ described by their features $X_i \in R^p$. In many cases, precise rating information may not be available, but only relative ranks; i.e., the observations consist of object-rank pairs $(X_\tau, y_\tau)$ where $y_\tau \in [\ell]$. $Y$ is totally ordered with the ">" relation, which in turn induces a partial ordering on the objects ($X_j$ is preferred to $X_{j'}$ if $y_j > y_{j'}$; $X_j$ and $X_{j'}$ are not comparable if $y_j = y_{j'}$). For the ranking loss $L(\sigma, y) = |\sigma - y|$, the PRank perceptron algorithm [12] enjoys the bound $\sum_{\tau=1}^n L(\sigma_\tau, y_\tau) \leq (\ell - 1)(R^2 + 1)/\tilde\gamma^2$, where $\tilde\gamma$ is a certain rank margin. By a reduction to multiclass classification with $\ell$ classes, Algorithm 1 achieves the loss bound $4(\ell - 1)^2 R^2/\gamma^2$ (albeit for a different margin $\gamma$).\n\nMultilabel Classification. This setting generalizes multiclass classification in that instances are assigned subsets of $m$ classes rather than unique classes, i.e., $\ell = 2^m$. 
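Returning briefly to the multiclass case, the surrogate $\phi$ above is trivial to evaluate; a minimal sketch (illustrative only, using the standard-basis representation so that (2) reduces to the multiclass hinge loss):

```python
import numpy as np

def multiclass_surrogate(t, y):
    """phi(t, y) = max(0, 1 + max_{sigma != y} t_sigma - t_y) (sketch).

    t: (m,) score vector, y: true class index."""
    others = np.delete(t, y)                 # scores of all classes except y
    return max(0.0, 1.0 + np.max(others) - t[y])
```

The surrogate is zero exactly when class $y$ beats every other class by at least the unit margin, mirroring the margin condition above.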
The loss function $L$ of interest may dictate the choice of rep and, in turn, pred. For example, consider the following subset losses that treat labels as well as predictions as subsets: (i) Subset 0-1 loss: $L_{\mathrm{IsErr}}(\sigma, y) = 1$ if $\sigma \neq y$ and 0 otherwise; (ii) Hamming loss: $L_{\mathrm{Ham}}(\sigma, y) = |\sigma \cup y| - |\sigma \cap y|$; and (iii) Error set size: $L_{\mathrm{ErrSetSize}}(\sigma, y) = |\{(r, s) \in y \times ([m] \setminus y) : r \notin \sigma, s \in \sigma\}|$. A natural choice of rep then is the (normalized) subset indicator vector in $\{+1, -1\}^d$, where $d = m = \log_2 \ell$, which can be expressed as $\mathrm{rep}(\sigma) = \frac{1}{\sqrt{m}}\left(\sum_{j \in \sigma} e_j - \sum_{j \notin \sigma} e_j\right)$ (where the $e_j$'s are the standard basis vectors in $R^m$). The learner predicts a score $t \in R^m$ using a matrix $W \in R^{m \times p}$. Note that $\mathrm{pred}(t) = \mathrm{sign}(t)$, where sign is applied component-wise. The number of predictions is $2^m$, but we show in Appendix C.2 that the surrogate (2) and its gradient can be efficiently computed for all of the above losses.\n\n(^2) The perceptron algorithm in [10] is based on a slightly different loss, defined as $L_{\mathrm{ErrSet}}(t, y) = 1$ if $|\{r \neq y : t_r \geq t_y\}| > 0$ and 0 otherwise (where $t = W X$). This loss upper bounds $L_{0\text{-}1}$ (because of the way ties are handled, there can be rounds when $L_{0\text{-}1}$ is 0, but $L_{\mathrm{ErrSet}}$ is 1).\n\n4 Subset Ranking\n\nIn subset ranking [13], the task is to learn to rank a number of documents in order of their relevance to a query. We will assume, for simplicity, that the number of documents per query is a constant that we denote by $m$. The input space is a subset of $R^{m \times p_0}$ that we can identify with $R^p$ for $p = m p_0$. Each row of an input matrix corresponds to a $p_0$-dimensional feature vector derived jointly using the query and one of the documents associated with it. The predictions are all $m!$ 
permutations of degree $m$. The most natural (but by no means the only) representation of permutations is to set $\mathrm{rep}(\sigma) = -\sigma/Z$, where $\sigma(i)$ is the position of document $i$ in the predicted ranking and the normalization $Z$ ensures that $\mathrm{rep}(\sigma)$ is a unit vector. Note that the dimension $d$ of this representation is equal to $m$. The minus sign in this representation ensures that $\mathrm{pred}(t)$ outputs a permutation that corresponds to sorting the entries of $t$ in decreasing order, a common convention in existing work. A more general representation is obtained by setting $\mathrm{rep}(\sigma) = f(\sigma)/Z$, where $f : R \to R$ is a strictly decreasing real valued function that is applied entry-wise to $\sigma$. The normalization $Z = \sqrt{\sum_{i=1}^m f^2(i)}$ ensures that $\|\mathrm{rep}(\sigma)\|_2 = 1$. To convert an input matrix $X \in R^p$ ($p = m p_0$) into a score vector $t \in R^m$, it seems that we need to learn a matrix $W \in R^{m \times m p_0}$. However, a natural permutation invariance requirement (if the documents associated with a query are presented in a permuted fashion, the output scores should also get permuted in the same way) reduces the dimensionality of $W$ to $p_0$ (see, e.g., [14] for more details). Thus, given a vector $w \in R^{p_0}$ we get the score vector as $t = X w$. The label space consists of relevance score vectors $y \in \{0, 1, \ldots, Y_{\max}\}^m$ where $Y_{\max}$ is typically between 1 and 4 (yielding 2 to 5 grades of relevance). Note that the prediction space (of size $k = m!$) is different from the label space (of size $\ell = (Y_{\max} + 1)^m$).\n\nA variety of loss functions have been used in subset ranking. For multigraded relevance judgments, a very popular choice is NDCG, which is defined as $\mathrm{NDCG}(\sigma, y) = \sum_{i=1}^m \frac{2^{y(i)} - 1}{\log_2(1 + \sigma(i))\, Z(y)}$, where $Z(y)$ is a normalization constant ensuring NDCG stays bounded by 1. To convert it into a loss we define $L_{\mathrm{NDCG}} = 1 - \mathrm{NDCG}$. Note that any permutation that sorts $y$ in decreasing order gets zero $L_{\mathrm{NDCG}}$. One might worry that the computation of the surrogate defined in (2) and its gradient might require an enumeration of $m!$ permutations. The next lemma allays such a concern.\n\nLemma 3. 
When $L = L_{\mathrm{NDCG}}$ and $\mathrm{rep}(\sigma)$ is chosen as above, the computation of the surrogate (2), as well as its gradient, can be reduced to solving a linear assignment problem and hence can be done in $O(m^3)$ time.\n\nWe now give a result explaining what it means for a score vector $t$ to have a margin on $y$ when we use a representation of the form described above. Without loss of generality, we may assume that $y$ is sorted in decreasing order of relevance judgments.\n\nLemma 4. Suppose $\mathrm{rep}(\sigma) = f(\sigma)/Z$ for a strictly decreasing function $f : R \to R$ and $Z = \sqrt{\sum_{i=1}^m f^2(i)}$. Let $y$ be a non-constant relevance judgment vector sorted in decreasing order. Suppose $i_1 < i_2 < \ldots < i_N$, $N \geq 1$, are the positions where the relevance drops by a grade or more (i.e., $y(i_j) < y(i_j - 1)$). Then $t$ has a margin $\gamma$ on $y$ iff $t$ is compatible with $y$ and, for $j \in [N]$,\n\n$t_{i_j - 1} \geq t_{i_j} + \frac{\gamma Z}{f(i_j - 1) - f(i_j)}$\n\nwhere we define $i_0 = 1$, $i_{N+1} = m + 1$ to handle boundary cases.\n\nNote that if we choose $f(i) = -i^\alpha$, $\alpha > 1$, then $f(i_j - 1) - f(i_j) = O(i_j^{\alpha - 1})$ for large $i_j$. In that case, the margin condition above requires less separation between documents with different relevance scores down the list (when viewed in decreasing order of relevance scores) than at the top of the list. We end this section with a loss bound for $L_{\mathrm{NDCG}}$ under a margin condition.\n\nCorollary 5. Suppose $L = L_{\mathrm{NDCG}}$ and $\mathrm{rep}(\sigma)$ is as in Lemma 4. 
Then, assuming the dataset is linearly separable with margin $\gamma$, the sequence generated by Algorithm 1 with line 9 replaced by\n\n$\nabla_\tau = X_\tau^\top (\mathrm{pred}^{-1}(\tilde\sigma_\tau) - \mathrm{pred}^{-1}(\sigma_y)) \in R^{p_0 \times 1}$\n\nsatisfies\n\n$\sum_{\tau=1}^n L_{\mathrm{NDCG}}(X_\tau w_\tau, y_\tau) \leq \frac{2^{Y_{\max}+3} \cdot m^2 \log_2^2(2m) \cdot R^2}{\gamma^2}$\n\nwhere $\|X_\tau\|_{\mathrm{op}} \leq R$.\n\nNote that the result above uses the standard $\ell_2$-norm based notion of margin. Imagine a subset ranking problem where only a small number of features are relevant. It is therefore natural to consider a notion of margin where the weight vector that ranks everything perfectly has low $\ell_1$ norm, instead of low $\ell_2$ norm. The $\ell_1$ margin also appears in the analysis of AdaBoost [2, Definition 6.2]. We can use a special case of a more general algorithm given in the appendix (Appendix B.2, Algorithm 3). Specifically, we replace line 10 with the step $w_{\tau+1} = (\nabla \psi)^{-1}(\nabla \psi(w_\tau) - \eta \nabla_\tau)$ where $\psi(w) = \frac{1}{2}\|w\|_r^2$. We set $r = \log(p_0)/(\log(p_0) - 1)$. The mapping $\nabla \psi$ and its inverse can both be easily computed (see, e.g., [6, p. 145]).\n\nCorollary 6. Suppose $L = L_{\mathrm{NDCG}}$ and $\mathrm{rep}(\sigma)$ is as in Lemma 4. Then, assuming the dataset is linearly separable with margin $\gamma$ by a unit $\ell_1$ norm $w^\star$ ($\|w^\star\|_1 = 1$), the sequence generated by Algorithm 3 with $\psi$ chosen as above (and line 9 modified as in Corollary 5) satisfies\n\n$\sum_{\tau=1}^n L_{\mathrm{NDCG}}(X_\tau w_\tau, y_\tau) \leq \frac{9 \cdot 2^{Y_{\max}+3} \cdot m^2 \log_2^2(2m) \cdot R^2 \cdot \log p_0}{\gamma^2}$\n\nwhere $\max_{j=1,\ldots,p_0} \|X_{\tau,j}\|_2 \leq R$ and $X_{\tau,j}$ denotes the $j$th column of $X_\tau$.\n\n5 Multilabel Ranking\n\nAs discussed in Section 3, in multilabel classification, both the prediction space and the label space are $\{0, 1\}^m$, with sizes $k = \ell = 2^m$. In multilabel ranking, however, the learner has to output rankings as predictions. So, as in the previous section, we have $k = m!$ since the prediction can be any one of the $m!$ permutations of the labels. 
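Since $L_{\mathrm{NDCG}}$ from Section 4 is reused in this section, a minimal sketch of its computation may be a useful reference (an illustration under the conventions above: `sigma[i]` is the 1-based position assigned to item `i`, $Z(y)$ is the ideal DCG, and `y` is assumed not identically zero so $Z(y) > 0$):

```python
import numpy as np

def ndcg_loss(sigma, y):
    """L_NDCG(sigma, y) = 1 - NDCG(sigma, y) (sketch).

    sigma: 1-based positions, sigma[i] = rank of item i
    y:     relevance grades in {0, ..., Y_max}, not all zero
    """
    gains = 2.0 ** np.asarray(y, dtype=float) - 1.0
    discounts = 1.0 / np.log2(1.0 + np.asarray(sigma, dtype=float))
    # Z(y): the best achievable DCG pairs largest gains with largest discounts
    z = np.sort(gains)[::-1] @ np.sort(discounts)[::-1]
    return 1.0 - (gains @ discounts) / z
```

Any permutation that sorts `y` in decreasing order attains loss 0, matching the remark after the definition of $L_{\mathrm{NDCG}}$.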
As before, we choose $\mathrm{rep}(\sigma) = f(\sigma)/Z$ and hence $d = m$. However, unlike the previous section, the input is no longer a matrix but a vector $X \in R^p$. A prediction score $t \in R^d$ is obtained as $W X$ where $W \in R^{m \times p}$. Note the contrast with the last section: there, inputs are matrices and a weight vector is learned; here, inputs are vectors and a weight matrix is learned. Since we output rankings, it is reasonable to use a loss that takes positions of labels into account. We can use $L = L_{\mathrm{NDCG}}$. Algorithm 1 now immediately applies. Lemma 3 already showed that $\phi$ is efficiently implementable. We have the following straightforward corollary.\n\nCorollary 7. Suppose $L = L_{\mathrm{NDCG}}$ and $\mathrm{rep}(\sigma)$ is as in Lemma 4. Then, assuming the dataset is linearly separable with margin $\gamma$, the sequence generated by Algorithm 1 satisfies\n\n$\sum_{\tau=1}^n L_{\mathrm{NDCG}}(W_\tau X_\tau, y_\tau) \leq \frac{2^{Y_{\max}+3} \cdot m^2 \log_2^2(2m) \cdot R^2}{\gamma^2}$\n\nwhere $\|X_\tau\|_2 \leq R$.\n\nThe bound above matches the corresponding bound, up to loss specific constants, for the multiclass multilabel perceptron (MMP) algorithm studied by [15]. The definition of margin by [15] for MMP is different from ours since their algorithms are designed specifically for multilabel ranking. Just like them, we can also consider other losses, e.g., precision at the top $K$ positions. Another perceptron style algorithm for multilabel ranking adopts a pairwise approach of comparing two labels at a time [16]. However, no loss bounds are derived there.\n\nThe result above uses the standard Frobenius norm based margin. Imagine a multilabel problem where only a small number of features are relevant across all labels. Then, it is natural to consider a notion of margin where the matrix that ranks everything perfectly has low group $(2, 1)$ norm, instead of low Frobenius norm, where $\|W\|_{2,1} = \sum_{j=1}^p \|W_j\|_2$ ($W_j$ denotes a column of $W$). We again use a special case of Algorithm 3 (Appendix B.2). 
Specifically, we replace line 10 with the step $W_{\tau+1} = (\nabla \psi)^{-1}(\nabla \psi(W_\tau) - \eta \nabla_\tau)$ where $\psi(W) = \frac{1}{2}\|W\|_{2,r}^2$. Recall that the group $(2, r)$-norm is the $\ell_r$ norm of the $\ell_2$ norms of the columns of $W$. We set $r = \log(p)/(\log(p) - 1)$. The mapping $\nabla \psi$ and its inverse can both be easily computed (see, e.g., [17, Eq. (2)]).\n\nCorollary 8. Suppose $L = L_{\mathrm{NDCG}}$ and $\mathrm{rep}(\sigma)$ is as in Lemma 4. Then, assuming the dataset is linearly separable with margin $\gamma$ by a unit group norm $W^\star$ ($\|W^\star\|_{2,1} = 1$), the sequence generated by Algorithm 3 with $\psi$ chosen as above satisfies\n\n$\sum_{\tau=1}^n L_{\mathrm{NDCG}}(W_\tau X_\tau, y_\tau) \leq \frac{9 \cdot 2^{Y_{\max}+3} \cdot m^2 \log_2^2(2m) \cdot R^2 \cdot \log p}{\gamma^2}$\n\nwhere $\|X_\tau\|_\infty \leq R$.\n\n[Figure 1 appears here with three panels: (a) test-set loss (1-NDCG) vs. number of training points n (m=20, p_0=30); (b) test-set loss vs. number of documents per instance m (n=30, p_0=30); (c) Predtron-L1 vs. Predtron-L2 loss as data dimensionality p_0 varies (s=50, n=50, m=20).]\n\nFigure 1: Subset Ranking: NDCG loss for different $\mathrm{pred}^{-1}$ choices with varying $n$ (plot (a)) and $m$ (plot (b)). As predicted by Lemmas 4 and 5, $\mathrm{pred}^{-1}(\sigma(i)) = -i^{1.1}$ is more accurate than $1/i$. (c): $L_1$ vs $L_2$ margin. 
$L_{\mathrm{NDCG}}$ for two different Predtron algorithms based on the $L_1$ and $L_2$ margins. Data is generated using the $L_1$ margin notion but with varying sparsity of the optimal scoring function $w^\star$.\n\n6 Experiments\n\nWe now present simulation results to demonstrate the application of our proposed Predtron framework to subset ranking. We also demonstrate that empirical results match the trend predicted by our error bounds, hence hinting at tightness of our (upper) bounds. Due to lack of space, we focus only on the subset ranking problem. Also, we would like to stress that we do not claim that the basic version of Predtron itself (with $\eta = 1$) provides a state-of-the-art ranker. Instead, we wish to demonstrate the applicability and flexibility of our framework in a controlled setting.\n\nWe generated $n$ data points $X_\tau \in R^{m \times p_0}$ using a Gaussian distribution with independent rows. The $i$th row of $X_\tau$ represents a document and is sampled from a spherical Gaussian centered at $\mu_i$. We selected a $w^\star \in R^{p_0}$ and also a set of thresholds $[\zeta_1, \ldots, \zeta_{m+1}]$ to generate relevance scores; we set $\zeta_j = 1/j$ for all $2 \leq j \leq m$, and $\zeta_1 = +\infty$ and $\zeta_{m+1} = -\infty$. We set the relevance score $y_\tau(i)$ of the $i$th document in the $\tau$th document-set as $y_\tau(i) = m - j$ iff $\zeta_{j+1} \leq \langle X_\tau(i), w^\star \rangle \leq \zeta_j$. That is, $y_\tau(i) \in \{0, 1, \ldots, m-1\}$.\n\nWe measure the performance of a given method using the NDCG loss $L_{\mathrm{NDCG}}$ defined in Section 4. Note that $L_{\mathrm{NDCG}}$ is less sensitive to errors in predictions for the less relevant documents in the list. On the other hand, our selection of the thresholds $\zeta_i$ implies that the gap between scores of lower-ranked documents is very small compared to the higher-ranked ones, and hence the chance of making mistakes lower down the list is higher.\n\nFigure 1 (a) shows $L_{\mathrm{NDCG}}$ (on a test set) for our Predtron algorithm (see Section 4) but with different $\mathrm{pred}^{-1}$ functions. 
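The synthetic data generation just described can be sketched as follows (an illustration; the choice of random generator and the Gaussian centers `mu` beyond the description above are assumptions):

```python
import numpy as np

def relevance_grades(scores, m):
    """y(i) = m - j where zeta_{j+1} <= <X(i), w*> <= zeta_j,
    with zeta_1 = +inf, zeta_j = 1/j (2 <= j <= m), zeta_{m+1} = -inf."""
    zeta = np.concatenate(([np.inf], 1.0 / np.arange(2, m + 1), [-np.inf]))
    j = np.minimum((zeta[:, None] >= scores[None, :]).sum(axis=0), m)
    return m - j                              # grades in {0, ..., m-1}

def make_subset_ranking_data(n, m, p0, rng):
    """Sample n instances of m documents each, as in the simulation (sketch)."""
    w_star = rng.standard_normal(p0)          # optimal scoring function
    mu = rng.standard_normal((m, p0))         # per-document Gaussian centers
    data = []
    for _ in range(n):
        X = mu + rng.standard_normal((m, p0)) # row i ~ N(mu_i, I)
        y = relevance_grades(X @ w_star, m)
        data.append((X, y))
    return data, w_star
```

Because the thresholds $\zeta_j = 1/j$ bunch together as $j$ grows, the generated scores of low-relevance documents are closely spaced, which is exactly what makes mistakes lower down the list more likely.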
For pred^{−1}(σ(i)) = f₂(i) = −i^{1.1}, the gap f₂(i−1) − f₂(i) is monotonically increasing with i. On the other hand, for pred^{−1}(σ(i)) = f₁(i) = 1/i, the gap f₁(i−1) − f₁(i) is monotonically decreasing with i. Lemma 4 shows that the mistake bound (in terms of L_NDCG) of Predtron is better when the pred^{−1} function is selected to be f₂(i) = −i^{1.1} (as well as f₃(i) = −i²) instead of f₁(i) = 1/i. Clearly, Figure 1 (a) empirically validates this mistake bound, with L_NDCG going to almost 0 for f₂ and f₃ with just 60 training points, while f₁-based Predtron has large loss even with n = 100 training points.

Next, we fix the number of training instances to be n = 30 and vary the number of documents m. As the gap between the ζ_i decreases for larger i, increasing m implies reducing the margin. Naturally, Predtron with the above-mentioned inverse functions has monotonically increasing loss (see Figure 1 (b)). However, f₂ and f₃ provide zero-loss solutions for larger m when compared to f₁.

Finally, we conduct an experiment to show that, by selecting an appropriate notion of margin, Predtron can obtain more accurate solutions. To this end, we generate data from [−1, 1]^{p₀} and select a sparse w*. Predtron with the ℓ₂-margin notion, i.e., standard gradient descent, has a √p₀ dependency in the error bounds, while the ℓ₁-margin (see Corollary 6) has only an s log(p₀) dependence. This error dependency is also revealed by Figure 1 (c), where increasing p₀ with fixed s leads to a minor increase in the loss for ℓ₁-based Predtron but leads to significantly higher loss for ℓ₂-based Predtron.

Acknowledgments

A. Tewari acknowledges the support of NSF under grant IIS-1319810.

References

[1] Harish G. Ramaswamy and Shivani Agarwal. Classification calibration dimension for general multiclass losses. In Advances in Neural Information Processing Systems, pages 2078–2086, 2012.

[2] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning.
MIT Press, 2012.

[3] Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, pages 1–8, 2002.

[4] Nathan D. Ratliff, J. Andrew Bagnell, and Martin Zinkevich. (Approximate) subgradient methods for structured prediction. In International Conference on Artificial Intelligence and Statistics, pages 380–387, 2007.

[5] Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, and Jean-Jacques Slotine. Multiclass learning with simplex coding. In Advances in Neural Information Processing Systems, pages 2789–2797, 2012.

[6] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011.

[7] Albert B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615–622, 1962.

[8] Yaoyong Li, Hugo Zaragoza, Ralf Herbrich, John Shawe-Taylor, and Jaz S. Kandola. The perceptron algorithm with uneven margins. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 379–386, 2002.

[9] Gunnar Rätsch and Jyrki Kivinen. Extended classification with modified Perceptron, 2002. Presented at the NIPS 2002 Workshop: Beyond Classification and Regression: Learning Rankings, Preferences, Equality Predicates, and Other Structures; abstract available at http://www.cs.cornell.edu/people/tj/ranklearn/raetsch_kivinen.pdf.

[10] Koby Crammer and Yoram Singer. Ultraconservative online algorithms for multiclass problems. The Journal of Machine Learning Research, 3:951–991, 2003.

[11] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines.
The Journal of Machine Learning Research, 2:265–292, 2002.

[12] Koby Crammer and Yoram Singer. Pranking with ranking. Advances in Neural Information Processing Systems, 14:641–647, 2002.

[13] David Cossock and Tong Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11):5140–5154, 2008.

[14] Ambuj Tewari and Sougata Chaudhuri. Generalization error bounds for learning to rank: Does the length of document lists matter? In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Workshop and Conference Proceedings, 2015.

[15] Koby Crammer and Yoram Singer. A family of additive online algorithms for category ranking. The Journal of Machine Learning Research, 3:1025–1058, 2003.

[16] Eneldo Loza Mencía and Johannes Fürnkranz. Pairwise learning of multilabel classifications with perceptrons. In IEEE International Joint Conference on Neural Networks, pages 2899–2906, 2008.

[17] Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. Journal of Machine Learning Research, 13:1865–1890, 2012.