{"title": "Adversarial Surrogate Losses for Ordinal Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 563, "page_last": 573, "abstract": "Ordinal regression seeks class label predictions when the penalty incurred for mistakes increases according to an ordering over the labels. The absolute error is a canonical example. Many existing methods for this task reduce to binary classification problems and employ surrogate losses, such as the hinge loss. We instead derive uniquely defined surrogate ordinal regression loss functions by seeking the predictor that is robust to the worst-case approximations of training data labels, subject to matching certain provided training data statistics. We demonstrate the advantages of our approach over other surrogate losses based on hinge loss approximations using UCI ordinal prediction tasks.", "full_text": "Adversarial Surrogate Losses for Ordinal Regression\n\nRizal Fathony\n\nMohammad Bashiri\n\nBrian D. Ziebart\n\nDepartment of Computer Science\nUniversity of Illinois at Chicago\n\nChicago, IL 60607\n\n{rfatho2, mbashi4, bziebart}@uic.edu\n\nAbstract\n\nOrdinal regression seeks class label predictions when the penalty incurred for\nmistakes increases according to an ordering over the labels. The absolute error\nis a canonical example. Many existing methods for this task reduce to binary\nclassi\ufb01cation problems and employ surrogate losses, such as the hinge loss. We\ninstead derive uniquely de\ufb01ned surrogate ordinal regression loss functions by\nseeking the predictor that is robust to the worst-case approximations of training\ndata labels, subject to matching certain provided training data statistics. 
We demonstrate the advantages of our approach over other surrogate losses based on hinge loss approximations using UCI ordinal prediction tasks.

1 Introduction

For many classification tasks, the discrete class labels being predicted have an inherent order (e.g., poor, fair, good, very good, and excellent labels). Confusing two classes that are distant from one another (e.g., poor instead of excellent) is more detrimental than confusing two classes that are nearby. The absolute error, |ŷ − y|, between the label prediction (ŷ ∈ Y) and the actual label (y ∈ Y) is a canonical ordinal regression loss function. The ordinal regression task seeks class label predictions for new datapoints that minimize losses of this kind.

Many prevalent methods reduce the ordinal regression task to subtasks solved using existing supervised learning techniques. Some view the task from the regression perspective and learn both a linear regression function and a set of thresholds that define class boundaries [1–5]. Other methods take a classification perspective and use tools from cost-sensitive classification [6–8]. However, since the absolute error of a predictor on training data is typically a non-convex (and non-continuous) function of the predictor's parameters for each of these formulations, surrogate losses that approximate the absolute error must be optimized instead. Under both perspectives, surrogate losses for ordinal regression are constructed by transforming the surrogate losses for binary zero-one loss problems—such as the hinge loss, the logistic loss, and the exponential loss—to take into account the different penalties of the ordinal regression problem.
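For reference, the three base binary surrogates mentioned above can be written in a few lines. This is a minimal sketch; the function names are ours, and δ(z) is written as a function of the signed margin z.

```python
import math

# Base binary surrogates delta(z) for the zero-one loss, as functions of the
# signed margin z.  Threshold-based ordinal regression methods transform these
# to account for the ordered penalties (function names are illustrative).
def hinge(z):
    return max(0.0, 1.0 - z)

def logistic(z):
    return math.log(1.0 + math.exp(-z))

def exponential(z):
    return math.exp(-z)
```

Each of these is a convex upper bound on the zero-one loss step function, which is what makes the transformed ordinal surrogates tractable to optimize.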
Empirical evaluations have compared the appropriateness of different surrogate losses, but these still leave open the possibility of undiscovered surrogates that align better with the ordinal regression loss.

To address these limitations, we seek the most robust [9] ordinal regression predictions by focusing on the following adversarial formulation of the ordinal regression task: what predictor best minimizes absolute error in the worst case given partial knowledge of the conditional label distribution? We answer this question by considering the Nash equilibrium for a game defined by combining the loss function with Lagrangian potential functions [10]. We derive a surrogate loss function for empirical risk minimization that realizes this same adversarial predictor. We show that different types of available knowledge about the conditional label distribution lead to thresholded regression-based predictions or classification-based predictions. In both cases, the surrogate loss is novel compared to existing surrogate losses. We also show that our surrogate losses enjoy Fisher consistency, a desirable theoretical property guaranteeing that minimizing the surrogate loss produces Bayes optimal decisions for the original loss in the limit. We develop two different approaches for optimizing the loss: a stochastic optimization of the primal objective and a quadratic program formulation of the dual objective. The second approach enables us to efficiently employ the kernel trick to provide a richer feature representation without an overly burdensome time complexity.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
We demonstrate the benefits of our adversarial formulation over previous ordinal regression methods based on hinge loss for a range of prediction tasks using UCI datasets.

2 Background and Related Work

2.1 Ordinal Regression Problems

Ordinal regression is a discrete label prediction problem characterized by an ordered penalty for making mistakes: loss(ŷ1, y) < loss(ŷ2, y) if y < ŷ1 < ŷ2 or y > ŷ1 > ŷ2. Though many loss functions possess this property, the absolute error |ŷ − y| is the most widely studied. We similarly restrict our consideration to this loss function in this paper. The full loss matrix L for absolute error with four labels is shown in Table 1.

Table 1: Ordinal regression loss matrix.

    [ 0  1  2  3 ]
    [ 1  0  1  2 ]
    [ 2  1  0  1 ]
    [ 3  2  1  0 ]

The expected loss incurred using a probabilistic predictor P̂(ŷ|x) evaluated on the true data distribution P(x, y) is:
$$\mathbb{E}_{X,Y \sim P;\ \hat{Y}|X \sim \hat{P}}\big[L_{\hat{Y},Y}\big] = \sum_{x,y,\hat{y}} P(x,y)\,\hat{P}(\hat{y}|x)\,L_{\hat{y},y}.$$
The supervised learning objective for this problem setting is to construct a probabilistic predictor P̂(ŷ|x) in a way that minimizes this expected loss using training samples distributed according to the empirical distribution P̃(x, y), which are drawn from the unknown true data generating distribution, P(x, y).

A naïve ordinal regression approach relaxes the task to a continuous prediction problem, minimizes the least absolute deviation [11], and then rounds predictions to the nearest integral label [12]. More sophisticated methods range from using a cumulative link model [13] that assumes the cumulative conditional probability P(Y ≤ j|x) follows a link function, to Bayesian non-parametric approaches [14] and many others [15–22].
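To make the evaluation measure concrete, here is a minimal NumPy sketch of the absolute-error loss matrix of Table 1 and the expected loss above. The function names and 0-based label indexing are ours.

```python
import numpy as np

def absolute_loss_matrix(n_classes):
    """L[yhat, y] = |yhat - y| over 0-based label indices.

    For n_classes = 4 this reproduces the matrix in Table 1.
    """
    labels = np.arange(n_classes)
    return np.abs(labels[:, None] - labels[None, :]).astype(float)

def expected_absolute_loss(pred_probs, true_label):
    """E_{yhat ~ Phat(.|x)}[ |yhat - y| ] for one example with true label index y."""
    L = absolute_loss_matrix(len(pred_probs))
    return float(pred_probs @ L[:, true_label])
```

For instance, a predictor that splits its probability mass evenly between the true label and an adjacent one incurs an expected absolute loss of 0.5.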
We narrow our focus over this broad range of methods found in the related work to those that can be viewed as empirical risk minimization methods with piece-wise convex surrogates, which are more closely related to our approach.

2.2 Threshold Methods for Ordinal Regression

Threshold methods are one popular family of techniques that treat the ordinal response variable, f̂ ≜ w · x, as a continuous real-valued variable and introduce |Y| − 1 thresholds θ1, θ2, ..., θ|Y|−1 that partition the real line into |Y| segments: θ0 = −∞ < θ1 < θ2 < ... < θ|Y|−1 < θ|Y| = ∞ [4]. Each segment corresponds to a label, with ŷi assigned label j if θj−1 < f̂ ≤ θj. There are two different approaches for constructing surrogate losses based on the threshold methods to optimize the choice of w and θ1, ..., θ|Y|−1: one is based on penalizing all thresholds involved when a mistake is made, and one is based on penalizing only the most immediate thresholds.

All thresholds methods penalize every erroneous threshold using a surrogate loss, δ, for sets of binary classification problems:
$$\text{loss}_{\text{AT}}(\hat{f}, y) = \sum_{k=1}^{y-1} \delta\big(-(\theta_k - \hat{f})\big) + \sum_{k=y}^{|\mathcal{Y}|-1} \delta\big(\theta_k - \hat{f}\big).$$
Shashua and Levin [1] studied the hinge loss under the name of support vector machines with a sum-of-margins strategy, while Chu and Keerthi [2] proposed a similar approach under the name of support vector ordinal regression with implicit constraints (SVORIM). Lin and Li [3] proposed ordinal regression boosting, an all thresholds method using the exponential loss as a surrogate.
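The all thresholds construction above can be sketched in a few lines. This is a minimal sketch: the function name, the 1-based label convention, and the default hinge base surrogate are our choices, not prescribed by any particular package.

```python
def loss_at(f_hat, y, thetas, delta=lambda z: max(0.0, 1.0 - z)):
    """All thresholds surrogate for ordinal regression.

    f_hat  : real-valued response w . x
    y      : true label in {1, ..., K} (1-based)
    thetas : sorted thresholds [theta_1, ..., theta_{K-1}]
    delta  : base binary surrogate; defaults to the hinge loss.

    Every threshold on the wrong side of the prediction contributes a penalty:
    thresholds below the true label (k < y) should lie below f_hat, and
    thresholds at or above it (k >= y) should lie above f_hat.
    """
    below = sum(delta(-(t - f_hat)) for t in thetas[:y - 1])   # k = 1 .. y-1
    above = sum(delta(t - f_hat) for t in thetas[y - 1:])      # k = y .. K-1
    return below + above
```

With thresholds [−1, 1] and f̂ = 0, label 2 incurs no penalty, while the boundary labels are each penalized through the one threshold they violate.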
Finally, Rennie and Srebro [4] proposed a unifying approach for all threshold methods under a variety of surrogate losses.

Rather than penalizing all erroneous thresholds when an error is made, immediate thresholds methods penalize only the threshold of the true label and the threshold immediately beneath the true label:
$$\text{loss}_{\text{IT}}(\hat{f}, y) = \delta\big(-(\theta_{y-1} - \hat{f})\big) + \delta\big(\theta_y - \hat{f}\big).^1$$
Similar to the all thresholds methods, immediate thresholds methods have also been studied in the literature under different names. For hinge loss surrogates, Shashua and Levin [1] called the model support vector machines with a fixed-margin strategy, while Chu and Keerthi [2] use the term support vector ordinal regression with explicit constraints (SVOREX). For the exponential loss, Lin and Li [3] introduced ordinal regression boosting with left-right margins. Rennie and Srebro [4] also proposed a unifying framework for immediate threshold methods.

1 For the boundary labels, the method defines δ(−(θ0 − f̂)) = δ(θ|Y| − f̂) = 0.

2.3 Reduction Framework from Ordinal Regression to Binary Classification

Li and Lin [5] proposed a reduction framework to convert ordinal regression problems to binary classification problems by extending training examples. For each training sample (x, y), the reduction framework creates |Y| − 1 extended samples (x(j), y(j)) and assigns weight wy,j to each extended sample. The binary label associated with the extended sample is equivalent to the answer to the question: "is the rank of x greater than j?" The reduction framework allows a choice of how extended samples x(j) are constructed from original samples x and how to perform binary classification.
If the threshold method is used to construct the extended samples and an SVM is used as the binary classification algorithm, the classifier can be obtained by solving a family of quadratic optimization problems that includes SVORIM and SVOREX as special instances.

2.4 Cost-sensitive Classification Methods for Ordinal Regression

Rather than using thresholding or the reduction framework, ordinal regression can also be cast as a special case of cost-sensitive multiclass classification. Two of the most popular classification-based ordinal regression techniques are extensions of one-versus-one (OVO) and one-versus-all (OVA) cost-sensitive classification [6, 7]. Both algorithms leverage a transformation that converts a cost-sensitive classification problem to a set of weighted binary classification problems. Rather than reducing to binary classification, Tu and Lin [8] reduce cost-sensitive classification to one-sided regression (OSR), which can be viewed as an extension of the one-versus-all (OVA) technique.

2.5 Adversarial Prediction

Foundational results establish a duality between adversarial logarithmic loss minimization and constrained maximization of the entropy [23]. This takes the form of a zero-sum game between a predictor seeking to minimize expected logarithmic loss and an adversary seeking to maximize this same loss. Additionally, the adversary is constrained to choose a distribution that matches certain sample statistics. Ultimately, through the duality to maximum entropy, this is equivalent to maximum likelihood estimation of probability distributions that are members of the exponential family [23]. Grünwald and Dawid [9] emphasize this formulation as a justification for the principle of maximum entropy [24] and generalize the adversarial formulation to other loss functions.
Extensions to multivariate performance measures [25] and non-IID settings [26] have demonstrated the versatility of this perspective.

Recent analysis [27, 28] has shown that for the special case of zero-one loss classification, this adversarial formulation is equivalent to empirical risk minimization with a surrogate loss function:
$$\text{AL}^{0\text{-}1}_f(x_i, y_i) = \max_{S \subseteq \{1,\ldots,|\mathcal{Y}|\},\, S \neq \emptyset} \frac{\sum_{j \in S} \psi_{j,y_i}(x_i) + |S| - 1}{|S|}, \qquad (1)$$
where ψj,yi(xi) is the potential difference ψj,yi(xi) = fj(xi) − fyi(xi). This surrogate loss function provides a key theoretical advantage compared to the Crammer–Singer hinge loss surrogate for multiclass classification [29]: it guarantees Fisher consistency [27], while Crammer–Singer—despite its popularity in many applications, such as structured SVMs [30, 31]—does not [32, 33]. In this paper, we extend this type of analysis to the ordinal regression setting with the absolute error as the loss function, producing novel surrogate loss functions that provide better predictions than other convex, piece-wise linear surrogates.

3 Adversarial Ordinal Regression

3.1 Formulation as a zero-sum game

We seek the ordinal regression predictor that is the most robust to uncertainty given partial knowledge of the evaluating distribution's characteristics.
This takes the form of a zero-sum game between a predictor player choosing a predicted label distribution P̂(ŷ|x) that minimizes loss and an adversarial player choosing an evaluation distribution P̌(y̌|x) that maximizes loss while closely matching the feature-based statistics of the training data:
$$\min_{\hat{P}(\hat{y}|x)}\ \max_{\check{P}(\check{y}|x)}\ \mathbb{E}_{X \sim P;\ \hat{Y}|X \sim \hat{P};\ \check{Y}|X \sim \check{P}}\Big[\,\big|\hat{Y} - \check{Y}\big|\,\Big] \quad \text{such that:}\ \ \mathbb{E}_{X \sim P;\ \check{Y}|X \sim \check{P}}\big[\phi(X, \check{Y})\big] = \tilde{\phi}. \qquad (2)$$
The vector of feature moments, φ̃ = E_{X,Y ∼ P̃}[φ(X, Y)], is measured from sample training data distributed according to the empirical distribution P̃(x, y).

An ordinal regression problem can be viewed as a cost-sensitive loss with the entries of the cost matrix defined by the absolute loss between the row and column labels (an example of the cost matrix for the case of a problem with four labels is shown in Table 1). Following the construction of adversarial prediction games for cost-sensitive classification [10], the optimization of Eq.
(2) reduces to minimizing the equilibrium game values of a new set of zero-sum games characterized by the matrix L′xi,w:
$$\min_{w} \sum_i \underbrace{\max_{\check{p}_{x_i}} \min_{\hat{p}_{x_i}}\ \hat{p}_{x_i}^{\mathsf{T}} L'_{x_i,w}\, \check{p}_{x_i}}_{\text{zero-sum game}};\quad
L'_{x_i,w} = \begin{bmatrix}
f_1 - f_{y_i} & \cdots & f_{|\mathcal{Y}|} - f_{y_i} + |\mathcal{Y}| - 1 \\
f_1 - f_{y_i} + 1 & \cdots & f_{|\mathcal{Y}|} - f_{y_i} + |\mathcal{Y}| - 2 \\
\vdots & \ddots & \vdots \\
f_1 - f_{y_i} + |\mathcal{Y}| - 1 & \cdots & f_{|\mathcal{Y}|} - f_{y_i}
\end{bmatrix}, \qquad (3)$$
where: w represents a vector of Lagrangian model parameters; fj = w · φ(xi, j) is a Lagrangian potential; p̂xi is a vector representation of the conditional label distribution, P̂(Ŷ = j|xi), i.e., p̂xi = [P̂(Ŷ = 1|xi) P̂(Ŷ = 2|xi) ...]ᵀ; and p̌xi is similarly defined. The entries of L′xi,w are |ŷ − y̌| + fy̌ − fyi, making it a zero-sum game matrix for each example, and the outer optimization over w is convex. This optimization problem (Eq. (3)) is convex in w, and the inner zero-sum game can be solved using a linear program [10]. To address finite sample estimation errors, the difference between expected and sample features can be bounded in Eq. (2), ||E_{X ∼ P; Y̌|X ∼ P̌}[φ(X, Y̌)] − φ̃|| ≤ ε, leading to Lagrangian parameter regularization in Eq.
(3) [34].

3.2 Feature representations

We consider two feature representations corresponding to different training data summaries:
$$\phi_{\text{th}}(x, y) = \begin{pmatrix} yx \\ \mathrm{I}(y \leq 1) \\ \mathrm{I}(y \leq 2) \\ \vdots \\ \mathrm{I}(y \leq |\mathcal{Y}| - 1) \end{pmatrix};\quad \text{and}\quad
\phi_{\text{mc}}(x, y) = \begin{pmatrix} \mathrm{I}(y = 1)x \\ \mathrm{I}(y = 2)x \\ \mathrm{I}(y = 3)x \\ \vdots \\ \mathrm{I}(y = |\mathcal{Y}|)x \end{pmatrix}. \qquad (4)$$
The first, which we call the thresholded regression representation, has size m + |Y| − 1, where m is the dimension of our input space. It induces a single shared vector of feature weights and a set of thresholds. If we denote the weight vector associated with the yx term as w and the terms associated with each sum of class indicator functions as θ1, θ2, ..., θ|Y|−1, then thresholds for switching between class j and j + 1 (ignoring other classes) occur when w · xi = θj.

The second feature representation, φmc, which we call the multiclass representation, has size m|Y| and can be equivalently interpreted as inducing a set of class-specific feature weights, fj = wj · xi. This feature representation is useful when ordered labels cannot be thresholded according to any single direction in the input space, as shown in the example dataset of Figure 1.

[Figure 1: Example where multiple weight vectors are useful.]

3.3 Adversarial Loss from the Nash Equilibrium

We now present the main technical contribution of our paper: a surrogate loss function that, when minimized, produces a solution to the adversarial ordinal regression problem of Eq. (3).²

Theorem 1.
An adversarial ordinal regression predictor is obtained by choosing parameters w that minimize the empirical risk of the surrogate loss function:
$$\text{AL}^{\text{ord}}_w(x_i, y_i) = \max_{j,l \in \{1,\ldots,|\mathcal{Y}|\}} \frac{f_j + f_l + j - l}{2} - f_{y_i} = \max_j \frac{f_j + j}{2} + \max_l \frac{f_l - l}{2} - f_{y_i}, \qquad (5)$$
where fj = w · φ(xi, j) for all j ∈ {1, ..., |Y|}.

Proof sketch. Let j*, l* be the solution of argmax_{j,l ∈ {1,...,|Y|}} (fj + fl + j − l)/2. We show that the Nash equilibrium value of a game matrix that contains only rows j* and l* and columns j* and l* from the matrix L′xi,w is exactly (fj* + fl* + j* − l*)/2. We then show that adding the other rows and columns of L′xi,w back to the game matrix does not change the game value. Given the resulting closed-form solution of the game (instead of a minimax), we can recast the adversarial framework for ordinal regression as empirical risk minimization with the proposed loss.

We note that the AL^ord_w surrogate is the maximization over pairs of different potential functions associated with each class (including pairs of identical class labels) added to the distance between the pair. For both of our feature representations, we make use of the fact that the maximization over each element of the pair can be realized independently, as shown on the right-hand side of Eq. (5).

Thresholded regression surrogate loss

In the thresholded regression feature representation, the parameter contains a single shared vector of feature weights w and |Y| − 1 terms θk associated with thresholds. Following Eq.
(5), the adversarial ordinal regression surrogate loss for this feature representation can be written as:
$$\text{AL}^{\text{ord-th}}(x_i, y_i) = \max_j \frac{j(w \cdot x_i + 1) + \sum_{k \geq j} \theta_k}{2} + \max_l \frac{l(w \cdot x_i - 1) + \sum_{k \geq l} \theta_k}{2} - y_i\, w \cdot x_i - \sum_{k \geq y_i} \theta_k. \qquad (6)$$
This loss has a straightforward interpretation in terms of the thresholded regression perspective, as shown in Figure 2: it is based on averaging the thresholded label predictions for potentials w · xi + 1 and w · xi − 1. This penalization of a pair of thresholds differs from the thresholded surrogate losses of related work, which either penalize all violated thresholds or penalize only the thresholds adjacent to the actual class label.

Using a binary search procedure over θ1, ..., θ|Y|−1, the largest lower-bounding threshold for each of these potentials can be obtained in O(log |Y|) time.

[Figure 2: Surrogate loss calculation for datapoint xi (projected to w · xi) with a label prediction of 4. For predictive purposes, the surrogate loss is instead obtained using potentials for the classes based on w · xi + 1 (label 5) and w · xi − 1 (label 2), averaged together.]

Multiclass ordinal surrogate loss

In the multiclass feature representation, we have a set of class-specific feature weights wj for each label, and the adversarial multiclass ordinal surrogate loss can be written as:
$$\text{AL}^{\text{ord-mc}}(x_i, y_i) = \max_{j,l \in \{1,\ldots,|\mathcal{Y}|\}} \frac{w_j \cdot x_i + w_l \cdot x_i + j - l}{2} - w_{y_i} \cdot x_i. \qquad (7)$$

² The detailed proof of this theorem and others are contained in the supplementary materials.
Proof sketches are presented in the main paper.

[Figure 3: Loss function contour plots of AL^ord over the space of potential differences ψj ≜ fj − fyi for the prediction task with three classes when the true label is yi = 1 (a), yi = 2 (b), and yi = 3 (c).]

We can also view this as the maximization over |Y|(|Y| + 1)/2 linear hyperplanes. For an ordinal regression problem with three classes, the loss has six facets with different shapes for each true label value, as shown in Figure 3. In contrast with AL^ord-th, the class label potentials for AL^ord-mc may differ from one another in more-or-less arbitrary ways. Thus, searching for the maximal j and l class labels requires O(|Y|) time.

3.4 Consistency Properties

The behavior of a prediction method in ideal learning settings—i.e., trained on the true evaluation distribution and given an arbitrarily rich feature representation, or, equivalently, considering the space of all measurable functions—provides a useful theoretical validation. Fisher consistency requires that the prediction model yields the Bayes optimal decision boundary [32, 33, 35] in this setting. Given the true label conditional probability Pj(x) ≜ P(Y = j|x), a surrogate loss function δ is said to be Fisher consistent with respect to the loss ℓ if the minimizer f* of the surrogate loss achieves the Bayes optimal risk, i.e.:
$$f^* = \operatorname*{argmin}_{f}\ \mathbb{E}_{Y|X \sim P}\big[\delta_f(X, Y)\,\big|\,X = x\big] \;\Rightarrow\; \mathbb{E}_{Y|X \sim P}\big[\ell_{f^*}(X, Y)\,\big|\,X = x\big] = \min_f\ \mathbb{E}_{Y|X \sim P}\big[\ell_f(X, Y)\,\big|\,X = x\big]. \qquad (8)$$
Ramaswamy and Agarwal [36] provide a necessary and sufficient condition for a surrogate loss to be Fisher consistent with respect to general multiclass losses, which include ordinal regression losses. A recent analysis by Pedregosa et al.
[35] shows that the all thresholds and the immediate thresholds methods are Fisher consistent provided that the base binary surrogate losses they use are convex with a negative derivative at zero.

For our proposed approach, the condition for Fisher consistency above is equivalent to:
$$f^* = \operatorname*{argmin}_{f} \sum_y P_y \left[ \max_{j,l} \frac{f_j + f_l + j - l}{2} - f_y \right] \;\Rightarrow\; \operatorname*{argmax}_{j} f^*_j(x) \subseteq \operatorname*{argmin}_{j} \sum_y P_y\, |j - y|. \qquad (9)$$
Since adding a constant to all fj changes neither the value of AL^ord_f nor argmax_j fj(x), we employ the constraint max_j fj(x) = 0 to remove redundant solutions for the consistency analysis. We establish an important property of the minimizer of AL^ord_f in the following theorem.

Theorem 2. The minimizer vector f* of E_{Y|X ∼ P}[AL^ord_f(X, Y)|X = x] satisfies the loss reflective property, i.e., it complements the absolute error by starting with a negative integer value, increasing by one until reaching zero, and then incrementally decreasing again.

Proof sketch. We show that for any f⁰ that does not satisfy the loss reflective property, we can construct, in several steps, an f¹ that satisfies the loss reflective property and has an expected loss value less than the expected loss of f⁰.

Example vectors f* that satisfy Theorem 2 are [0, −1, −2]ᵀ, [−1, 0, −1]ᵀ, and [−2, −1, 0]ᵀ for three-class problems, and [−3, −2, −1, 0, −1]ᵀ for five-class problems. Using this key property of the minimizer, we establish the consistency of our loss functions in the following theorem.

Theorem 3. The adversarial ordinal regression surrogate loss AL^ord from Eq. (5) is Fisher consistent.

Proof sketch. We only consider the |Y| possible values of f that satisfy the loss reflective property.
For the f that corresponds to class j, the value of the expected loss is equal to the Bayes loss if we predict j as the label. Therefore, minimizing over the f that satisfy the loss reflective property is equivalent to finding the Bayes optimal response.

3.5 Optimization

3.5.1 Primal Optimization

To optimize the regularized adversarial ordinal regression loss from the primal, we employ stochastic average gradient (SAG) methods [37, 38], which have been shown to converge faster than standard stochastic gradient optimization. The idea of SAG is to use the gradient of each example from the last iteration in which it was selected to take a step. However, the naïve implementation of SAG requires storing the gradient of each sample, which may be expensive in terms of memory. Fortunately, for our loss AL^ord_w, we can drastically reduce this memory requirement by storing just a pair of numbers, (j*, l*) = argmax_{j,l ∈ {1,...,|Y|}} (fj + fl + j − l)/2, rather than the gradient of each sample. Appendix C explains the details of this technique.

3.5.2 Dual Optimization

Dual optimization is often preferred when optimizing piecewise linear losses, such as the hinge loss, since it enables one to easily perform the kernel trick and obtain a non-linear decision boundary without heavily sacrificing computational efficiency. Optimizing the regularized adversarial ordinal regression loss in the dual can be performed by solving the following quadratic program:
$$\begin{aligned}
\max_{\alpha,\beta}\ & \sum_{i,j} j(\alpha_{i,j} - \beta_{i,j}) - \frac{1}{2} \sum_{i,j,k,l} (\alpha_{i,j} + \beta_{i,j})(\alpha_{k,l} + \beta_{k,l})\, \big(\phi(x_i, j) - \phi(x_i, y_i)\big) \cdot \big(\phi(x_k, l) - \phi(x_k, y_k)\big) \\
\text{subject to:}\ & \alpha_{i,j} \geq 0;\ \ \beta_{i,j} \geq 0;\ \ \sum_j \alpha_{i,j} = \tfrac{C}{2};\ \ \sum_j \beta_{i,j} = \tfrac{C}{2};\ \ i, k \in \{1, \ldots, n\};\ \ j, l \in \{1, \ldots, |\mathcal{Y}|\}.
\end{aligned} \qquad (10)$$
Note that our dual formulation depends only on dot products of the features. Therefore, we can also easily apply the kernel trick in our algorithm. Appendix D describes the derivation from the primal optimization to the dual optimization above.

4 Experiments

4.1 Experiment Setup

We conduct our experiments on a benchmark dataset for ordinal regression [14], evaluate the performance using the mean absolute error (MAE), and perform statistical tests on the results of different hinge loss surrogate methods. The benchmark contains datasets taken from the UCI Machine Learning repository [39], ranging from relatively small to relatively large in size. The characteristics of the datasets, including the number of classes, the training set size, the testing set size, and the number of features, are described in Table 2.

Table 2: Dataset properties.

    Dataset       #class  #train  #test  #features
    diabetes         5       30     13       2
    pyrimidines      5       51     23      27
    triazines        5      130     56      60
    wisconsin        5      135     59      32
    machinecpu      10      146     63       6
    autompg         10      274    118       7
    boston           5      354    152      13
    stocks           5      665    285       9
    abalone         10     2923   1254      10
    bank            10     5734   2458       8
    computer        10     5734   2458      21
    calhousing      10    14447   6193       8

In the experiments, we consider different methods using the original feature space and a kernelized feature space using the Gaussian radial basis function kernel. The methods that we compare include two variations of our approach: the threshold-based (AL^ord-th) and the multiclass-based (AL^ord-mc).

Table 3: The average of the mean absolute error (MAE) for each model.
Bold numbers in each case indicate that the result is the best or not significantly worse than the best (paired t-test with α = 0.05).

                 Threshold-based models          Multiclass-based models
    Dataset      ALord-th REDth  AT     IT      ALord-mc REDmc  CSOSR  CSOVO  CSOVA
    diabetes      0.696   0.715  0.731  0.827   0.629    0.700  0.715  0.738  0.762
    pyrimidines   0.654   0.678  0.615  0.626   0.509    0.565  0.520  0.576  0.526
    triazines     0.607   0.683  0.649  0.654   0.670    0.673  0.677  0.738  0.732
    wisconsin     1.077   1.067  1.097  1.175   1.136    1.141  1.208  1.275  1.338
    machinecpu    0.449   0.456  0.458  0.467   0.518    0.515  0.646  0.602  0.702
    autompg       0.551   0.550  0.550  0.617   0.599    0.602  0.741  0.598  0.731
    boston        0.316   0.304  0.306  0.298   0.311    0.311  0.353  0.294  0.363
    stocks        0.324   0.317  0.315  0.324   0.168    0.175  0.204  0.147  0.213
    abalone       0.551   0.547  0.546  0.571   0.521    0.520  0.545  0.558  0.556
    bank          0.461   0.460  0.461  0.461   0.445    0.446  0.732  0.448  0.989
    computer      0.640   0.635  0.633  0.683   0.625    0.624  0.889  0.649  1.055
    calhousing    1.190   1.183  1.182  1.225   1.164    1.144  1.237  1.202  1.601
    average       0.626   0.633  0.629  0.661   0.613    0.618  0.706  0.652  0.797
    # bold        5       5      4      2       5        5      2      2      1

The baselines we use for the threshold-based models include an SVM-based reduction framework algorithm (REDth) [5], an all thresholds method with hinge loss (AT) [1, 2], and an immediate thresholds method with hinge loss (IT) [1, 2]. For the multiclass-based models, we compare our method with an SVM-based reduction algorithm using multiclass features (REDmc) [5], with cost-sensitive one-sided support vector regression (CSOSR) [8], with cost-sensitive one-versus-one SVM (CSOVO) [7], and with cost-sensitive one-versus-all SVM (CSOVA) [6].
For our Gaussian kernel experiment, we compare our threshold-based model (AL^ord-th) with SVORIM and SVOREX [2].

In our experiments, we first make 20 random splits of each dataset into training and testing sets. We perform two stages of five-fold cross validation on the first split's training set to tune each model's regularization constant λ. In the first stage, the possible values for λ are 2^{−i}, i ∈ {1, 3, 5, 7, 9, 11, 13}. Using the best λ from the first stage, we set the possible values for λ in the second stage to 2^{i/2} λ0, i ∈ {−3, −2, −1, 0, 1, 2, 3}, where λ0 is the best parameter obtained in the first stage. Using the parameter selected in the second stage, we train each model on the 20 training sets and evaluate the MAE performance on the corresponding testing sets. We then perform a statistical test to determine whether the performance of a model differs with statistical significance from that of the other models. We perform the Gaussian kernel experiment similarly, with the model parameter C equal to 2^i, i ∈ {0, 3, 6, 9, 12}, and the kernel parameter γ equal to 2^i, i ∈ {−12, −9, −6, −3, 0} in the first stage. In the second stage, we set C equal to 2^i C0, i ∈ {−2, −1, 0, 1, 2}, and γ equal to 2^i γ0, i ∈ {−2, −1, 0, 1, 2}, where C0 and γ0 are the best parameters obtained in the first stage.

4.2 Results

We report the mean absolute error (MAE) averaged over the dataset splits in Table 3 and Table 4. We highlight in boldface the results that are either the best or not significantly worse than the best (under a paired t-test with α = 0.05).
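The evaluation protocol above can be sketched in a few lines: MAE per split, followed by a paired t-test over the 20 split-level MAEs of two models. This is an illustrative sketch, not the authors' evaluation code; the function names are ours, and the two-sided critical value for df = 19 at α = 0.05 (≈ 2.093) is hard-coded rather than computed from a t-distribution.

```python
import numpy as np

def mae(y_pred, y_true):
    """Mean absolute error of label predictions on one test split."""
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))

def paired_t_statistic(maes_a, maes_b):
    """t statistic of the paired differences between two models' per-split MAEs."""
    d = np.asarray(maes_a, dtype=float) - np.asarray(maes_b, dtype=float)
    return float(np.sqrt(len(d)) * d.mean() / d.std(ddof=1))

def significantly_different(maes_a, maes_b, t_crit=2.093):
    """Two-sided paired t-test decision; t_crit defaults to df = 19, alpha = 0.05."""
    return abs(paired_t_statistic(maes_a, maes_b)) > t_crit
```

In practice one would use a library routine such as `scipy.stats.ttest_rel` for the p-value; the sketch only shows the shape of the decision rule used to mark "indistinguishably best" results.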
At the bottom of each table, we also summarize each model's MAE averaged over all datasets and the number of datasets for which the model is marked in boldface.
As we can see from Table 3, in the experiment with the original feature space, threshold-based models perform well on relatively small datasets, whereas multiclass-based models perform well on relatively large datasets. A possible explanation for this result is that multiclass-based models have more flexibility in creating decision boundaries, and hence perform better when the training data size is sufficient. However, since multiclass-based models have many more parameters than threshold-based models (m|Y| parameters rather than m + |Y| − 1 parameters), multiclass methods may need more data, and hence may not perform well on relatively small datasets.
In the threshold-based models comparison, ALord-th, REDth, and AT perform competitively on relatively small datasets like triazines, wisconsin, machinecpu, and autompg. ALord-th has a slight advantage over REDth in overall accuracy, and a slight advantage over AT in the number of "indistinguishably best" performances across all datasets. We can also see that AT is superior to IT in the experiments under the original feature space.
Among the multiclass-based models, ALord-mc and REDmc perform competitively on datasets like abalone, bank, and computer, with a slight advantage for the ALord-mc model in overall accuracy. In general, the cost-sensitive models perform poorly compared with ALord-mc and REDmc. A notable exception is the CSOVO model, which performs very well on the stocks and boston datasets.
In the Gaussian kernel experiment, we can see from Table 4 that the kernelized version of ALord-th performs significantly better than the threshold-based models SVORIM and SVOREX in terms of both overall accuracy and the number of "indistinguishably best" performances across all datasets. We also note that the immediate-threshold-based model (SVOREX) performs better than the all-threshold-based model (SVORIM) in our experiment using the Gaussian kernel. We can conclude that our proposed adversarial losses for ordinal regression perform competitively compared to state-of-the-art ordinal regression models using both original feature spaces and kernel feature spaces, with a significant performance improvement in the Gaussian kernel experiments.

Table 4: The average of MAE for models with Gaussian kernel.

Dataset        ALord-th  SVORIM  SVOREX
diabetes       0.696     0.665   0.688
pyrimidines    0.478     0.539   0.550
triazines      0.609     0.612   0.604
wisconsin      1.090     1.113   1.049
machinecpu     0.452     0.652   0.628
autompg        0.529     0.589   0.593
boston         0.278     0.324   0.316
stocks         0.103     0.099   0.100
average        0.531     0.574   0.566
# bold         7         3       4

5 Conclusion and Future Work

In this paper, we have proposed a novel surrogate loss for ordinal regression, a classification problem where the discrete class labels have an inherent order and the penalty for making mistakes is based on that order. We focused on the absolute loss, which is the most widely used ordinal regression loss. In contrast with existing methods, which typically reduce ordinal regression to binary classification problems and then employ surrogates for the binary zero-one loss, we derive a unique surrogate ordinal regression loss by seeking the predictor that is robust to a worst-case constrained approximation of the training data.
We derived two versions of the loss based on two different feature representation approaches: thresholded regression and multiclass representations. We demonstrated the benefit of our approach on a benchmark of datasets for ordinal regression tasks. Our approach performs competitively compared to the state-of-the-art surrogate losses based on hinge loss. We also demonstrated, in our experiments, cases where the multiclass feature representation works better than the thresholded regression representation, and vice versa.
Our future work will investigate less prevalent ordinal regression losses, such as the discrete quadratic loss and arbitrary losses that have v-shaped penalties. Furthermore, we plan to investigate the characteristics required of discrete ordinal losses for their optimization to have a compact analytical solution. In terms of applications, one possible direction of future work is to combine our approach with deep neural network models to perform end-to-end representation learning for ordinal regression applications like age estimation and rating prediction. In that setting, our proposed loss can be used in the last layer of a deep neural network to serve as the gradient source for the backpropagation algorithm.

Acknowledgments

This research was supported as part of the Future of Life Institute (futureoflife.org) FLI-RFP-AI1 program, grant #2016-158710, and by NSF grant RI-#1526379.

References

[1] Amnon Shashua and Anat Levin. Ranking with large margin principle: Two approaches. In Advances in Neural Information Processing Systems 15, pages 961–968. MIT Press, 2003.

[2] Wei Chu and S. Sathiya Keerthi. New approaches to support vector ordinal regression. In Proceedings of the 22nd International Conference on Machine Learning, pages 145–152. ACM, 2005.

[3] Hsuan-Tien Lin and Ling Li. Large-margin thresholded ensembles for ordinal regression: Theory and practice.
In International Conference on Algorithmic Learning Theory, pages 319–333. Springer, 2006.

[4] Jason D. M. Rennie and Nathan Srebro. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, pages 180–186, 2005.

[5] Ling Li and Hsuan-Tien Lin. Ordinal regression by extended binary classification. In Advances in Neural Information Processing Systems, 19:865, 2007.

[6] Hsuan-Tien Lin. From ordinal ranking to binary classification. PhD thesis, California Institute of Technology, 2008.

[7] Hsuan-Tien Lin. Reduction from cost-sensitive multiclass classification to one-versus-one binary classification. In Proceedings of the Sixth Asian Conference on Machine Learning, pages 371–386, 2014.

[8] Han-Hsing Tu and Hsuan-Tien Lin. One-sided support vector regression for multiclass cost-sensitive classification. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1095–1102, 2010.

[9] Peter D. Grünwald and A. Phillip Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32:1367–1433, 2004.

[10] Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classification. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2015.

[11] Subhash C. Narula and John F. Wellington. The minimum sum of absolute errors regression: A state of the art survey. International Statistical Review/Revue Internationale de Statistique, pages 317–326, 1982.

[12] Koby Crammer and Yoram Singer. Pranking with ranking. In Advances in Neural Information Processing Systems 14, 2001.

[13] Peter McCullagh. Regression models for ordinal data.
Journal of the Royal Statistical Society, Series B (Methodological), pages 109–142, 1980.

[14] Wei Chu and Zoubin Ghahramani. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6(Jul):1019–1041, 2005.

[15] Krzysztof Dembczyński, Wojciech Kotłowski, and Roman Słowiński. Ordinal classification with decision rules. In International Workshop on Mining Complex Data, pages 169–181. Springer, 2007.

[16] Mark J. Mathieson. Ordinal models for neural networks. Neural Networks in Financial Engineering, pages 523–536, 1996.

[17] Shipeng Yu, Kai Yu, Volker Tresp, and Hans-Peter Kriegel. Collaborative ordinal regression. In Proceedings of the 23rd International Conference on Machine Learning, pages 1089–1096. ACM, 2006.

[18] Jianlin Cheng, Zheng Wang, and Gianluca Pollastri. A neural network approach to ordinal regression. In IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence), pages 1279–1284. IEEE, 2008.

[19] Wan-Yu Deng, Qing-Hua Zheng, Shiguo Lian, Lin Chen, and Xin Wang. Ordinal extreme learning machine. Neurocomputing, 74(1):447–456, 2010.

[20] Bing-Yu Sun, Jiuyong Li, Desheng Dash Wu, Xiao-Ming Zhang, and Wen-Bo Li. Kernel discriminant learning for ordinal regression. IEEE Transactions on Knowledge and Data Engineering, 22(6):906–910, 2010.

[21] Jaime S. Cardoso and Joaquim F. Costa. Learning to classify ordinal data: The data replication method. Journal of Machine Learning Research, 8(Jul):1393–1429, 2007.

[22] Yang Liu, Yan Liu, and Keith C. C. Chan. Ordinal regression via manifold learning. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 398–403. AAAI Press, 2011.

[23] Flemming Topsøe. Information theoretical optimization techniques.
Kybernetika, 15(1):8–27, 1979.

[24] Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957.

[25] Hong Wang, Wei Xing, Kaiser Asif, and Brian Ziebart. Adversarial prediction games for multivariate losses. In Advances in Neural Information Processing Systems, pages 2710–2718, 2015.

[26] Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in Neural Information Processing Systems, pages 37–45, 2014.

[27] Rizal Fathony, Anqi Liu, Kaiser Asif, and Brian Ziebart. Adversarial multiclass classification: A risk minimization perspective. In Advances in Neural Information Processing Systems 29, pages 559–567, 2016.

[28] Farzan Farnia and David Tse. A minimax approach to supervised learning. In Advances in Neural Information Processing Systems, pages 4233–4241, 2016.

[29] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265–292, 2002.

[30] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.

[31] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the International Conference on Machine Learning, pages 377–384, 2005.

[32] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. The Journal of Machine Learning Research, 8:1007–1025, 2007.

[33] Yufeng Liu. Fisher consistency of multicategory support vector machines. In International Conference on Artificial Intelligence and Statistics, pages 291–298, 2007.

[34] Miroslav Dudík and Robert E. Schapire. Maximum entropy distribution estimation with generalized regularization.
In International Conference on Computational Learning Theory, pages 123–138. Springer, 2006.

[35] Fabian Pedregosa, Francis Bach, and Alexandre Gramfort. On the consistency of ordinal regression methods. Journal of Machine Learning Research, 18(55):1–35, 2017.

[36] Harish G. Ramaswamy and Shivani Agarwal. Classification calibration dimension for general multiclass losses. In Advances in Neural Information Processing Systems, pages 2078–2086, 2012.

[37] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, pages 1–30, 2013.

[38] Mark Schmidt, Reza Babanezhad, Aaron Defazio, Ann Clifton, and Anoop Sarkar. Non-uniform stochastic average gradient method for training conditional random fields. 2015.

[39] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.