{"title": "Optimizing F-Measures by Cost-Sensitive Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 2123, "page_last": 2131, "abstract": "We present a theoretical analysis of F-measures for binary, multiclass and multilabel classification. These performance measures are non-linear, but in many scenarios they are pseudo-linear functions of the per-class false negative/false positive rate. Based on this observation, we present a general reduction of F-measure maximization to cost-sensitive classification with unknown costs. We then propose an algorithm with provable guarantees to obtain an approximately optimal classifier for the F-measure by solving a series of cost-sensitive classification problems. The strength of our analysis is to be valid on any dataset and any class of classifiers, extending the existing theoretical results on F-measures, which are asymptotic in nature. We present numerical experiments to illustrate the relative importance of cost asymmetry and thresholding when learning linear classifiers on various F-measure optimization tasks.", "full_text": "Optimizing F-Measures by Cost-Sensitive Classi\ufb01cation\n\nShameem A. Puthiya Parambath, Nicolas Usunier, Yves Grandvalet\nUniversit\u00b4e de Technologie de Compi`egne \u2013 CNRS, Heudiasyc UMR 7253\n\n{sputhiya,nusunier,grandval}@utc.fr\n\nCompi`egne, France\n\nAbstract\n\nWe present a theoretical analysis of F -measures for binary, multiclass and mul-\ntilabel classi\ufb01cation. These performance measures are non-linear, but in many\nscenarios they are pseudo-linear functions of the per-class false negative/false\npositive rate. Based on this observation, we present a general reduction of F -\nmeasure maximization to cost-sensitive classi\ufb01cation with unknown costs. 
We\nthen propose an algorithm with provable guarantees to obtain an approximately\noptimal classi\ufb01er for the F -measure by solving a series of cost-sensitive classi-\n\ufb01cation problems. The strength of our analysis is to be valid on any dataset and\nany class of classi\ufb01ers, extending the existing theoretical results on F -measures,\nwhich are asymptotic in nature. We present numerical experiments to illustrate\nthe relative importance of cost asymmetry and thresholding when learning linear\nclassi\ufb01ers on various F -measure optimization tasks.\n\n1\n\nIntroduction\n\nThe F1-measure, de\ufb01ned as the harmonic mean of the precision and recall of a binary decision\nrule [20], is a traditional way of assessing the performance of classi\ufb01ers. As it favors high and bal-\nanced values of precision and recall, this performance metric is usually preferred to (label-dependent\nweighted) classi\ufb01cation accuracy when classes are highly imbalanced and when the cost of a false\npositive relatively to a false negative is not naturally given for the problem at hand. The design of\nmethods to optimize F1-measure and its variants for multilabel classi\ufb01cation (the micro-, macro-,\nper-instance-F1-measures, see [23] and Section 2), and the theoretical analysis of the optimal clas-\nsi\ufb01ers for such metrics have received considerable interest in the last 3-4 years [6, 15, 4, 18, 5, 13],\nespecially because rare classes appear naturally on most multilabel datasets with many labels.\nThe most usual way of optimizing F1-measure is to perform a two-step approach in which \ufb01rst a\nclassi\ufb01er which output scores (e.g. a margin-based classi\ufb01er) is learnt, and then the decision thresh-\nold is tuned a posteriori. 
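The second step of this two-step approach can be sketched in a few lines; a minimal numpy sketch, where `scores` stands for the outputs of an already-trained scoring classifier (the scoring model itself is assumed, not part of the paper):

```python
import numpy as np

def f1_score(y, pred):
    # F1 = harmonic mean of precision and recall = 2TP / (2TP + FP + FN)
    tp = np.sum((y == 1) & (pred == 1))
    return 2 * tp / (2 * tp + np.sum(pred != y)) if tp else 0.0

def tune_threshold(scores, y):
    # second step: scan candidate thresholds on held-out data,
    # keep the one maximizing the F1-measure
    best_t, best_f1 = -np.inf, -1.0
    for t in np.unique(scores):
        f = f1_score(y, (scores >= t).astype(int))
        if f > best_f1:
            best_t, best_f1 = t, f
    return best_t, best_f1
```

For instance, with `scores = [0.1, 0.4, 0.6, 0.9]` and labels `[0, 1, 1, 1]`, the scan selects the threshold 0.4, which yields a perfect F1 of 1.0.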
Such an approach is theoretically grounded in binary classi\ufb01cation [15] and\nfor micro- or macro-F1-measures of multilabel classi\ufb01cation [13] in that a Bayes-optimal classi\ufb01er\nfor the corresponding F1-measure can be obtained by thresholding posterior probabilities of classes\n(the threshold, however, depends on properties of the whole distribution and cannot be known in ad-\nvance). Thus, such arguments are essentially asymptotic since the validity of the procedure is bound\nto the ability to accurately estimate all the level sets of the posterior probabilities; in particular, the\nproof does not hold if one wants to \ufb01nd the optimal classi\ufb01er for the F1-measure over an arbitrary\nset of classi\ufb01ers (e.g. thresholded linear functions).\nIn this paper, we show that optimizing the F1-measure in binary classi\ufb01cation over any (possibly\nrestricted) class of functions and over any data distribution (population-level or on a \ufb01nite sample)\ncan be reduced to solving an (in\ufb01nite) series of cost-sensitive classi\ufb01cation problems, but the cost\nspace can be discretized to obtain approximately optimal solutions. For binary classi\ufb01cation, as\nwell as for multilabel classi\ufb01cation (micro-F1-measure in general and the macro-F1-measure when\ntraining independent classi\ufb01ers per class), the discretization can be made along a single real-valued\n\n1\n\n\fvariable in [0, 1] with approximation guarantees. Asymptotically, our result is, in essence, equivalent\nto prior results since Bayes-optimal classi\ufb01ers for cost-sensitive classi\ufb01cation are precisely given by\nthresholding the posterior probabilities, and we recover the relationship between the optimal F1-\nmeasure and the optimal threshold given by Lipton et al.\n[13]. Our reduction to cost-sensitive\nclassi\ufb01cation, however, is strictly more general. 
Our analysis is based on the pseudo-linearity of\nthe F1-scores (the level sets, as function of the false negative rate and the false positive rate are\nlinear) and holds in any asymptotic or non-asymptotic regime, with any arbitrary set of classi\ufb01ers\n(without the requirement to output scores or accurate posterior probability estimates). Our formal\nframework and the de\ufb01nition of pseudo-linearity is presented in the next section, and the reduction\nto cost-sensitive classi\ufb01cation is presented in Section 2.\nWhile our main contribution is the theoretical part, we also turn out to the practical suggestions of our\nresults. In particular, they suggest that, for binary classi\ufb01cation, learning cost-sensitive classi\ufb01ers\nmay be more effective than thresholding probabilities. This is in-line with Musicant et al. [14],\nalthough their argument only applies to SVM and does not consider the F1-measure itself but a\ncontinuous, non-convex approximation of it. Some experimental results are presented in Section 4,\nbefore the conclusion of the paper.\n\n2 Pseudo-Linearity and F -Measures\n\nOur results are mainly motivated by the maximization of F -measures for binary and multilabel\nclassi\ufb01cation. They are based on a general property of these performance metrics, namely their\npseudo-linearity with respect to the false negative/false positive probabilities.\nFor binary classi\ufb01cation, the results we prove in Section 3 are that in order to optimize the F -\nmeasure, it is suf\ufb01cient to solve a binary classi\ufb01cation problem with different costs allocated to false\npositive and false negative errors (Proposition 4). However, these costs are not known a priori, so in\npractice we need to learn several classi\ufb01ers with different costs, and choose the best one (according\nto the F -score) in a second step. 
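In outline, the procedure looks as follows. This is a toy sketch, not the SVM/logistic-regression instantiations used later in the paper: the inner cost-sensitive learner is a stand-in that simply picks the weighted-error-minimizing threshold on one-dimensional scores, with F1 costs (2 − t, t):

```python
import numpy as np

def f1(y, pred):
    tp = np.sum((y == 1) & (pred == 1))
    return 2 * tp / (2 * tp + np.sum(pred != y)) if tp else 0.0

def weighted_error(y, pred, c_fn, c_fp):
    # cost-sensitive risk: c_fn * FN-rate + c_fp * FP-rate
    return (c_fn * np.mean((y == 1) & (pred == 0))
            + c_fp * np.mean((y == 0) & (pred == 1)))

def fit_cost_sensitive(scores, y, c_fn, c_fp):
    # stand-in learner: pick the score threshold minimizing the weighted error
    thresholds = np.r_[np.unique(scores), np.inf]
    return min(thresholds,
               key=lambda th: weighted_error(y, (scores >= th).astype(int), c_fn, c_fp))

def maximize_f1(scores, y, grid=np.linspace(0.1, 1.9, 19)):
    # outer loop: solve one cost-sensitive problem per candidate cost
    # asymmetry t, then keep the classifier with the best F1-measure
    best = -1.0
    for t in grid:
        th = fit_cost_sensitive(scores, y, 2.0 - t, t)
        best = max(best, f1(y, (scores >= th).astype(int)))
    return best
```

The grid over t in {0.1, ..., 1.9} mirrors the cost grid used in the experiments; any consistent cost-sensitive learner can replace the threshold rule.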
Propositions 5 and 6 provide approximation guarantees on the\nF -score we can obtain by following this principle depending on the granularity of the search in the\ncost space.\nOur results are not speci\ufb01c to the F1-measure in binary classi\ufb01cation, and they naturally extend to\nother cases of F -measures with similar functional forms. For that reason, we present the results and\nprove them directly for the general case, following the framework that we describe in this section.\nWe \ufb01rst present the machine learning framework we consider, and then give the general de\ufb01nition of\npseudo-convexity. Then, we provide examples of F -measures for binary, multilabel and multiclass\nclassi\ufb01cation and we show how they \ufb01t into this framework.\n\n2.1 Notation and De\ufb01nitions\nWe are given (i) a measurable space X \u00d7Y, where X is the input space and Y is the (\ufb01nite) prediction\nset, (ii) a probability measure \u00b5 over X \u00d7 Y, and (iii) a set of (measurable) classi\ufb01ers H from the\ninput space X to Y. We distinguish here the prediction set Y from the label space L = {1, ..., L}: in\nbinary or single-label multi-class classi\ufb01cation, the prediction set Y is the label set L, but in multi-\nlabel classi\ufb01cation, Y = 2L is the powerset of the set of possible labels. In that framework, we\nassume that we have an i.i.d. sample drawn from an underlying data distribution P on X \u00d7 Y. The\nempirical distribution of this \ufb01nite training (or test) sample will be denoted \u02c6P. Then, we may take\n\u00b5 = P to get results at the population level (concerning expected errors), or we may take \u00b5 = \u02c6P\nto get results on a \ufb01nite sample. Likewise, H can be a restricted set of functions such as linear\nclassi\ufb01ers if X is a \ufb01nite-dimensional vector space, or may be the set of all measurable classi\ufb01ers\nfrom X to Y to get results in terms of Bayes-optimal predictors. 
Finally, when needed, we will use bold characters for vectors and normal font with subscripts for indexing.\nThroughout the paper, we need the notion of pseudo-linearity of a function, which itself is defined from the notion of pseudo-convexity (see e.g. [3, Definition 3.2.1]): a differentiable function F : D ⊂ R^d → R, defined on a convex open subset of R^d, is pseudo-convex if\n\n∀e, e′ ∈ D , F(e) > F(e′) ⇒ ⟨∇F(e), e′ − e⟩ < 0 ,\n\nwhere ⟨·, ·⟩ is the canonical dot product on R^d.\nMoreover, F is pseudo-linear if both F and −F are pseudo-convex. The important property of pseudo-linear functions is that their level sets are hyperplanes (intersected with the domain), and that sublevel and superlevel sets are half-spaces, all of these hyperplanes being defined by the gradient.\nIn practice, working with gradients of non-linear functions may be cumbersome, so we will use the following characterization, which is a rephrasing of [3, Theorem 3.3.9]:\n\nTheorem 1 ([3]) A non-constant function F : D → R, defined and differentiable on the open convex set D ⊆ R^d, is pseudo-linear on D if and only if ∀e ∈ D , ∇F(e) ≠ 0 , and there exist a : R → R^d and b : R → R such that, for any t in the image of F :\n\nF(e) ≥ t ⇔ ⟨a(t), e⟩ + b(t) ≤ 0 and F(e) ≤ t ⇔ ⟨a(t), e⟩ + b(t) ≥ 0 .\n\nPseudo-linearity is the main property of fractional-linear functions (ratios of linear functions). Indeed, let us consider F : e ∈ R^d ↦ (α + ⟨β, e⟩)/(γ + ⟨δ, e⟩) with α, γ ∈ R and β and δ in R^d. 
If we restrict the domain of F to the set {e ∈ R^d | γ + ⟨δ, e⟩ > 0}, then, for all t in the image of F and all e in its domain, we have: F(e) ≤ t ⇔ ⟨tδ − β, e⟩ + tγ − α ≥ 0 , and the analogous equivalence obtained by reversing the inequalities holds as well; the function thus satisfies the conditions of Theorem 1. As we shall see, many F-scores can be written as fractional-linear functions.\n\n2.2 Error Profiles and F-Measures\n\nFor all classification tasks (binary, multiclass and multilabel), the F-measures we consider are functions of per-class recall and precision, which themselves are defined in terms of the marginal probabilities of classes and the per-class false negative/false positive probabilities. The marginal probability of label k will be denoted by Pk, and the per-class false negative/false positive probabilities of a classifier h are denoted by FNk(h) and FPk(h). Their definitions are given below:\n\n(binary/multiclass) Pk = μ({(x, y) | y = k}), FNk(h) = μ({(x, y) | y = k and h(x) ≠ k}) , FPk(h) = μ({(x, y) | y ≠ k and h(x) = k}) .\n(multilabel) Pk = μ({(x, y) | k ∈ y}), FNk(h) = μ({(x, y) | k ∈ y and k ∉ h(x)}) , FPk(h) = μ({(x, y) | k ∉ y and k ∈ h(x)}) .\n\nThese probabilities of a classifier h are then summarized by the error profile E(h):\n\nE(h) = (FN1(h), FP1(h), ..., FNL(h), FPL(h)) ∈ R^{2L} ,\n\nso that e_{2k−1} is the false negative probability for class k and e_{2k} is the false positive probability.\n\nBinary Classification In binary classification, we have FN2 = FP1 and we write F-measures only by reference to class 1. 
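On a finite sample, the error profile E(h) is simply a vector of empirical frequencies under μ = P̂; a minimal sketch for the binary/multiclass case (the label and prediction arrays in the usage example are made up for illustration):

```python
import numpy as np

def error_profile(y, y_pred, L):
    """Empirical error profile E(h) = (FN_1, FP_1, ..., FN_L, FP_L),
    for labels in {1, ..., L}, under the empirical distribution mu = P-hat."""
    n = len(y)
    e = np.empty(2 * L)
    for k in range(1, L + 1):
        e[2 * k - 2] = np.sum((y == k) & (y_pred != k)) / n   # FN_k
        e[2 * k - 1] = np.sum((y != k) & (y_pred == k)) / n   # FP_k
    return e
```

For example, with true labels `[1, 1, 2, 2]` and predictions `[1, 2, 2, 2]`, the profile is (FN1, FP1, FN2, FP2) = (0.25, 0, 0, 0.25).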
Then, for any β > 0 and any binary classifier h, the Fβ-measure is\n\nFβ(h) = (1 + β²)(P1 − FN1(h)) / ((1 + β²)P1 − FN1(h) + FP1(h)) .\n\nThe F1-measure, which is the most widely used, corresponds to the case β = 1. We can immediately notice that Fβ is fractional-linear, hence pseudo-linear, with respect to FN1 and FP1. Thus, with a slight (yet convenient) abuse of notation, we write the Fβ-measure for binary classification as a function of vectors in R^4 = R^{2L} which represent error profiles of classifiers:\n\n(binary) ∀e ∈ R^4, Fβ(e) = (1 + β²)(P1 − e1) / ((1 + β²)P1 − e1 + e2) .\n\nMultilabel Classification In multilabel classification, there are several definitions of F-measures. For those based on the error profiles, we first have the macro-F-measures (denoted by MFβ), which is the average over class labels of the Fβ-measures of each binary classification problem associated to the prediction of the presence/absence of a given class:\n\n(multilabel–macro) MFβ(e) = (1/L) ∑_{k=1}^{L} (1 + β²)(Pk − e_{2k−1}) / ((1 + β²)Pk − e_{2k−1} + e_{2k}) .\n\nMFβ is not a pseudo-linear function of an error profile e. However, if the multilabel classification algorithm learns independent binary classifiers for each class (a method known as one-vs-rest or binary relevance [23]), then each binary problem becomes independent and optimizing the macro-F-score boils down to independently maximizing the Fβ-score for L binary classification problems, so that optimizing MFβ is similar to optimizing Fβ in binary classification.\nThere are also micro-F-measures for multilabel classification. 
They correspond to Fβ-measures for a new binary classification problem over X × L, in which one maps a multilabel classifier h : X → Y (Y is here the power set of L) to the following binary classifier h̃ : X × L → {0, 1}: we have h̃(x, k) = 1 if k ∈ h(x), and 0 otherwise. The micro-Fβ-measure, written as a function of an error profile e and denoted by mFβ(e), is the Fβ-score of h̃ and can be written as:\n\n(multilabel–micro) mFβ(e) = (1 + β²) ∑_{k=1}^{L} (Pk − e_{2k−1}) / ((1 + β²) ∑_{k=1}^{L} Pk + ∑_{k=1}^{L} (e_{2k} − e_{2k−1})) .\n\nThis function is also fractional-linear, and thus pseudo-linear as a function of e.\nA third notion of Fβ-measure can be used in multilabel classification, namely the per-instance Fβ studied e.g. by [16, 17, 6, 4, 5]. The per-instance Fβ is defined as the average, over instances x, of the binary Fβ-measure for the problem of classifying labels given x. This corresponds to a specific Fβ-maximization problem for each x and is not directly captured by our framework, because we would need to solve a different cost-sensitive classification problem for each instance.\n\nMulticlass Classification The last example we take is from multiclass classification. It differs from multilabel classification in that a single class must be predicted for each example. This restriction imposes strong global constraints that make the task significantly harder. As for the multilabel case, there are many definitions of F-measures for multiclass classification, and in fact several definitions for the micro-F-measure itself. We will focus on the following one, which is used in information extraction (e.g. in the BioNLP challenge [12]). 
Given L class labels, we will assume that label 1 corresponds to a "default" class, the prediction of which is considered as not important. In information extraction, the "default" class corresponds to the (majority) case where no information should be extracted. Then, a false negative is an example (x, y) such that y ≠ 1 and h(x) ≠ y, while a false positive is an example (x, y) such that y = 1 and h(x) ≠ y. This micro-F-measure, denoted mcFβ, can be written as:\n\n(multiclass–micro) mcFβ(e) = (1 + β²)(1 − P1 − ∑_{k=2}^{L} e_{2k−1}) / ((1 + β²)(1 − P1) − ∑_{k=2}^{L} e_{2k−1} + e1) .\n\nOnce again, this kind of micro-Fβ-measure is pseudo-linear with respect to e.\n\nRemark 2 (Training and generalization performance) Our results concern a fixed distribution μ, while the goal is to find a classifier with high generalization performance. With our notation, our results apply to μ = P or μ = P̂, and our implicit goal is to perform empirical risk minimization-type learning, that is, to find a classifier with a high value of F^P_β(E^P(h)) by maximizing its empirical counterpart F^P̂_β(E^P̂(h)) (the superscripts here make the underlying distribution explicit).\n\nRemark 3 (Expected Utility Maximization (EUM) vs Decision-Theoretic Approach (DTA)) Nan et al. [15] propose two possible definitions of the generalization performance in terms of Fβ-scores. In the first framework, called EUM, the population-level Fβ-score is defined as the Fβ-score of the population-level error profile. 
In contrast, the Decision-Theoretic Approach defines the population-level Fβ-score as the expected value of the Fβ-score over the distribution of test sets. The EUM definition of generalization performance matches our framework using μ = P: in that sense, we follow the EUM framework. Nonetheless, regardless of how we define the generalization performance, our results can be used to maximize the empirical value of the Fβ-score.\n\n3 Optimizing F-Measures by Reduction to Cost-Sensitive Classification\n\nThe F-measures presented above are non-linear aggregations of false negative/positive probabilities that cannot be written in the usual expected loss minimization framework; usual learning algorithms are thus, intrinsically, not designed to optimize this kind of performance metric.\nIn this section, we show in Proposition 4 that the optimal classifier for a cost-sensitive classification problem with label-dependent costs [7, 24] is also an optimal classifier for the pseudo-linear F-measures (within a specific, yet arbitrary classifier set H). In cost-sensitive classification, each entry of the error profile is weighted by a non-negative cost, and the goal is to minimize the weighted average error. Efficient, consistent algorithms exist for such cost-sensitive problems [1, 22, 21]. Even though the costs corresponding to the optimal F-score are not known a priori, we show in Proposition 5 that we can approximate the optimal classifier with approximate costs. These costs, explicitly expressed in terms of the optimal F-score, motivate a practical algorithm.\n\n3.1 Reduction to Cost-Sensitive Classification\n\nIn this section, F : D ⊂ R^d → R is a fixed pseudo-linear function. We denote by a : R → R^d the function mapping values of F to the corresponding hyperplane of Theorem 1. We assume that the distribution μ is fixed, as well as the (arbitrary) set of classifiers H. We denote by E(H) the closure of the image of H under E, i.e. E(H) = cl({E(h), h ∈ H}) (the closure ensures that E(H) is compact and that minima/maxima are well-defined), and we assume E(H) ⊆ D. Finally, for the sake of the discussion of cost-sensitive classification, we assume that a(t) has non-negative entries for any t, that is, lower values of errors entail higher values of F.\n\nProposition 4 Let F⋆ = max_{e′∈E(H)} F(e′). We have: e ∈ argmin_{e′∈E(H)} ⟨a(F⋆), e′⟩ ⇔ F(e) = F⋆.\n\nProof Let e⋆ ∈ argmax_{e′∈E(H)} F(e′), and let a⋆ = a(F(e⋆)) = a(F⋆). We first notice that pseudo-linearity implies that the set of e ∈ D such that ⟨a⋆, e⟩ = ⟨a⋆, e⋆⟩ corresponds to the level set {e ∈ D | F(e) = F(e⋆) = F⋆}. Thus, we only need to show that e⋆ is a minimizer of e′ ↦ ⟨a⋆, e′⟩ in E(H). To see this, we notice that pseudo-linearity implies\n\n∀e′ ∈ D, F(e⋆) ≥ F(e′) ⇒ ⟨a⋆, e⋆⟩ ≤ ⟨a⋆, e′⟩ ,\n\nfrom which we immediately get e⋆ ∈ argmin_{e′∈E(H)} ⟨a⋆, e′⟩ since e⋆ maximizes F in E(H). □\n\nThe proposition shows that a(F⋆) are the costs that should be assigned to the error profile in order to find the F-optimal classifier in H. Hence maximizing F amounts to minimizing ⟨a(F⋆), E(h)⟩ with respect to h, that is, amounts to solving a cost-sensitive classification problem. The costs a(F⋆) 
The costs a(cid:0)F (cid:63)(cid:1)\n\nare, however, not known a priori (because F (cid:63) is not known in general). The following result shows\nthat having only approximate costs is suf\ufb01cient to have an approximately F -optimal solution, which\ngives us the main step towards a practical solution:\nProposition 5 Let \u03b50 \u2265 0 and \u03b51 \u2265 0, and assume that there exists \u03a6 > 0 such that for all\ne, e(cid:48) \u2208 E (H) satisfying F (e(cid:48)) > F (e), we have:\n\n(1)\nThen, let us take e(cid:63) \u2208 argmaxe(cid:48)\u2208E(H) F (e(cid:48)), and denote a(cid:63) = a(F (e(cid:63))). Let furthermore g \u2208 Rd\nand h \u2208 H satisfying the two following conditions:\n\nF (e(cid:48)) \u2212 F (e) \u2264 \u03a6(cid:104)a(F (e(cid:48))) , e \u2212 e(cid:48)(cid:105) .\n\n+\n\n(i) (cid:107) g \u2212 a(cid:63) (cid:107)2\u2264 \u03b50\n\n(cid:104)g, e(cid:48)(cid:105) + \u03b51 .\nF (E(h)) \u2265 F (e(cid:63)) \u2212 \u03a6 \u00b7 (2\u03b50M + \u03b51) , where M = max\ne(cid:48)\u2208E(H)\n\n(ii) (cid:104)g, E(h)(cid:105) \u2264 min\ne(cid:48)\u2208E(H)\n\n(cid:107) e(cid:48) (cid:107)2.\n\nWe have:\n\nProof Let e(cid:48) \u2208 E (H). By writing (cid:104)g, e(cid:48)(cid:105) = (cid:104)g \u2212 a(cid:63), e(cid:48)(cid:105) + (cid:104)a(cid:63), e(cid:48)(cid:105) and applying Cauchy-Schwarz\ninequality to (cid:104)g \u2212 a(cid:63), e(cid:48)(cid:105) we get (cid:104)g, e(cid:48)(cid:105) \u2264 (cid:104)a(cid:63), e(cid:48)(cid:105) + \u03b50M using condition (i). Consequently\n\nmin\n\ne(cid:48)\u2208E(H)\n\n(cid:104)g, e(cid:48)(cid:105) \u2264 min\ne(cid:48)\u2208E(H)\n\n(cid:104)a(cid:63), e(cid:48)(cid:105) + \u03b50M = (cid:104)a(cid:63), e(cid:63)(cid:105) + \u03b50M\n\n(2)\n\nWhere the equality is given by Proposition 4. Now, let e = E(h), assuming that classi\ufb01er h satis\ufb01es\ncondition (ii). 
Using ⟨a⋆, e⟩ = ⟨a⋆ − g, e⟩ + ⟨g, e⟩ and Cauchy-Schwarz again, we obtain:\n\n⟨a⋆, e⟩ ≤ ⟨g, e⟩ + ε0M ≤ min_{e′∈E(H)} ⟨g, e′⟩ + ε1 + ε0M ≤ ⟨a⋆, e⋆⟩ + ε1 + 2ε0M ,\n\nwhere the first inequality comes from condition (ii) and the second inequality comes from (2). The final result is obtained by plugging this inequality into (1). □\n\nBefore discussing this result, we first give explicit values of a and Φ for pseudo-linear F-measures:\n\nProposition 6 Fβ, mFβ and mcFβ defined in Section 2 satisfy the conditions of Proposition 5 with:\n\n(binary) Fβ : Φ = 1/(β²P1) and a : t ∈ [0, 1] ↦ (1 + β² − t, t, 0, 0) ;\n\n(multilabel–micro) mFβ : Φ = 1/(β² ∑_{k=1}^{L} Pk) and a_i(t) = 1 + β² − t if i is odd, a_i(t) = t if i is even ;\n\n(multiclass–micro) mcFβ : Φ = 1/(β²(1 − P1)) and a_i(t) = 1 + β² − t if i is odd and i ≠ 1, a_i(t) = t if i = 1, and a_i(t) = 0 otherwise.\n\nThe proof is given in the longer version of the paper, and the values of Φ and a are valid for any set of classifiers H. Note that the result on Fβ for binary classification can be used for the macro-Fβ-measure in multilabel classification when training one binary classifier per label. Also, the relative costs (1 + β² − t) for false negatives and t for false positives imply that, for the F1-measure, the optimal classifier is the solution of the cost-sensitive binary problem with costs (1 − F⋆/2, F⋆/2). 
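Propositions 4 and 5 can be checked numerically for the binary Fβ-measure; a small sketch with made-up numbers (P1 and the random error profiles are arbitrary, not from the paper): over a finite stand-in for E(H), the F-maximizer also minimizes the linear cost ⟨a(F⋆), ·⟩, and inequality (1) holds with Φ = 1/(β²P1).

```python
import numpy as np

rng = np.random.default_rng(0)
P1, b2 = 0.3, 1.0                      # class-1 marginal and beta^2 (illustrative)
Phi = 1.0 / (b2 * P1)                  # Proposition 6, binary F_beta

def F(e):
    # binary F_beta of an error profile e = (FN1, FP1)
    return (1 + b2) * (P1 - e[0]) / ((1 + b2) * P1 - e[0] + e[1])

def a(t):
    # costs of Proposition 6, restricted to the (FN1, FP1) coordinates
    return np.array([1 + b2 - t, t])

# a finite stand-in for the achievable set E(H): FN1 <= P1, FP1 <= 1 - P1
E = np.column_stack([rng.uniform(0, P1, 500), rng.uniform(0, 1 - P1, 500)])
scores = np.apply_along_axis(F, 1, E)
f_star = scores.max()

# Proposition 4: the F-optimal profile minimizes the cost <a(F*), e>
assert np.argmax(scores) == np.argmin(E @ a(f_star))

# inequality (1): F(e') - F(e) <= Phi * <a(F(e')), e - e'> whenever F(e') > F(e)
for e, ep in zip(E[:100], E[100:200]):
    if F(ep) > F(e):
        assert F(ep) - F(e) <= Phi * a(F(ep)) @ (e - ep) + 1e-12
```

Both assertions follow from the fractional-linear form of Fβ, since ⟨a(t), e⟩ differs from its level-set constant by the denominator of Fβ times (F(e) − t).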
If we take H as the set of all measurable functions, the Bayes-optimal classifier for this cost is to predict class 1 when μ(y = 1|x) ≥ F⋆/2 (see e.g. [22]). Our proposition thus extends this known result [13] to the non-asymptotic regime and to an arbitrary set of classifiers.\n\n3.2 Practical Algorithm\n\nOur results suggest that the optimization of pseudo-linear F-measures should wrap cost-sensitive classification algorithms, used in an inner loop, within an outer loop that sets the appropriate costs. In practice, since the function a : [0, 1] → R^d, which assigns costs to probabilities of error, is Lipschitz-continuous (with constant 2 in our examples), it is sufficient to discretize the interval [0, 1] into a set of evenly spaced values {t1, ..., tC} (say, t_{j+1} − t_j = ε0/2) to obtain an ε0-cover {a(t1), ..., a(tC)} of the possible costs. Using the approximation guarantee of Proposition 5, learning a cost-sensitive classifier for each a(tj) and selecting the one with the best F-measure a posteriori is sufficient to obtain an MΦ(2ε0 + ε1)-optimal solution, where ε1 is the approximation guarantee of the cost-sensitive classification algorithm.\nThis meta-algorithm can be instantiated with any learning algorithm and different F-measures. In our experiments of Section 4, we first use it with cost-sensitive binary classification algorithms: Support Vector Machines (SVMs) and logistic regression, both with asymmetric costs [2], to optimize the F1-measure in binary classification and the macro-F1-score in multilabel classification (training one-vs-rest classifiers). Musicant et al. [14] also advocated SVMs with asymmetric costs for F1-measure optimization in binary classification. 
However, their argument, specific to SVMs, is not methodological but technical (a relaxation of the maximization problem).\n\n4 Experiments\n\nThe goal of this section is to illustrate the algorithms suggested by the theory. First, our results suggest that cost-sensitive classification algorithms may be preferable to the more usual probability thresholding method. We compare cost-sensitive classification, as implemented by SVMs with asymmetric costs, to thresholded logistic regression, with linear classifiers. Besides, the structured SVM approach to F1-measure maximization, SVMperf [11], provides another baseline. For completeness, we also report results for thresholded SVMs, cost-sensitive logistic regression, and for the thresholded versions of SVMperf and the cost-sensitive algorithms (a thresholded algorithm means that the decision threshold is tuned a posteriori by maximizing the F1-score on the validation set).\nCost-sensitive SVMs and logistic regression (LR) differ in the loss they optimize (weighted hinge loss for SVMs, weighted log-loss for LR), and even though both losses are calibrated in the cost-sensitive setting (that is, converging toward a Bayes-optimal classifier as the number of examples and the capacity of the class of functions grow to infinity) [22], they behave differently on finite datasets or with restricted classes of functions. We may also note that, asymptotically, the Bayes-classifier for a cost-sensitive binary classification problem is a classifier which thresholds the posterior probability of being in class 1. Thus, all methods but SVMperf are asymptotically equivalent, and our goal here is to analyze their non-asymptotic behavior on a restricted class of functions.\n\nFigure 1: Decision boundaries for the galaxy dataset before and after thresholding the classifier scores of SVMperf (dotted, blue), cost-sensitive SVM (dot-dashed, cyan), logistic regression (solid, red), and cost-sensitive logistic regression (dashed, green). The horizontal black dotted line is an optimal decision boundary. (Panels: before thresholding, after thresholding; axes x1, x2.)\n\nAlthough our theoretical developments do not indicate any need to threshold the scores of classifiers, the practical benefits of a post-hoc adjustment of these scores can be important in terms of F1-measure maximization. The reason is that the decision threshold given by cost-sensitive SVMs or logistic regression might not be optimal in terms of the cost-sensitive 0/1-error, as already noted in cost-sensitive learning scenarios [10, 2]. This is illustrated in Figure 1, on the didactic "Galaxy" distribution, consisting of four clusters of 2D examples, indexed by z ∈ {1, 2, 3, 4}, with prior probabilities P(z = 1) = 0.01, P(z = 2) = 0.1, P(z = 3) = 0.001, and P(z = 4) = 0.889, and respective class-conditional probabilities P(y = 1|z = 1) = 0.9, P(y = 1|z = 2) = 0.09, P(y = 1|z = 3) = 0.9, and P(y = 1|z = 4) = 0. We drew a very large sample (100,000 examples) from the distribution, whose optimal F1-measure is 67.5%. Without tuning the decision thresholds of the classifiers, the best F1-measure among them is 55.3%, obtained by SVMperf, whereas tuning thresholds enables SVMperf and the cost-sensitive SVM to reach the optimal F1-measure. On the other hand, LR is severely affected by the non-linearity of the level sets of the posterior probability distribution, and does not reach this limit (best F1-score of 48.9%). 
Note also that even with this very large sample size, the SVM and LR classifiers are very different.\nThe datasets we use are Adult (binary classification, 32,561/16,281 train/test ex., 123 features), Letter (single-label multiclass, 26 classes, 20,000 ex., 16 features), and two text datasets: the 20 Newsgroups dataset News20¹ (single-label multiclass, 20 classes, 15,935/3,993 train/test ex., 62,061 features, scaled version) and Siam² (multilabel, 22 classes, 21,519/7,077 train/test ex., 30,438 features). All datasets except for News20 and Siam are obtained from the UCI repository³.\nFor each experiment, the training set was split at random, keeping 1/3 for the validation set used to select all hyper-parameters, based on the maximization of the F1-measure on this set. For datasets that do not come with a separate test set, the data was first split to keep 1/4 for test. The algorithms have from one to three hyper-parameters: (i) all algorithms are run with L2 regularization, with a regularization parameter C ∈ {2^−6, 2^−5, ..., 2^6}; (ii) for the cost-sensitive algorithms, the cost for false negatives (relative to false positives) is chosen in {(2 − t)/t, t ∈ {0.1, 0.2, ..., 1.9}}, following Proposition 6⁴; (iii) for the thresholded algorithms, the threshold is chosen among all the scores of the validation examples.\n\n1 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#news20\n2 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html#siam-competition2007\n3 https://archive.ics.uci.edu/ml/datasets.html\n4 We take t greater than 1 in case the training asymmetry would be different from the true asymmetry [2].\n\nTable 1: (macro-)F1-measures (in %). 
Options: T stands for thresholded, CS for cost-sensitive and CS&T for cost-sensitive and thresholded.

            Baseline
            SVMperf   SVMperf    SVM     SVM     SVM      LR      LR      LR
Options        –         T        T       CS     CS&T      T      CS     CS&T
Adult        67.3      67.8     67.9    67.9    67.8     67.9    67.8    67.8
Letter       52.5      61.2     59.9    63.2    63.8     60.8    63.1    62.1
News20       59.5      81.2     81.1    81.7    82.4     78.7    82.0    81.5
Siam         49.4      53.9     53.8    51.9    54.9     52.8    52.6    54.4

The library LIBLINEAR [9] was used to implement the SVMs⁵ and Logistic Regression (LR). A constant feature with value 100 was added to each dataset to mimic an unregularized offset.
The results, averaged over five random splits, are reported in Table 1. As expected, the differences between methods are less extreme than on the artificial “Galaxy” dataset. The Adult dataset is an example where all methods perform nearly identically; the surrogate loss used in practice seems unimportant. On the other datasets, we observe that thresholding has a rather large impact, especially for SVMperf; this is also true for the other classifiers: the unthresholded SVM and LR with symmetric costs (unreported here) were not competitive either. The cost-sensitive (thresholded) SVM outperforms all other methods, as suggested by the theory. It is probably the method of choice when predictive performance is a must.
On these datasets, thresholded LR behaves reasonably well considering its relatively low computational cost. Indeed, LR is much faster than SVM: in their thresholded cost-sensitive versions, the timings for LR on the News20 and Siam datasets are 6,400 and 8,100 seconds, versus 255,000 and 147,000 seconds for SVM, respectively. Note that we did not try to optimize the running time in our experiments.
In particular, considerable time savings could be achieved by using warm-start.

5 Conclusion

We presented an analysis of F-measures, leveraging the property of pseudo-linearity of some of them to obtain a strong non-asymptotic reduction to cost-sensitive classification. The results hold for any dataset and for any class of functions. Our experiments on linear functions confirm the theory, by demonstrating the practical interest of using cost-sensitive classification algorithms rather than simple probability thresholding. However, they also reveal that, for F-measure maximization, thresholding the solutions provided by cost-sensitive algorithms further improves performance.
Algorithmically and empirically, we only explored the simplest case of our result (Fβ-measure in binary classification and macro-Fβ-measure in multilabel classification), but much more remains to be done. First, the strategy we use for searching the optimal costs is a simple uniform discretization procedure, and more efficient exploration techniques could probably be developed. Second, algorithms for the optimization of the micro-Fβ-measure in multilabel classification have received interest recently as well [8, 19], but are for now limited to the selection of a threshold after any kind of training. New methods for that measure may be designed from our reduction; we also believe that our result can lead to progress towards optimizing the micro-Fβ-measure in multiclass classification.

Acknowledgments

This work was carried out and funded in the framework of the Labex MS2T. It was supported by the Picardy Region and the French Government, through the program “Investments for the future” managed by the National Agency for Research (Reference ANR-11-IDEX-0004-02).

References

[1] N. Abe, B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. In W. Kim, R. Kohavi, J.
Gehrke, and W. DuMouchel, editors, KDD, pages 3–11. ACM, 2004.

[2] F. R. Bach, D. Heckerman, and E. Horvitz. Considering cost asymmetry in learning classifiers. J. Mach. Learn. Res., 7:1713–1741, December 2006.

⁵ The maximum number of iterations for SVMs was set to 50,000 instead of the default 1,000.

[3] A. Cambini and L. Martein. Generalized Convexity and Optimization, volume 616 of Lecture Notes in Economics and Mathematical Systems. Springer, 2009.

[4] W. Cheng, K. Dembczynski, E. Hüllermeier, A. Jaroszewicz, and W. Waegeman. F-measure maximization in topical classification. In J. Yao, Y. Yang, R. Slowinski, S. Greco, H. Li, S. Mitra, and L. Polkowski, editors, RSCTC, volume 7413 of Lecture Notes in Computer Science, pages 439–446. Springer, 2012.

[5] K. Dembczynski, A. Jachnik, W. Kotlowski, W. Waegeman, and E. Hüllermeier. Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1130–1138. JMLR Workshop and Conference Proceedings, May 2013.

[6] K. Dembczynski, W. Waegeman, W. Cheng, and E. Hüllermeier. An exact algorithm for F-measure maximization. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, NIPS, pages 1404–1412, 2011.

[7] C. Elkan. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, volume 17, pages 973–978, 2001.

[8] R. E. Fan and C. J. Lin. A study on threshold selection for multi-label classification. Technical report, National Taiwan University, 2007.

[9] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification.
The Journal of Machine Learning Research, 9:1871–1874, 2008.

[10] Y. Grandvalet, J. Mariéthoz, and S. Bengio. A probabilistic interpretation of SVMs with an application to unbalanced classification. In NIPS, 2005.

[11] T. Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning, pages 377–384. ACM Press, 2005.

[12] J.-D. Kim, Y. Wang, and Y. Yasunori. The GENIA event extraction shared task, 2013 edition – overview. In Proceedings of the BioNLP Shared Task 2013 Workshop, pages 8–15, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.

[13] Z. C. Lipton, C. Elkan, and B. Naryanaswamy. Optimal thresholding of classifiers to maximize F1 measure. In T. Calders, F. Esposito, E. Hüllermeier, and R. Meo, editors, Machine Learning and Knowledge Discovery in Databases, volume 8725 of Lecture Notes in Computer Science, pages 225–239. Springer, 2014.

[14] D. R. Musicant, V. Kumar, and A. Ozgur. Optimizing F-measure with support vector machines. In Proceedings of the FLAIRS Conference, pages 356–360, 2003.

[15] Y. Nan, K. M. A. Chai, W. S. Lee, and H. L. Chieu. Optimizing F-measures: A tale of two approaches. In ICML. icml.cc / Omnipress, 2012.

[16] J. Petterson and T. S. Caetano. Reverse multi-label learning. In NIPS, volume 1, pages 1912–1920, 2010.

[17] J. Petterson and T. S. Caetano. Submodular multi-label learning. In NIPS, pages 1512–1520, 2011.

[18] I. Pillai, G. Fumera, and F. Roli. F-measure optimisation in multi-label classifiers. In ICPR, pages 2424–2427. IEEE, 2012.

[19] I. Pillai, G. Fumera, and F. Roli. Threshold optimisation for multi-label classifiers. Pattern Recogn., 46(7):2055–2065, July 2013.

[20] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, Newton, MA, USA, 2nd edition, 1979.

[21] C. Scott.
Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958–992, 2012.

[22] I. Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, 2007.

[23] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3):1–13, 2007.

[24] Z.-H. Zhou and X.-Y. Liu. On multi-class cost-sensitive learning. Computational Intelligence, 26(3):232–257, 2010.