{"title": "An Exact Algorithm for F-Measure Maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 1404, "page_last": 1412, "abstract": "The F-measure, originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure remains a statistically and computationally challenging problem, since no closed-form maximizer exists. Current algorithms are approximate and typically rely on additional assumptions regarding the statistical distribution of the binary response variables. In this paper, we present an algorithm which is not only computationally efficient but also exact, regardless of the underlying distribution. The algorithm requires only a quadratic number of parameters of the joint distribution (with respect to the number of binary responses). We illustrate its practical performance by means of experimental results for multi-label classification.", "full_text": "An Exact Algorithm for F-Measure Maximization\n\nKrzysztof Dembczy\u0144ski\nInstitute of Computing Science\nPozna\u0144 University of Technology\nPozna\u0144, 60-695 Poland\nkdembczynski@cs.put.poznan.pl\n\nWillem Waegeman\nMathematical Modelling, Statistics and Bioinformatics, Ghent University\nGhent, 9000 Belgium\nwillem.waegeman@ugent.be\n\nWeiwei Cheng\nMathematics and Computer Science\nPhilipps-Universit\u00e4t Marburg\nMarburg, 35032 Germany\ncheng@mathematik.uni-marburg.de\n\nEyke H\u00fcllermeier\nMathematics and Computer Science\nPhilipps-Universit\u00e4t Marburg\nMarburg, 35032 Germany\neyke@mathematik.uni-marburg.de\n\nAbstract\n\nThe F-measure, originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. 
Optimizing this measure remains a statistically and computationally challenging problem, since no closed-form maximizer exists. Current algorithms are approximate and typically rely on additional assumptions regarding the statistical distribution of the binary response variables. In this paper, we present an algorithm which is not only computationally efficient but also exact, regardless of the underlying distribution. The algorithm requires only a quadratic number of parameters of the joint distribution (with respect to the number of binary responses). We illustrate its practical performance by means of experimental results for multi-label classification.\n\n1 Introduction\n\nWhile being rooted in information retrieval [1], the so-called F-measure is nowadays routinely used as a performance metric for different types of prediction problems, including binary classification, multi-label classification (MLC), and certain applications of structured output prediction, like text chunking and named entity recognition. Compared to measures like error rate in binary classification and Hamming loss in MLC, it enforces a better balance between performance on the minority and the majority class and is therefore more suitable in the case of imbalanced data. Given a prediction $h = (h_1, \\ldots, h_m) \\in \\{0, 1\\}^m$ of an m-dimensional binary label vector $y = (y_1, \\ldots, y_m)$ (e.g., the class labels of a test set of size m in binary classification or the label vector associated with a single instance in MLC), the F-measure is defined as follows:\n\n$$F(y, h) = \\frac{2 \\sum_{i=1}^m y_i h_i}{\\sum_{i=1}^m y_i + \\sum_{i=1}^m h_i} \\in [0, 1], \\qquad (1)$$\n\nwhere 0/0 = 1 by definition. This measure essentially corresponds to the harmonic mean of precision prec and recall rec:\n\n$$\\mathrm{prec}(y, h) = \\frac{\\sum_{i=1}^m y_i h_i}{\\sum_{i=1}^m h_i}, \\qquad \\mathrm{rec}(y, h) = \\frac{\\sum_{i=1}^m y_i h_i}{\\sum_{i=1}^m y_i}.$$\n\nOne can generalize the F-measure to a weighted harmonic average of these two values, but for the sake of simplicity, we stick to the unweighted mean, which is often referred to as the F1-score or the F1-measure.\n\nDespite its popularity in experimental settings, only a few methods for training classifiers that directly optimize the F-measure have been proposed so far. In binary classification, the existing algorithms are extensions of support vector machines [2, 3] or logistic regression [4]. However, the most popular methods, including [5], rely on explicit threshold adjustment. Some algorithms have also been proposed for structured output prediction [6, 7, 8] and MLC [9, 10, 11]. In these two application domains, three different aggregation schemes of the F-measure can be distinguished, namely the instance-wise, the micro-, and the macro-averaging. One should carefully distinguish these versions, as algorithms optimized with a given objective usually perform suboptimally for other (target) evaluation measures.\nAll the above algorithms intend to optimize the F-measure during the training phase. 
In contrast, in this article we investigate the orthogonal problem of inference from a probabilistic model. Modeling the ground-truth as a random variable Y, i.e., assuming an underlying probability distribution p(Y) on $\\{0, 1\\}^m$, the prediction $h^*_F$ that maximizes the expected F-measure is given by\n\n$$h^*_F = \\arg\\max_{h \\in \\{0,1\\}^m} \\mathbb{E}_{y \\sim p(Y)} [F(y, h)] = \\arg\\max_{h \\in \\{0,1\\}^m} \\sum_{y \\in \\{0,1\\}^m} p(Y = y) \\, F(y, h). \\qquad (2)$$\n\nAs discussed in Section 2, this setting was mainly examined before by [12], under the assumption of independence of the $Y_i$, i.e., $p(Y = y) = \\prod_{i=1}^m p_i^{y_i} (1 - p_i)^{1 - y_i}$ with $p_i = p(Y_i = 1)$. Indeed, finding the maximizer (2) is in general a difficult problem. Apparently, there is no closed-form expression, and a brute-force search is infeasible (it would require checking all $2^m$ combinations of prediction vector h). At first sight, it also seems that information about the entire joint distribution p(Y) is needed to maximize the F-measure. Yet, as will be shown in this paper, the problem can be solved more efficiently. In Section 3, we present a general algorithm for maximizing the F-measure that requires only $m^2 + 1$ parameters of the joint distribution. If these parameters are given, the exact solution can be obtained in time $o(m^3)$. This result holds regardless of the underlying distribution. In particular, unlike algorithms such as [12], we do not require independence of the binary response variables (labels). While being natural for problems like binary classification, this assumption is indeed not tenable in domains like MLC and structured output prediction. A discussion of existing methods for F-measure maximization, along with results indicating their shortcomings, is provided in Section 2. An experimental comparison in the context of MLC is presented in Section 4.\n\n2 Existing Algorithms for F-Measure Maximization\n\nCurrent algorithms for solving (2) make different assumptions to simplify the problem. First of all, the algorithms operate on a constrained hypothesis space, sometimes justified by theoretical arguments. Secondly, they guarantee optimality only for specific distributions p(Y).\n\n2.1 Algorithms Based on Label Independence\n\nBy assuming independence of the random variables $Y_1, \\ldots, Y_m$, the optimization problem (2) can be substantially simplified. It has been shown independently in [13] and [12] that the optimal solution always contains the labels with the highest marginal probabilities $p_i$, or no labels at all. As a consequence, only a few hypotheses h (m + 1 instead of $2^m$) need to be examined, and the computation of the expected F-measure can be performed in an efficient way.\nLewis [13] showed that the expected F-measure can be approximated by the following expression under the assumption of independence:^1\n\n$$\\mathbb{E}_{y \\sim p(Y)} [F(y, h)] \\simeq \\begin{cases} \\prod_{i=1}^m (1 - p_i), & \\text{if } h = 0 \\\\\\\\ \\frac{2 \\sum_{i=1}^m p_i h_i}{\\sum_{i=1}^m p_i + \\sum_{i=1}^m h_i}, & \\text{if } h \u2260 0 \\end{cases}$$\n\nThis approximation is exact for h = 0, while for h \u2260 0, an upper bound of the error can easily be determined [13].\n\n^1 In the following, we denote 0 and 1 as vectors containing all zeros and ones, respectively.\n\nJansche [12], however, has proposed an exact procedure, called maximum expected utility framework (MEUF), that takes marginal probabilities $p_1, p_2, \\ldots, p_m$ as inputs and solves (2) in time $O(m^4)$. He noticed that (2) can be solved via outer and inner maximization. 
Namely, (2) can be transformed into an inner maximization\n\n$$h^{(k)*} = \\arg\\max_{h \\in H_k} \\mathbb{E}_{y \\sim p(Y)} [F(y, h)], \\qquad (3)$$\n\nwhere $H_k = \\{h \\in \\{0, 1\\}^m \\mid \\sum_{i=1}^m h_i = k\\}$, followed by an outer maximization\n\n$$h^*_F = \\arg\\max_{h \\in \\{h^{(0)*}, \\ldots, h^{(m)*}\\}} \\mathbb{E}_{y \\sim p(Y)} [F(y, h)]. \\qquad (4)$$\n\nThe outer maximization (4) can be done by simply checking all m + 1 possibilities. The main effort is then devoted to solving the inner maximization (3). According to Theorem 2.1, to solve (3) for a given k, we need to check only one vector h in which $h_i = 1$ for the k labels with highest marginal probabilities $p_i$. The remaining problem is the computation of the expected F-measure in (3). This expectation cannot be computed naively, as the sum is over exponentially many terms. But the F-measure is a function of integer counts that are bounded, so it can normally only assume a much smaller number of distinct values. The cardinality of its domain is indeed exponential in m, but the cardinality of its range is polynomial in m, so the expectation can be computed in polynomial time. As a result, Jansche [12] obtains a procedure that is cubic in m for computing (3). He also presents approximate variants of this procedure, reducing its complexity from cubic to quadratic or even to linear. The results of the quadratic-time approximation, according to [12], are almost indistinguishable in practice from the exact algorithm; but still the overall complexity of the approach is $O(m^3)$.\nIf the independence assumption is violated, the above methods may produce predictions that are far from the optimal one. The following result shows this concretely for the method of Jansche.^2\nProposition 2.1. 
Let $h_J$ be a vector of predictions obtained by MEUF, then the worst-case regret converges to one in the limit of m, i.e.,\n\n$$\\lim_{m \\to \\infty} \\sup_p \\left( \\mathbb{E}_Y \\left[ F(Y, h^*_F) - F(Y, h_J) \\right] \\right) = 1,$$\n\nwhere the supremum is taken over all possible distributions p(Y).\n\nAdditionally, one can easily construct families of probability distributions that obtain a relatively fast convergence rate as a function of m.\n\n^2 Some of the proofs have been attached to the paper as supplementary material and will also be provided later with the extended version of the paper.\n\n2.2 Algorithms Based on the Multinomial Distribution\n\nSolving (2) becomes straightforward in the case of a specific distribution in which the probability mass is distributed over vectors y containing only a single positive label, i.e., $\\sum_{i=1}^m y_i = 1$, corresponding to the multinomial distribution. This was studied in [14] in the setting of so-called non-deterministic classification.\nTheorem 2.2 (Del Coz et al. [14]). Denote by y(i) a vector for which $y_i = 1$ and all the other entries are zeros. Assume that p(Y) is a joint distribution such that $p(Y = y(i)) = p_i$. The maximizer $h^*_F$ of (2) consists of the k labels with the highest marginal probabilities, where k is the first integer for which\n\n$$\\sum_{j=1}^k p_j \\geq (1 + k) \\, p_{k+1};$$\n\nif there is no such integer, then h = 1.\n\n2.3 Algorithms Based on Thresholding on Ordered Marginal Probabilities\n\nSince all the methods so far rely on the fact that the optimal solution contains ones for the labels with the highest marginal probabilities (or consists of a vector of zeros), one may expect that thresholding on the marginal probabilities ($h_i = 1$ for $p_i \\geq \\theta$, and $h_i = 0$ otherwise) will provide a solution to (2) in general. Obviously, to find an optimal threshold $\\theta$, access to the entire joint distribution is needed. 
However, this is not the main problem here, since in the next section, we will show that only a polynomial number of parameters of the joint distribution is needed. What is more interesting is the observation that the F-maximizer is in general not consistent with the order of marginal label probabilities. In fact, the regret can be substantial, as shown by the following result.\nProposition 2.3. Let $h_T$ be a vector of predictions obtained by putting a threshold on sorted marginal probabilities in the optimal way, then the worst-case regret is lower bounded by\n\n$$\\sup_p \\left( \\mathbb{E}_Y \\left[ F(Y, h^*_F) - F(Y, h_T) \\right] \\right) \\geq \\max\\left(0, \\frac{1}{6} - \\frac{2}{m + 4}\\right),$$\n\nwhere the supremum is taken over all possible distributions p(Y).^3\n\n^3 Finding the exact value of the supremum is an interesting open question.\n\nThis is a rather surprising result in light of the existence of many algorithms that rely on finding a threshold for maximizing the F-measure [5, 9, 10]. While being justified by Theorems 2.1 and 2.3 for specific applications, this approach does not yield optimal predictions in general.\n\n3 An Exact Algorithm for F-Measure Maximization\n\nWe now introduce an exact and efficient algorithm for computing the F-maximizer without using any additional assumption on the probability distribution p(Y). While adopting the idea of decomposing the problem into an outer and an inner maximization, our algorithm differs from Jansche\u2019s in the way the inner maximization is solved. As a key element, we consider equivalence classes for the labels in terms of the number of ones in the vectors h and y. The optimization of the F-measure can be substantially simplified by using these equivalence classes, since h and y then only appear in the numerator of the objective function. First, we show that only $m^2 + 1$ parameters of the joint distribution p(Y) are needed to compute the F-maximizer.\n\nTheorem 3.1. Let $s_y = \\sum_{i=1}^m y_i$. The solution of (2) can be computed by solely using p(Y = 0) and the values of\n\n$$p_{is} = p(Y_i = 1, s_y = s), \\qquad i, s \\in \\{1, \\ldots, m\\},$$\n\nwhich constitute an $m \\times m$ matrix P.\n\nProof. The inner optimization problem (3) can be formulated as follows:\n\n$$h^{(k)*} = \\arg\\max_{h \\in H_k} \\mathbb{E}_{y \\sim p(Y)} [F(y, h)] = \\arg\\max_{h \\in H_k} \\sum_{y \\in \\{0,1\\}^m} p(y) \\frac{2 \\sum_{i=1}^m y_i h_i}{s_y + k}.$$\n\nThe sums can be swapped, resulting in\n\n$$h^{(k)*} = \\arg\\max_{h \\in H_k} 2 \\sum_{i=1}^m h_i \\sum_{y \\in \\{0,1\\}^m} \\frac{p(y) \\, y_i}{s_y + k}. \\qquad (5)$$\n\nFurthermore, one can sum up the probabilities p(y) for all y with an equal value of $s_y$. By using\n\n$$p_{is} = \\sum_{y \\in \\{0,1\\}^m : s_y = s} y_i \\, p(y),$$\n\none can transform (5) into the following expression:\n\n$$h^{(k)*} = \\arg\\max_{h \\in H_k} 2 \\sum_{i=1}^m h_i \\sum_{s=1}^m \\frac{p_{is}}{s + k}. \\qquad (6)$$\n\nAs a result, one does not need the whole distribution to solve (3), but only the values of $p_{is}$, which can be given in the form of an $m \\times m$ matrix P with entries $p_{is}$. For the special case of k = 0, we have $h^{(k)*} = 0$ and $\\mathbb{E}_{y \\sim p(Y)} [F(y, 0)] = p(Y = 0)$.\n\nAlgorithm 1 General F-measure Maximizer\n\nINPUT: matrix P and probability p(Y = 0)\ndefine matrix W with elements given by Eq. 7;\ncompute F = PW;\nfor k = 1 to m do\n    solve the inner optimization problem (3) that can be reformulated as:\n        $h^{(k)*} = \\arg\\max_{h \\in H_k} 2 \\sum_{i=1}^m h_i f_{ik}$\n    by setting $h_i = 1$ for top k elements in the k-th column of matrix F, and $h_i = 0$ for the rest;\n    store a value of\n        $\\mathbb{E}_{y \\sim p(Y)} [F(y, h^{(k)*})] = 2 \\sum_{i=1}^m h_i^{(k)*} f_{ik}$;\nend for\nfor k = 0 take $h^{(k)*} = 0$, and $\\mathbb{E}_{y \\sim p(Y)} [F(y, 0)] = p(Y = 0)$;\nsolve the outer optimization problem (4):\n    $h^*_F = \\arg\\max_{h \\in \\{h^{(0)*}, \\ldots, h^{(m)*}\\}} \\mathbb{E}_{y \\sim p(Y)} [F(y, h)]$;\nreturn $h^*_F$ and $\\mathbb{E}_{y \\sim p(Y)} [F(y, h^*_F)]$;\n\nIf the matrix P is given, the solution of (2) is straightforward. To simplify the notation, let us introduce an $m \\times m$ matrix W with elements\n\n$$w_{sk} = \\frac{1}{s + k}, \\qquad s, k \\in \\{1, \\ldots, m\\}. \\qquad (7)$$\n\nThe resulting algorithm, referred to as General F-measure Maximizer (GFM), is summarized in Algorithm 1 and its time complexity is analyzed in the following theorem.\nTheorem 3.2. Algorithm 1 solves problem (2) in time $o(m^3)$ assuming that the matrix P of $m^2$ parameters and p(Y = 0) are given.\n\nProof. We can notice in (6) that, for a fixed s, the sum s + k assumes at most m + 1 values (it varies from s to s + m). By introducing the matrix W with elements (7), we can simplify (6) to\n\n$$h^{(k)*} = \\arg\\max_{h \\in H_k} 2 \\sum_{i=1}^m h_i f_{ik}, \\qquad (8)$$\n\nwhere $f_{ik}$ are elements of a matrix F = PW. To solve (8), it is enough to find the top k elements (i.e., the elements with the highest values) in the k-th column of matrix F, which can be carried out in linear time [15]. 
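
The steps of Algorithm 1 described so far can be illustrated with a small sketch. It is not the authors' implementation: the function names (f_measure, gfm, matrix_from_joint, brute_force) are hypothetical, the product F = PW is computed with plain loops, and the top k entries of a column are found by sorting rather than by linear-time selection, so the sketch is only meant for small m. The brute-force routine enumerates all hypotheses as in (2) and serves as a reference on toy distributions.

```python
from itertools import product


def f_measure(y, h):
    # F-measure of Eq. (1), with the convention 0/0 = 1.
    denom = sum(y) + sum(h)
    if denom == 0:
        return 1.0
    return 2.0 * sum(a * b for a, b in zip(y, h)) / denom


def gfm(P, p_zero):
    # P[i][s - 1] holds p(Y_i = 1, s_y = s); p_zero holds p(Y = 0).
    m = len(P)
    best_h, best_val = [0] * m, p_zero  # case k = 0
    for k in range(1, m + 1):
        # k-th column of F = P W, where w_sk = 1 / (s + k).
        col = [sum(P[i][s - 1] / (s + k) for s in range(1, m + 1))
               for i in range(m)]
        # Inner maximization: pick the k largest entries of the column.
        top = sorted(range(m), key=lambda i: col[i], reverse=True)[:k]
        val = 2.0 * sum(col[i] for i in top)
        # Outer maximization: keep the best of the m + 1 candidates.
        if val > best_val:
            best_val = val
            best_h = [1 if i in top else 0 for i in range(m)]
    return best_h, best_val


def matrix_from_joint(joint, m):
    # Illustrative helper: joint maps label tuples to probabilities.
    P = [[0.0] * m for _ in range(m)]
    for y, prob in joint.items():
        s = sum(y)
        for i in range(m):
            if y[i] == 1:
                P[i][s - 1] += prob
    return P, joint.get((0,) * m, 0.0)


def brute_force(joint, m):
    # Exhaustive maximization of Eq. (2); feasible only for tiny m.
    return max(
        sum(p * f_measure(y, h) for y, p in joint.items())
        for h in product((0, 1), repeat=m)
    )
```

On a toy joint distribution with dependent labels, the value returned by gfm can be compared against brute_force; the two should agree up to floating-point error, since the sketch evaluates the same m + 1 candidates that the decomposition into (3) and (4) prescribes.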
The solution of the outer optimization problem (4) is then straightforward. Consequently, the complexity of the algorithm is dominated by a matrix multiplication that is solved naively in $O(m^3)$, but faster algorithms working in $O(m^{2.376})$ are known [16].^4\n\n^4 The complexity of the Coppersmith-Winograd algorithm [16] is more of theoretical significance, since practically this algorithm outperforms the na\u00efve method only for huge matrices.\n\nLet us briefly discuss the properties of our algorithm in comparison to the other algorithms discussed in Section 2. First of all, MEUF is characterized by a much higher time complexity being $O(m^4)$ for the exact version. The recommended approximate variant reduces this complexity to $O(m^3)$. In turn, the GFM algorithm has a complexity of $o(m^3)$. In addition, let us also remark that this complexity can be further decreased if the number of distinct values of $s_y$ with non-zero probability mass is smaller than m.\nMoreover, the MEUF framework will not deliver an exact F-maximizer if the assumption of independence is violated. On the other hand, MEUF relies on a smaller number of parameters (m values representing marginal probabilities). Our approach needs $m^2 + 1$ parameters, but then computes the maximizer exactly. Since estimating a larger number of parameters is statistically more difficult, it is a priori unclear which method performs better in practice.\nOur algorithm can also be tailored for finding an optimal threshold. It is then simplified due to constraining the number of hypotheses. Instead of finding the top k elements in the k-th column, it is enough to rely on the order of the marginal probabilities $p_i = \\sum_{s=1}^m p_{is}$. As a result, there is no need to compute the entire matrix F; instead, only the elements that correspond to the k highest marginal probabilities for each column k are needed. Of course, the thresholding can be further simplified by verifying only a small number t < m of thresholds.\n\n4 Application of the Algorithm\n\nThe GFM algorithm can be used whenever an estimation of the distribution p(Y) or, alternatively, estimates of the matrix P and probability p(Y = 0) are available. In this section, we focus on the application of GFM in the multi-label setting. Thus, we consider the task of predicting a vector $y = (y_1, y_2, \\ldots, y_m) \\in \\{0, 1\\}^m$ given another vector $x = (x_1, x_2, \\ldots, x_n) \\in \\mathbb{R}^n$ as input attributes. To this end, we train a classifier h(x) on a training set $\\{(x_i, y_i)\\}_{i=1}^N$ and perform inference for a given test vector x so as to deliver an optimal prediction under the F-measure (1). Thus, we optimize the performance for each instance individually (instance-wise F-measure), in contrast to macro- and micro-averaging of the F-measure.\nWe follow an approach similar to Conditional Random Fields (CRFs) [17, 18], which estimates the joint conditional distribution p(Y | x). This approach has the additional advantage that one can easily sample from the estimated distribution. The underlying idea is to repeatedly apply the product rule of probability to the joint distribution of the labels $Y = (Y_1, \\ldots, Y_m)$:\n\n$$p(Y = y \\mid x) = \\prod_{k=1}^m p(Y_k = y_k \\mid x, y_1, \\ldots, y_{k-1}) \\qquad (9)$$\n\nThis approach, referred to as Probabilistic Classifier Chains (PCC), has proved to yield state-of-the-art performance in MLC [19]. Learning in this framework can be considered as a procedure that relies on constructing probabilistic classifiers for estimating $p(Y_k = y_k \\mid x, y_1, \\ldots, y_{k-1})$, independently for each k = 1, ..., m. 
To sample from the conditional joint distribution p(Y | x), one follows the chain and picks the value of label $y_k$ by tossing a biased coin with probabilities given by the k-th classifier. Based on a sample of observations generated in this way, our GFM algorithm can be used to perform the optimal inference under F-measure.\nIn the experiments, we train PCC by using linear regularized logistic regression. By plugging the log-linear model into (9), it can be shown that pairwise dependencies between labels $y_i$ and $y_j$ can be modeled. We tune the regularization parameter using 3-fold cross-validation. To perform inference, we draw for each test example a sample of 200 observations from the estimated conditional distribution. We then apply five inference methods. The first one (H) estimates marginal probabilities $p_i(x)$ and predicts 1 for labels with $\\hat{p}_i(x) \\geq 0.5$; this is an optimal strategy for the Hamming loss. The second method (MEUF) uses the estimates $\\hat{p}_i(x)$ for computing the F-measure by applying the MEUF method. If the labels are independent, this method computes the F-maximizer exactly. As a third method, we use the approximate cubic-time variant of MEUF with the parameters suggested in the original paper [12]. Finally, we use GFM and its variant that finds the optimal threshold (GFM-T).\nBefore showing the results of PCC on benchmark datasets, let us discuss results for two synthetic models, one with independent and another one with dependent labels. Plots and a description of the models are given in Fig. 1. As can be observed, MEUF performs the best for independent labels, while GFM approaches its performance if the sample size increases. This is coherent with our theoretical analysis, since GFM needs to estimate more parameters. However, in the case of dependent labels, MEUF performs poorly, even for a larger sample size, since the underlying assumption is not satisfied. 
Interestingly, both approximate variants perform very similarly to the original algorithms. We also see that GFM has a huge advantage over MEUF regarding the time complexity.^5\n\n^5 All the computations are performed on a typical desktop machine.\n\nFigure 1: The plots show the performance under the F-measure of the inference methods: GFM, its thresholding variant GFM-T, MEUF, and its approximate version MEUF Approx. Left: the performance as a function of sample size generated from independent distribution with $p_i = 0.12$ and m = 25 labels. Center: similarly as above, but the distribution is defined according to (9), where all $p(Y_i = y_i \\mid y_1, \\ldots, y_{i-1})$ are defined by logistic models with a linear part $-\\frac{1}{2}(i - 1) + \\sum_{j=1}^{i-1} y_j$. Right: running times as a function of the number of labels with a sample size of 200. All the results are averaged over 50 trials.\n\nTable 1: Experimental results on four benchmark datasets. For each dataset, we give the number of labels (m) and the size of training and test sets (in parentheses: training/test set). A \u201c-\u201d symbol indicates that an algorithm did not complete the computations in a reasonable amount of time (several days). 
In bold: the best results for a given dataset and performance measure.\n\nSCENE: m = 6 (1211/1169)\nMETHOD | HAMMING LOSS | MACRO-F | MICRO-F | F | INFERENCE TIME [S]\nPCC H | 0.1030 | 0.6673 | 0.6675 | 0.5779 | 0.969\nPCC GFM | 0.1341 | 0.7159 | 0.6915 | 0.7101 | 0.985\nPCC GFM-T | 0.1343 | 0.7154 | 0.6908 | 0.7094 | 1.031\nPCC MEUF APPROX. | 0.1323 | 0.7131 | 0.6910 | 0.6977 | 1.406\nPCC MEUF | 0.1323 | 0.7131 | 0.6910 | 0.6977 | 1.297\nBR | 0.1023 | 0.6591 | 0.6602 | 0.5542 | 1.125\nBR MEUF APPROX. | 0.1140 | 0.7048 | 0.6948 | 0.6468 | 1.579\nBR MEUF | 0.1140 | 0.7048 | 0.6948 | 0.6468 | 2.094\n\nYEAST: m = 14 (1500/917)\nMETHOD | HAMMING LOSS | MACRO-F | MICRO-F | F | INFERENCE TIME [S]\nPCC H | 0.2046 | 0.3633 | 0.6391 | 0.6160 | 3.704\nPCC GFM | 0.2322 | 0.4034 | 0.6554 | 0.6479 | 3.796\nPCC GFM-T | 0.2324 | 0.4039 | 0.6553 | 0.6476 | 3.907\nPCC MEUF APPROX. | 0.2295 | 0.4030 | 0.6551 | 0.6469 | 10.000\nPCC MEUF | 0.2292 | 0.4034 | 0.6557 | 0.6477 | 11.453\nBR | 0.1987 | 0.3349 | 0.6299 | 0.6039 | 0.640\nBR MEUF APPROX. | 0.2248 | 0.4098 | 0.6601 | 0.6527 | 7.110\nBR MEUF | 0.2263 | 0.4096 | 0.6591 | 0.6523 | 10.031\n\nENRON: m = 53 (1123/579)\nMETHOD | HAMMING LOSS | MACRO-F | MICRO-F | F | INFERENCE TIME [S]\nPCC H | 0.0471 | 0.1141 | 0.5185 | 0.4892 | 195.061\nPCC GFM | 0.0521 | 0.1618 | 0.5943 | 0.6006 | 194.889\nPCC GFM-T | 0.0521 | 0.1619 | 0.5948 | 0.6011 | 196.030\nPCC MEUF APPROX. | 0.0523 | 0.1612 | 0.5932 | 0.6007 | 1081.837\nPCC MEUF | 0.0523 | 0.1612 | 0.5932 | 0.6007 | 6676.145\nBR | 0.0468 | 0.1049 | 0.5223 | 0.4821 | 8.594\nBR MEUF APPROX. | 0.0513 | 0.1554 | 0.5969 | 0.5947 | 850.494\nBR MEUF | 0.0513 | 0.1554 | 0.5969 | 0.5947 | 7014.453\n\nMEDIAMILL: m = 101 (30999/12914)\nMETHOD | HAMMING LOSS | MACRO-F | MICRO-F | F | INFERENCE TIME [S]\nPCC H | 0.0304 | 0.0931 | 0.5577 | 0.5429 | 1405.772\nPCC GFM | 0.0348 | 0.1491 | 0.5849 | 0.5734 | 1420.663\nPCC GFM-T | 0.0348 | 0.1499 | 0.5854 | 0.5737 | 1464.147\nPCC MEUF APPROX. | 0.0350 | 0.1504 | 0.5871 | 0.5740 | 308582.019\nPCC MEUF | - | - | - | - | -\nBR | 0.0304 | 0.1429 | 0.5623 | 0.5462 | 207.655\nBR MEUF APPROX. | 0.3508 | 0.1917 | 0.5889 | 0.5744 | 258431.125\nBR MEUF | - | - | - | - | -\n\nThe results on four commonly used benchmark datasets^6 with known training and test sets are presented in Table 1, which also includes some basic statistics of these datasets. 
We additionally present results of the binary relevance (BR) approach which trains an independent classifier for each label (we used the same base learner as in PCC). We also apply the MEUF method on marginals delivered by BR. This is the best we can do if only marginals are known. From the results of the F-measure, we can clearly state that all approaches tailored for this measure obtain better results. However, there is no clear winner among them. It seems that in practical applications, the theoretical results concerning the worst-case scenario do not directly apply. Also, the number of parameters to be estimated does not play an important role. However, GFM drastically outperforms MEUF in terms of computational complexity. For the Mediamill dataset, the MEUF algorithm in its exact version did not complete the computations in a reasonable amount of time. The running times for the approximate version are already unacceptably high for this dataset.\nWe also report results for the Hamming loss, macro- and micro-averaging F-measure. We can see, for example, that approaches appropriate for Hamming loss obtain the best results regarding this measure. The macro and micro F-measure are presented mainly as a reference. The former is computed by averaging the F-measure label-wise, while the latter concatenates all test examples and computes a single value over all predictions. 
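
The difference between the three aggregation schemes can be made concrete with a short sketch. The function names f1 and averaged_f are hypothetical, and applying the 0/0 = 1 convention of (1) to the macro and micro variants as well is an assumption made here for simplicity; evaluation toolkits may treat empty labels differently.

```python
def f1(tp, pos_true, pos_pred):
    # F-measure from counts, using the convention 0/0 = 1 of Eq. (1).
    denom = pos_true + pos_pred
    return 1.0 if denom == 0 else 2.0 * tp / denom


def averaged_f(Y, H):
    # Y, H: lists of binary label vectors, one row per test example.
    n, m = len(Y), len(Y[0])
    tp = [[y * h for y, h in zip(yr, hr)] for yr, hr in zip(Y, H)]
    # Instance-wise: one F value per example, averaged over examples.
    instance = sum(
        f1(sum(tp[r]), sum(Y[r]), sum(H[r])) for r in range(n)) / n
    # Macro: one F value per label, averaged over labels.
    macro = sum(
        f1(sum(tp[r][j] for r in range(n)),
           sum(Y[r][j] for r in range(n)),
           sum(H[r][j] for r in range(n)))
        for j in range(m)) / m
    # Micro: a single F value over all example-label pairs.
    micro = f1(sum(map(sum, tp)), sum(map(sum, Y)), sum(map(sum, H)))
    return instance, macro, micro
```

For instance, with true labels [[1, 0, 1], [0, 1, 0]] and predictions [[1, 1, 0], [0, 1, 0]], the three values differ (0.75, 5/9, and 2/3), which illustrates why a prediction rule tuned for one variant need not be optimal for another.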
These two variants of the F-measure are not directly optimized by the algorithms used in the experiment.\n\n^6 These datasets are taken from the MULAN (http://mulan.sourceforge.net/datasets.html) and LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html) repositories.\n\n[Figure 1, three panels. Left and center: F1 versus sample size. Right: time [s] versus # of labels. Legend: GFM, GFM-T, MEUF, MEUF Approx.]\n\n5 Discussion\n\nThe GFM algorithm can be considered for maximizing the macro F-measure, for example, in a similar setting as in [10], where a specific Bayesian on-line model is used. In order to maximize the macro F-measure, the authors sample from the graphical model to find an optimal threshold. The GFM algorithm may solve this problem optimally, since, as stated by the authors, the independence of labels is lost after integrating out the model parameters. Theoretically, one may also consider a direct maximization of the micro F-measure with GFM, but the computational burden is rather high in this case.\nInterestingly, there are no other MLC algorithms that maximize the F-measure in an instance-wise manner. We also cannot refer to other results already published in the literature, since usually only the micro- and macro-averaged F-measures are reported [20, 11]. This is rather surprising, especially since some closely related measures are often computed in the instance-wise manner in empirical studies. For example, the Jaccard distance (sometimes referred to as accuracy [21]), which differs from the F-measure in an additional term in the denominator, is commonly used in such a way.\nThe situation is slightly different in structured output prediction, where algorithms for instance-wise maximization of the F-measure do exist. These include, for example, struct SVM [6], SEARN [8], and a specific variant of CRFs [7]. 
Usually, these algorithms are based on additional assumptions, like label independence in struct SVM. The GFM algorithm can also be easily tailored for maximizing the instance-wise F-measure in structured output prediction, in a similar way as presented above. If the structured output classifier is able to model the joint distribution from which we can easily sample observations, then the use of the algorithm is straightforward. An application of this kind is planned as future work.\nSurprisingly, in both papers [8] and [6], experimental results are reported in terms of micro F-measure, although the algorithms maximize the instance-wise F-measure on the training set. Needless to say, one should not expect such an approach to result in optimal performance for the micro-averaged F-measure. Despite being related to each other, these two measures coincide only in the specific case where $\\sum_{i=1}^m (y_i + h_i)$ is constant for all test examples. The discrepancy between these measures strongly depends on the nature of the data and the classifier used. For high variability in $\\sum_{i=1}^m (y_i + h_i)$, a significant difference between the values of these two measures is to be expected.\nThe use of the GFM algorithm in binary classification seems to be superfluous, since in this case, the assumption of label independence is rather reasonable. MEUF seems to be the right choice for probabilistic classifiers, unless its application is prevented due to its computational complexity. Thresholding methods [5] or learning algorithms optimizing the F-measure directly [2, 3, 4] are probably the most appropriate solutions here.\n\n6 Conclusions\n\nIn contrast to other performance measures commonly used in experimental studies, such as misclassification error rate, squared loss, and AUC, the F-measure has been investigated less thoroughly from a theoretical point of view so far. 
In this paper, we analyzed the problem of optimal predictive inference from the joint distribution under the F-measure. While partial results were already known from the literature, we completed the picture by presenting the solution for the general case without any distributional assumptions. Our GFM algorithm requires only a polynomial number of parameters of the joint distribution and delivers the exact solution in polynomial time. From a theoretical perspective, GFM should be preferred to existing approaches, which typically perform threshold maximization on marginal probabilities, often relying on the assumption of (conditional) independence of labels.\n\nAcknowledgments. Krzysztof Dembczy\u0144ski started this work during his post-doctoral stay at Philipps-Universit\u00e4t Marburg supported by the German Research Foundation (DFG) and finalized it at Pozna\u0144 University of Technology under the grant 91-515/DS of the Polish Ministry of Science and Higher Education. Willem Waegeman is supported as a postdoc by the Research Foundation of Flanders (FWO-Vlaanderen). Part of this work was done during his visit at Philipps-Universit\u00e4t Marburg. Weiwei Cheng and Eyke H\u00fcllermeier are supported by DFG. We also thank the anonymous reviewers for their valuable comments.\n\nReferences\n\n[1] C. J. van Rijsbergen. Foundation of evaluation. Journal of Documentation, 30(4):365\u2013373, 1974.\n\n[2] David R. Musicant, Vipin Kumar, and Aysel Ozgur. Optimizing F-measure with support vector machines. In FLAIRS-16, 2003, pages 356\u2013360, 2003.\n\n[3] Thorsten Joachims. A support vector method for multivariate performance measures. In ICML 2005, pages 377\u2013384, 2005.\n\n[4] Martin Jansche. Maximum expected F-measure training of logistic regression models. In HLT/EMNLP 2005, pages 736\u2013743, 2005.\n\n[5] Sathiya Keerthi, Vikas Sindhwani, and Olivier Chapelle. 
An efficient method for gradient-based adaptation of hyperparameters in SVM models. In Advances in Neural Information Processing Systems 19, 2007.\n\n[6] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453\u20131484, 2005.\n\n[7] Jun Suzuki, Erik McDermott, and Hideki Isozaki. Training conditional random fields with multivariate evaluation measures. In ACL, pages 217\u2013224, 2006.\n\n[8] Hal Daum\u00e9 III, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 75:297\u2013325, 2009.\n\n[9] Rong-En Fan and Chih-Jen Lin. A study on threshold selection for multi-label classification. Technical report, Department of Computer Science, National Taiwan University, 2007.\n\n[10] Xinhua Zhang, Thore Graepel, and Ralf Herbrich. Bayesian online learning for multi-label and multi-variate performance measures. In AISTATS 2010, pages 956\u2013963, 2010.\n\n[11] James Petterson and Tiberio Caetano. Reverse multi-label learning. In Advances in Neural Information Processing Systems 23, pages 1912\u20131920, 2010.\n\n[12] Martin Jansche. A maximum expected utility framework for binary sequence labeling. In ACL 2007, pages 736\u2013743, 2007.\n\n[13] David Lewis. Evaluating and optimizing autonomous text classification systems. In SIGIR 1995, pages 246\u2013254, 1995.\n\n[14] Juan Jose del Coz, Jorge Diez, and Antonio Bahamonde. Learning nondeterministic classifiers. J. Mach. Learn. Res., 10:2273\u20132293, 2009.\n\n[15] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, 2nd edition. MIT Press, 2001.\n\n[16] Don Coppersmith and Shmuel Winograd. 
Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 9(3):251\u2013280, 1990.\n\n[17] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML 2001, pages 282\u2013289, 2001.\n\n[18] Nadia Ghamrawi and Andrew McCallum. Collective multi-label classification. In CIKM 2005, pages 195\u2013200, 2005.\n\n[19] Krzysztof Dembczy\u0144ski, Weiwei Cheng, and Eyke H\u00fcllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In ICML 2010, pages 279\u2013286, 2010.\n\n[20] Piyush Rai and Hal Daum\u00e9 III. Multi-label prediction via sparse infinite CCA. In Advances in Neural Information Processing Systems 22, pages 1518\u20131526, 2009.\n\n[21] Matthew R. Boutell, Jiebo Luo, Xipeng Shen, and Christopher M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757\u20131771, 2004.\n", "award": [], "sourceid": 815, "authors": [{"given_name": "Krzysztof", "family_name": "Dembczynski", "institution": null}, {"given_name": "Willem", "family_name": "Waegeman", "institution": null}, {"given_name": "Weiwei", "family_name": "Cheng", "institution": null}, {"given_name": "Eyke", "family_name": "H\u00fcllermeier", "institution": null}]}