{"title": "Consistent Multilabel Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 3321, "page_last": 3329, "abstract": "Multilabel classification is rapidly developing as an important aspect of modern predictive modeling, motivating study of its theoretical aspects. To this end, we propose a framework for constructing and analyzing multilabel classification metrics which reveals novel results on a parametric form for population optimal classifiers, and additional insight into the role of label correlations. In particular, we show that for multilabel metrics constructed as instance-, micro- and macro-averages, the population optimal classifier can be decomposed into binary classifiers based on the marginal instance-conditional distribution of each label, with a weak association between labels via the threshold. Thus, our analysis extends the state of the art from a few known multilabel classification metrics such as Hamming loss, to a general framework applicable to many of the classification metrics in common use. Based on the population-optimal classifier, we propose a computationally efficient and general-purpose plug-in classification algorithm, and prove its consistency with respect to the metric of interest. Empirical results on synthetic and benchmark datasets are supportive of our theoretical findings.", "full_text": "Consistent Multilabel Classi\ufb01cation\n\nOluwasanmi Koyejo\u21e4\nDepartment of Psychology,\n\nStanford University\n\nsanmi@stanford.edu\n\nNagarajan Natarajan\u21e4\n\nDepartment of Computer Science,\n\nUniversity of Texas at Austin\nnaga86@cs.utexas.edu\n\nPradeep Ravikumar\n\nDepartment of Computer Science,\n\nUniversity of Texas at Austin\n\npradeepr@cs.utexas.edu\n\nInderjit S. 
Dhillon\n\nDepartment of Computer Science,\n\nUniversity of Texas at Austin\n\ninderjit@cs.utexas.edu\n\nAbstract\n\nMultilabel classification is rapidly developing as an important aspect of modern\npredictive modeling, motivating study of its theoretical aspects. To this end, we\npropose a framework for constructing and analyzing multilabel classification metrics\nwhich reveals novel results on a parametric form for population optimal classifiers,\nand additional insight into the role of label correlations. In particular,\nwe show that for multilabel metrics constructed as instance-, micro- and macro-averages,\nthe population optimal classifier can be decomposed into binary classifiers\nbased on the marginal instance-conditional distribution of each label, with a\nweak association between labels via the threshold. Thus, our analysis extends the\nstate of the art from a few known multilabel classification metrics such as Hamming\nloss, to a general framework applicable to many of the classification metrics\nin common use. Based on the population-optimal classifier, we propose a computationally\nefficient and general-purpose plug-in classification algorithm, and prove\nits consistency with respect to the metric of interest. Empirical results on synthetic\nand benchmark datasets are supportive of our theoretical findings.\n\n1 Introduction\n\nModern classification problems often involve the prediction of multiple labels simultaneously associated\nwith a single instance e.g. image tagging by predicting multiple objects in an image. The\ngrowing importance of multilabel classification has motivated the development of several scalable\nalgorithms [8, 12, 18] and has led to the recent surge in theoretical analysis [1, 3, 7, 16] which helps\nguide and understand practical advances. 
While recent results have advanced our knowledge of\noptimal population classifiers and consistent learning algorithms for particular metrics such as the\nHamming loss and multilabel F-measure [3, 4, 5], a general understanding of learning with respect\nto multilabel classification metrics has remained an open problem. This is in contrast to the more\ntraditional settings of binary and multiclass classification where several recently established results\nhave led to a rich understanding of optimal and consistent classification [9, 10, 11]. This manuscript\nconstitutes a step towards establishing results for multilabel classification at the level of generality\ncurrently enjoyed only in these traditional settings.\n\n\u21e4Equal contribution.\n\nTowards a generalized analysis, we propose a framework for multilabel sample performance metrics\nand their corresponding population extensions. A classification metric is constructed to measure the\nutility of a classifier (equivalently, we may define the loss as the negative utility), as defined by the\npractitioner or end-user. The utility may be measured using\nthe sample metric given a finite dataset, and further generalized to the population metric with respect\nto a given data distribution (i.e. with respect to infinite samples). Two distinct approaches have been\nproposed for studying the population performance of classifiers in the classical settings of binary\nand multiclass classification, described by Ye et al. [17] as decision theoretic analysis (DTA) and\nempirical utility maximization (EUM). DTA population utilities measure the expected performance\nof a classifier on a fixed-size test set, while EUM population utilities are directly defined as a function\nof the population confusion matrix. However, state-of-the-art analysis of multilabel classification\nhas so far lacked such a distinction. 
The proposed framework defines both EUM and DTA multilabel\npopulation utility as generalizations of the aforementioned classic definitions. Using this framework,\nwe observe that existing work on multilabel classification [1, 3, 7, 16] has exclusively focused on\noptimizing the DTA utility of (specific) multilabel metrics.\nAveraging of binary classification metrics remains one of the most widely used approaches for defining\nmultilabel metrics. Given a binary label representation, such metrics are constructed via averaging\nwith respect to labels (instance-averaging), with respect to examples separately for each label\n(macro-averaging), or with respect to both labels and examples (micro-averaging). We consider a\nlarge sub-family of such metrics where the underlying binary metric can be constructed as a fraction\nof linear combinations of true positives, false positives, false negatives and true negatives [9].\nExamples in this family include the ubiquitous Hamming loss, the averaged precision, the multilabel\naveraged F-measure, and the averaged Jaccard measure, among others. Our key result is that a\nBayes optimal multilabel classifier for such metrics can be explicitly characterized in a simple form\n\u2013 the optimal classifier thresholds the label-wise conditional probability marginals, and the label dependence\nin the underlying distribution is relevant to the optimal classifier only through the threshold\nparameter. Further, the threshold is shared by all the labels when the metric is instance-averaged\nor micro-averaged. This result is surprising and, to our knowledge, a first result to be shown at this\nlevel of generality for multilabel classification. 
The result also sheds additional light on the role\nof label correlations in multilabel classification \u2013 answering prior conjectures by Dembczy\u0144ski et al.\n[3] and others.\nWe provide a plug-in estimation based algorithm that is efficient as well as theoretically consistent,\ni.e. the true utility of the empirical estimator approaches the optimal (EUM) utility of the Bayes\nclassifier (Section 4). We also present experimental evaluation on synthetic and real-world benchmark\nmultilabel datasets comparing different estimation algorithms (Section 5) for representative\nmultilabel performance metrics selected from the studied family. The results observed in practice\nare supportive of what the theory predicts.\n\n1.1 Related Work\n\nWe briefly highlight closely related theoretical results in the multilabel learning literature. Gao and\nZhou [7] consider the consistency of multilabel learning with respect to DTA utility, with a focus\non two specific losses \u2013 Hamming and rank loss (the corresponding measures are defined in Section\n2). Surrogate losses are devised which result in consistent learning with respect to these metrics.\nIn contrast, we propose a plug-in estimation based algorithm which directly estimates the Bayes\noptimal, without going through surrogate losses. Dembczynski et al. [2] analyze the DTA population\noptimal classifier for the multilabel rank loss, showing that the Bayes optimal is independent of label\ncorrelations in the unweighted case, and construct certain weighted univariate losses which are DTA\nconsistent surrogates in the more general weighted case. Perhaps the work most closely related\nto ours is by Dembczynski et al. [4] who propose a novel DTA consistent plug-in rule estimation\nbased algorithm for multilabel F-measure. Cheng et al. 
[1] consider optimizing popular losses in\nmultilabel learning such as Hamming, rank and subset 0/1 loss (which is the multilabel analog of\nthe classical 0-1 loss). They propose a probabilistic version of classifier chains (first introduced by\nRead et al. [13]) for estimating the Bayes optimal with respect to subset 0/1 loss, though without\nrigorous theoretical justification.\n\n2 A Framework for Multilabel Classification Metrics\n\nConsider multilabel classification with $M$ labels, where each instance is denoted by $x \\in \\mathcal{X}$. For\nconvenience, we will focus on the common binary encoding, where the labels are represented by\na vector $y \\in \\mathcal{Y} = \\{0, 1\\}^M$, so $y_m = 1$ iff the $m$th label is associated with the instance, and\n$y_m = 0$ otherwise. The goal is to learn a multilabel classifier $f : \\mathcal{X} \\mapsto \\mathcal{Y}$ that optimizes a certain\nperformance metric with respect to $P$ \u2013 a fixed data generating distribution over the domain $\\mathcal{X} \\times \\mathcal{Y}$,\nusing a training set of instance-label pairs $(x^{(n)}, y^{(n)})$, $n = 1, 2, \\ldots, N$ drawn (typically assumed\niid) from $P$. Let $X$ and $Y$ denote the random variables for instances and labels respectively, and let\n$\\Psi$ denote the performance (utility) metric of interest.\nMost classification metrics can be represented as functions of the entries of the confusion matrix. In\ncase of binary classification, the confusion matrix is specified by four numbers, i.e., true positives,\ntrue negatives, false positives and false negatives. 
Similarly, we construct the following primitives\nfor multilabel classification:\n\n$\\widehat{TP}(f)_{m,n} = \\llbracket f_m(x^{(n)}) = 1, y^{(n)}_m = 1 \\rrbracket$, $\\widehat{TN}(f)_{m,n} = \\llbracket f_m(x^{(n)}) = 0, y^{(n)}_m = 0 \\rrbracket$,\n$\\widehat{FP}(f)_{m,n} = \\llbracket f_m(x^{(n)}) = 1, y^{(n)}_m = 0 \\rrbracket$, $\\widehat{FN}(f)_{m,n} = \\llbracket f_m(x^{(n)}) = 0, y^{(n)}_m = 1 \\rrbracket$, (1)\n\nwhere $\\llbracket Z \\rrbracket$ denotes the indicator function that is 1 if the predicate $Z$ is true or 0 otherwise. 
It is clear\nthat most multilabel classification metrics considered in the literature can be written as a function of\nthe $MN$ primitives defined in (1).\nIn the following, we consider a construction which is of sufficient generality to capture all multilabel\nmetrics in common use. Let $A_k(f) : \\{\\widehat{TP}(f)_{m,n}, \\widehat{FP}(f)_{m,n}, \\widehat{TN}(f)_{m,n}, \\widehat{FN}(f)_{m,n}\\}_{m=1,n=1}^{M,N} \\mapsto \\mathbb{R}$,\n$k = 1, 2, \\ldots, K$ represent a set of $K$ functions. Consider sample multilabel metrics constructed as\nfunctions: $\\Psi : \\{A_k(f)\\}_{k=1}^K \\mapsto [0, \\infty)$. We note that the metric need not decompose over individual\ninstances. Equipped with this definition of a sample performance metric $\\Psi$, consider the population\nutility of a multilabel classifier $f$ defined as:\n\n$\\mathcal{U}(f; \\Psi, P) = \\Psi(\\{\\mathbb{E}[A_k(f)]\\}_{k=1}^K)$, (2)\n\nwhere the expectation is over iid draws from the joint distribution $P$. Note that this can be seen as\na multilabel generalization of the so-called Empirical Utility Maximization (EUM) style classifiers\nstudied in binary [9, 10] and multiclass [11] settings.\nOur goal is to learn a multilabel classifier that maximizes $\\mathcal{U}(f; \\Psi, P)$ for general performance metrics\n$\\Psi$. Define the (Bayes) optimal multilabel classifier as:\n\n$f^*_{\\Psi} = \\arg\\max_{f : \\mathcal{X} \\to \\{0,1\\}^M} \\mathcal{U}(f; \\Psi, P)$. (3)\n\nLet $\\mathcal{U}(f^*_{\\Psi}; \\Psi, P) = \\mathcal{U}^*_{\\Psi}$. We say that $\\hat{f}$ is a consistent estimator of $f^*_{\\Psi}$ if $\\mathcal{U}(\\hat{f}; \\Psi, P) \\xrightarrow{p} \\mathcal{U}^*_{\\Psi}$.\nExamples. 
The averaged accuracy (1 - Hamming loss) used in multilabel classification corresponds\nto simply choosing: $A_1(f) = \\frac{1}{MN} \\sum_{m=1}^{M} \\sum_{n=1}^{N} \\left( \\widehat{FP}(f)_{m,n} + \\widehat{FN}(f)_{m,n} \\right)$ and $\\Psi_{Ham}(f) = 1 - A_1(f)$.\nThe measure corresponding to rank loss^2 can be obtained by choosing $A_k(f) = \\frac{1}{M^2} \\sum_{m_1=1}^{M} \\sum_{m_2=1}^{M} \\left( \\widehat{FP}(f)_{m_1,k} \\right) \\left( \\widehat{FN}(f)_{m_2,k} \\right)$,\nfor $k = 1, 2, \\ldots, N$ and $\\Psi_{Rank} = 1 - \\frac{1}{N} \\sum_{k=1}^{N} A_k(f)$. Note that the choice of $\\{A_k\\}$, and therefore $\\Psi$, is not unique.\nRemark 1. Existing results on multilabel classification have focused on decision-theoretic analysis\n(DTA) style classifiers, where the utility is defined as:\n\n$\\mathcal{U}_{DTA}(f; \\Psi, P) = \\mathbb{E}\\left[ \\Psi(\\{A_k(f)\\}_{k=1}^K) \\right]$, (4)\n\nand the expectation is over iid samples from $P$. Furthermore, there are no theoretical results for\nconsistency with respect to general performance metrics in this setting (See Appendix B.2).\nFor the remainder of this manuscript, we refer to $\\mathcal{U}(f; P)$ as the utility defined in (2). We will also\ndrop the argument $f$ (e.g. write $\\widehat{TP}(f)$ as $\\widehat{TP}$) when it is clear from the context.\n\n^2 A subtle but important aspect of the definition of rank loss in the existing literature, including [2] and [7],\nis that the Bayes optimal is allowed to be a real-valued function and may not correspond to a label decision.\n\n2.1 A Framework for Averaged Binary Multilabel Classification Metrics\n\nThe most popular class of multilabel performance metrics consists of averaged binary performance\nmetrics, that correspond to particular settings of $\\{A_k(f)\\}$ using certain averages as described in the\nfollowing. 
For the remainder of this subsection, the metric $\\Psi : [0, 1]^4 \\to [0, \\infty)$ will refer to a binary\nclassification metric as is typically applied to a binary confusion matrix.\n\nMicro-averaging: Micro-averaged multilabel performance metrics $\\Psi_{micro}$ are defined by averaging\nover both labels and examples. Let:
Let:\n\ncTP(f ) =\n\n1\n\nM N\n\nNXn=1\n\nMXm=1cTP(f )m,n,\n\ncFP(f ) =\n\n1\n\nM N\n\nNXn=1\n\nMXm=1cFP(f )m,n,\n\n(5)\n\ngiven by:\n\ncTN(f ) andcFN(f ) are de\ufb01ned similarly, then the micro-averaged multilabel performance metrics are\n\n(6)\n micro({Ak(f )}K\nThus, for micro-averaging, one applies a binary performance metric to the confusion matrix de\ufb01ned\nby the averaged quantities described in (5).\n\nk=1) := ( cTP,cFP,cTN,cFN).\n\nMacro-averaging: The metric macro measures average classi\ufb01cation performance across labels.\nDe\ufb01ne the averaged measures:\n\ncTPm(f ) =\n\n1\nN\n\nNXn=1cTP(f )m,n,\n\ncTNm(f ) and cFNm(f ) are de\ufb01ned similarly. The macro-averaged performance metric is given by:\n\n(7)\n\n macro({Ak(f )}K\n\nk=1) :=\n\n1\nM\n\nMXm=1\n\nInstance-averaging: The metric instance measures the average classi\ufb01cation performance across\nexamples. De\ufb01ne the averaged measures:\n\n1\nN\n\nNXn=1cFP(f )m,n,\ncFPm(f ) =\n (cTPm,cFPm,cTNm,cFNm).\nMXm=1cFP(f )m,n,\n\ncFPn(f ) =\n (cTPn,cFPn,cTNn,cFNn).\n\n1\nM\n\ncTPn(f ) =\n\n1\nM\n\nMXm=1cTP(f )m,n,\n\ncTNn(f ) and cFNn(f ) are de\ufb01ned similarly. The instance-averaged performance metric is given by:\n\n(8)\n\n instance({Ak(f )}K\n\nk=1) :=\n\n1\nN\n\nNXn=1\n\n3 Characterizing the Bayes Optimal Classi\ufb01er for Multilabel Metrics\n\nWe now characterize the optimal multilabel classi\ufb01er for the large family of metrics outlined in\nSection 2.1 ( micro, macro and instance) with respect to the EUM utility. We begin by observing\nthat while micro-averaging and instance-averaging seem quite different when viewed as sample\naverages, they are in fact equivalent at the population level. Thus, we need only focus on micro to\ncharacterize instance as well.\nProposition 1. For a given binary classi\ufb01cation metric , consider the averaged multilabel metrics\n micro de\ufb01ned in (6) and instance de\ufb01ned in (8). For any f, U(f ; micro, P) \u2318U (f ; instance, P). 
In\nparticular, f\u21e4 \u21e4micro \u2318 f\u21e4 \u21e4instance\nWe further restrict our study to metrics selected from the linear-fractional metric family, recently\nstudied in the context of binary classi\ufb01cation [9]. Any in this family can be written as:\n\n.\n\n (cTP,cFP,cFN,cTN) =\n\na0 + a11cTP + a10cFP + a01cFN + a00cTN\nb0 + b11cTP + b10cFP + b01cFN + b00cTN\n\n,\n\nwhere a0, b0, aij, bij, i, j 2{ 0, 1} are \ufb01xed, andcTP,cFP,cFN,cTN are de\ufb01ned as in Section 2.1. Many\n\npopular multilabel metrics can be derived using linear-fractional . Some examples include3:\n\n(1 + 2)cTP\n\n(1 + 2)cTP + 2cFN +cFP\n\nHamming : Ham = cTP + cTN\n\n3Note that Hamming is typically de\ufb01ned as the loss, given by 1 Ham.\n\nJaccard : Jacc =\n\ncTP\ncTP +cFP +cFN\nPrecision : Prec = cTP\ncTP +cFP\n\n(9)\n\nF : F =\n\n4\n\n\fDe\ufb01ne the population quantities: \u21e1 =PM\nTP(f ) = EhcTP(f )i, where the expectation is over iid draws from P. From (5), it follows that,\nFP(f ) := EhcFP(f )i = (f ) TP(f ), TN(f ) = 1 \u21e1 (f ) + TP(f ) and FN(f ) = (f ) TP(f ).\n\nm=1 P(Ym = 1) and (f ) =PM\n\nNow, the population utility (2) corresponding to micro can be written succinctly as:\n\nm=1 P(fm(x) = 1). Let\n\nU(f ; micro, P) = ( TP(f ), FP(f ), FN(f ), TN(f )) =\n\nwith the constants:\n\nc0 + c1TP(f ) + c2(f )\nd0 + d1TP(f ) + d2(f )\n\n(10)\n\nc0 = a01\u21e1 + a00 a00\u21e1 + a0,\nd0 = b01\u21e1 + b00 b00\u21e1 + b0,\n\nc1 = a11 a10 a01 + a00,\nd1 = b11 b10 b01 + b00,\n\nc2 = a10 a00\nd2 = b10 b00.\n\nand\n\nWe assume that the joint P has a density \u00b5 that satis\ufb01es dP = \u00b5dx, and de\ufb01ne \u2318m(x) = P(Ym =\n1|X = x). Our \ufb01rst main result characterizes the Bayes optimal multilabel classi\ufb01er f\u21e4 micro.\nTheorem 2. 
Given the constants $\\{c_1, c_2, c_0\\}$ and $\\{d_1, d_2, d_0\\}$, define:\n\n$\\delta^* = \\frac{d_2 \\mathcal{U}^*_{\\Psi_{micro}} - c_2}{c_1 - d_1 \\mathcal{U}^*_{\\Psi_{micro}}}$. (11)\n\nThe optimal Bayes classifier $f^* := f^*_{\\Psi_{micro}}$ defined in (3) is given by:\n\n1. When $c_1 > d_1 \\mathcal{U}^*_{\\Psi_{micro}}$, $f^*$ takes the form $f^*_m(x) = \\llbracket \\eta_m(x) > \\delta^* \\rrbracket$, for $m \\in [M]$.\n2. When $c_1 < d_1 \\mathcal{U}^*_{\\Psi_{micro}}$, $f^*$ takes the form $f^*_m(x) = \\llbracket \\eta_m(x) < \\delta^* \\rrbracket$, for $m \\in [M]$.\n\nThe proof is provided in Appendix A.2, and applies equivalently to instance-averaging. Theorem 2\nrecovers existing results in binary [9] settings (See Appendix B.1 for details), and is sufficiently\ngeneral to capture many of the multilabel metrics used in practice. Our proof is closely related to\nthe binary classification case analyzed in Theorem 2 of [9], but differs in the additional averaging\nacross labels. A key observation from Theorem 2 is that the optimal multilabel classifier can be\nobtained by thresholding the marginal instance-conditional probability for each label $P(Y_m = 1 | x)$\nand, importantly, that the optimal classifiers for all the labels share the same threshold $\\delta^*$. Thus,\nthe effect of the joint distribution is only in the threshold parameter. We emphasize that while the\npresented results characterize the optimal population classifier, incorporating label correlations into\nthe prediction algorithm may have other benefits with finite samples, such as statistical efficiency\nwhen there are known structural similarities between the marginal distributions [3]. Further analysis\nis left for future work.\nThe Bayes optimal for the macro-averaged population metric is straightforward to establish. We\nobserve that the threshold is not shared in this case.\nProposition 3. 
For a given linear-fractional metric $\\Psi$, consider the macro-averaged multilabel\nmetric $\\Psi_{macro}$ defined in (7). Let $c_1 > d_1 \\mathcal{U}^*_{\\Psi_{macro}}$ and $f^*(x) = f^*_{\\Psi_{macro}}(x)$. We have, for $m = 1, 2, \\ldots, M$:\n\n$f^*_m(x) = \\llbracket \\eta_m(x) > \\delta^*_m \\rrbracket$,\n\nwhere $\\delta^*_m \\in [0, 1]$ is a constant that depends on the metric $\\Psi$ and the label-wise instance-conditional\nmarginals of $P$. Analogous results hold for $c_1 < d_1 \\mathcal{U}^*_{\\Psi_{macro}}$.\nRemark 2. It is clear that micro-, macro- and instance-averaging are equivalent at the population\nlevel when the metric $\\Psi$ is linear. This is a straightforward consequence of the observation that\nthe corresponding sample utilities are the same. More generally, micro-, macro- and instance-averaging\nare equivalent whenever the optimal threshold is a constant independent of $P$, such as for\nlinear metrics, where $d_1 = d_2 = 0$ so $\\delta^* = -\\frac{c_2}{c_1}$ (cf. Corollary 4 of Koyejo et al. [9]). Thus, our\nanalysis recovers known results for Hamming loss [3, 7].\n\n4 Consistent Plug-in Estimation Algorithm\n\nImportantly, the Bayes optimal characterization points to a simple plug-in estimation algorithm\nthat enjoys consistency as follows. First, one obtains an estimate $\\hat{\\eta}_m(x)$ of the marginal instance-conditional\nprobability $\\eta_m(x) = P(Y_m = 1 | x)$ for each label $m$ (see Reid and Williamson [14])\nusing a training sample. Then, the given metric $\\Psi_{micro}(f)$ is maximized on a validation sample. For\nthe remainder of this manuscript, we assume wlog. that $c_1 > d_1 \\mathcal{U}^*$. 
Note that in order to maximize\nover $\\{f : f_m(x) = \\llbracket \\eta_m(x) > \\delta \\rrbracket \\; \\forall m = 1, 2, \\ldots, M, \\; \\delta \\in (0, 1)\\}$, it suffices to optimize:\n\n$\\hat{\\delta} = \\arg\\max_{\\delta \\in (0,1)} \\Psi_{micro}(\\hat{f}_{\\delta})$, (12)\n\nwhere $\\Psi_{micro}$ is the micro-averaged sample metric defined in (6) (similarly for $\\Psi_{instance}$). Though the\nthreshold search is over a continuous space $\\delta \\in (0, 1)$ the number of distinct $\\Psi_{micro}(\\hat{f}_{\\delta})$ values given\na training sample of size $N$ is at most $NM$. 
Thus (12) can be solved efficiently on a finite sample.\n\nAlgorithm 1: Plugin-Estimator for $\\Psi_{micro}$ and $\\Psi_{instance}$\n\nInput: Training examples $S = \\{x^{(n)}, y^{(n)}\\}_{n=1}^{N}$ and metric $\\Psi_{micro}$ (or $\\Psi_{instance}$).\nfor $m = 1, 2, \\ldots, M$ do\n1. Select the training data for label $m$: $S_m = \\{x^{(n)}, y^{(n)}_m\\}_{n=1}^{N}$.\n2. Split the training data $S_m$ into two sets $S_{m1}$ and $S_{m2}$.\n3. Estimate $\\hat{\\eta}_m(x)$ using $S_{m1}$, define $\\hat{f}_{\\delta,m}(x) = \\llbracket \\hat{\\eta}_m(x) > \\delta \\rrbracket$.\nend for\nObtain $\\hat{\\delta}$ by solving (12) on $S_2 = \\cup_{m=1}^{M} S_{m2}$.\nReturn: $\\hat{f}_{\\hat{\\delta}}$.\n\nConsistency of the proposed algorithm. The following theorem shows that the plug-in procedure\nof Algorithm 1 results in a consistent classifier.\nTheorem 4. Let $\\Psi_{micro}$ be a linear-fractional metric. If the estimates $\\hat{\\eta}_m(x)$ satisfy $\\hat{\\eta}_m \\xrightarrow{p} \\eta_m$, $\\forall m$,\nthen the output multilabel classifier $\\hat{f}_{\\hat{\\delta}}$ of Algorithm 1 is consistent.\nThe proof is provided in Appendix A.4. From Proposition 1, it follows that consistency holds for\n$\\Psi_{instance}$ as well. Additionally, in light of Proposition 3, we may apply the learning algorithms\nproposed by [9] for binary classification independently for each label to obtain a consistent estimator\nfor $\\Psi_{macro}$.\n\n5 Experiments\n\nWe present two sets of results. The first is an experimental validation on synthetic data with known\nground truth probabilities. The results serve to verify our main result (Theorem 2) characterizing\nthe Bayes optimal for averaged multilabel metrics. 
The second is an experimental evaluation of\nthe plugin estimator algorithms for micro-, instance-, and macro-averaged multilabel metrics on\nbenchmark datasets.\n\n5.1 Synthetic data: Verification of Bayes optimal\n\nWe consider the micro-averaged $F_1$ metric in (9) for multilabel classification with 4 labels. We\nsample a set of five 2-dimensional vectors $x = \\{x^{(1)}, x^{(2)}, \\ldots, x^{(5)}\\}$ from the standard Gaussian.\nThe conditional probability $\\eta_m$ for label $m$ is modeled using a sigmoid function: $\\eta_m(x) = P(Y_m = 1 | x) = \\frac{1}{1 + \\exp(-w_m^T x)}$,\nusing a vector $w_m$ sampled from the standard Gaussian. The Bayes optimal\n$f^*(x) \\in \\{0, 1\\}^4$ that maximizes the micro-averaged $F_1$ population utility is then obtained by exhaustive\nsearch over all possible label vectors for each instance. In Figure 1 (a)-(d), we plot the\nconditional probabilities (wrt. the sample index) for each label, the corresponding $f^*_m$ for each $x$,\nand the optimal threshold $\\delta^*$ using (11). We observe that the optimal multilabel classifier indeed\nthresholds $P(Y_m | x)$ for each label $m$, and furthermore, that the threshold is the same for all the labels,\nas stated in Theorem 2.\n\nFigure 1 (panels (a)-(d)): Bayes optimal classifier for multilabel $F_1$ measure on synthetic data with 4 labels, and\ndistribution supported on 5 instances. Plots from left to right show the Bayes optimal classifier\nprediction for instances, and for labels 1 through 4. Note that the optimal $\\delta^*$ at which the label-wise\nmarginal $\\eta_m(x)$ is thresholded is shared, conforming to Theorem 2 (larger plots are included in\nAppendix C).\n\n5.2 Benchmark data: Evaluation of plug-in estimators\n\nWe now evaluate the proposed plugin-estimation (Algorithm 1) that is consistent for micro- and\ninstance-averaged multilabel metrics. We focus on two metrics, $F_1$ and Jaccard, listed in (9). 
We\ncompare Algorithm 1, designed to optimize micro-averaged (or instance-averaged) multilabel metrics\nto two related plugin-estimation methods: (i) a separate threshold $\\delta^*_m$ tuned for each label $m$\nindividually \u2013 this optimizes the utility corresponding to the macro-averaged metric, but is not consistent\nfor micro-averaged or instance-averaged metrics, and is the most common approach in practice. 
We refer to this as Macro-Thres, (ii) a constant threshold 1/2 for all the labels \u2013 this is known\nto be optimal for averaged accuracy (equiv. Hamming loss), but not for non-decomposable $F_1$ or\nJaccard metrics. We refer to this as Binary Relevance (BR) [15].\nWe use four benchmark multilabel datasets^4 in our experiments: (i) SCENE, an image dataset consisting\nof 6 labels, with 1211 training and 1196 test instances, (ii) BIRDS, an audio dataset consisting\nof 19 labels, with 323 training and 322 test instances, (iii) EMOTIONS, a music dataset consisting\nof 6 labels, with 393 training and 202 test instances, and (iv) CAL500, a music dataset consisting\nof 174 labels, with 400 training and 100 test instances^5. We perform logistic regression (with L2\nregularization) on a separate subsample to obtain estimates $\\hat{\\eta}_m(x)$ of $P(Y_m = 1 | x)$, for each label\n$m$ (as described in Section 4). All the methods we evaluate rely on obtaining a good estimator for\nthe conditional probability. So we exclude labels that are associated with very few instances \u2013 in\nparticular, we train and evaluate using labels associated with at least 20 instances, in each dataset,\nfor all the methods.\nIn Table 1, we report the micro-averaged $F_1$ and Jaccard metrics on the test set for Algorithm 1,\nMacro-Thres and Binary Relevance. We observe that estimating a fixed threshold for all the labels\n(Algorithm 1) consistently performs better than estimating thresholds for each label (Macro-Thres)\nand than using threshold 1/2 for all labels (BR); this conforms to our main result in Theorem 2 and\nthe consistency analysis of Algorithm 1 in Theorem 4. A similar trend is observed for the instance-averaged\nmetrics computed on the test set, shown in Table 2. Proposition 1 shows that maximizing\nthe population utilities of micro-averaged and instance-averaged metrics are equivalent; the result\nholds in practice as presented in Table 2. 
Finally, we report macro-averaged metrics computed on\nthe test set in Table 3. We observe that Macro-Thres is competitive in 3 out of 4 datasets; this conforms\nto Proposition 3 which shows that in the case of macro-averaged metrics, it is optimal to tune a\nthreshold specific to each label independently. Beyond consistency, we note that by using more\nsamples, joint threshold estimation enjoys additional statistical efficiency, while separate threshold\nestimation enjoys greater flexibility. This trade-off may explain why Algorithm 1 achieves the best\nperformance in three out of four datasets in Table 3, though it is not consistent for macro-averaged\nmetrics.\n\n^4 The datasets were obtained from http://mulan.sourceforge.net/datasets-mlc.html.\n^5 Original CAL500 dataset does not provide splits; we split the data randomly into train and test sets.\n\nDATASET | F1: BR | F1: Algorithm 1 | F1: Macro-Thres | Jaccard: BR | Jaccard: Algorithm 1 | Jaccard: Macro-Thres\nSCENE | 0.6559 | 0.6847 \u00b1 0.0072 | 0.6631 \u00b1 0.0125 | 0.4878 | 0.5151 \u00b1 0.0084 | 0.5010 \u00b1 0.0122\nBIRDS | 0.4040 | 0.4088 \u00b1 0.0130 | 0.2871 \u00b1 0.0734 | 0.2495 | 0.2648 \u00b1 0.0095 | 0.1942 \u00b1 0.0401\nEMOTIONS | 0.5815 | 0.6554 \u00b1 0.0069 | 0.6419 \u00b1 0.0174 | 0.3982 | 0.4908 \u00b1 0.0074 | 0.4790 \u00b1 0.0077\nCAL500 | 0.3647 | 0.4891 \u00b1 0.0035 | 0.4160 \u00b1 0.0078 | 0.2229 | 0.3225 \u00b1 0.0024 | 0.2608 \u00b1 0.0056\nTable 1: Comparison of plugin-estimator methods on multilabel F1 and Jaccard metrics. Reported\nvalues correspond to micro-averaged metric (F1 and Jaccard) computed on test data (with standard\ndeviation, over 10 random validation sets for tuning thresholds). 
Algorithm 1 is consistent for micro-averaged\nmetrics, and performs the best consistently across datasets.\n\nDATASET | F1: BR | F1: Algorithm 1 | F1: Macro-Thres | Jaccard: BR | Jaccard: Algorithm 1 | Jaccard: Macro-Thres\nSCENE | 0.5695 | 0.6422 \u00b1 0.0206 | 0.6303 \u00b1 0.0167 | 0.5466 | 0.5976 \u00b1 0.0177 | 0.5902 \u00b1 0.0176\nBIRDS | 0.1209 | 0.1390 \u00b1 0.0110 | 0.1390 \u00b1 0.0259 | 0.1058 | 0.1239 \u00b1 0.0077 | 0.1195 \u00b1 0.0096\nEMOTIONS | 0.4787 | 0.6241 \u00b1 0.0204 | 0.6156 \u00b1 0.0170 | 0.4078 | 0.5340 \u00b1 0.0072 | 0.5173 \u00b1 0.0086\nCAL500 | 0.3632 | 0.4855 \u00b1 0.0035 | 0.4135 \u00b1 0.0079 | 0.2268 | 0.3252 \u00b1 0.0024 | 0.2623 \u00b1 0.0055\nTable 2: Comparison of plugin-estimator methods on multilabel F1 and Jaccard metrics. Reported\nvalues correspond to instance-averaged metric (F1 and Jaccard) computed on test data (with standard\ndeviation, over 10 random validation sets for tuning thresholds). Algorithm 1 is consistent for\ninstance-averaged metrics, and performs the best consistently across datasets.\n\nDATASET | F1: BR | F1: Algorithm 1 | F1: Macro-Thres | Jaccard: BR | Jaccard: Algorithm 1 | Jaccard: Macro-Thres\nSCENE | 0.6601 | 0.6941 \u00b1 0.0205 | 0.6737 \u00b1 0.0137 | 0.5046 | 0.5373 \u00b1 0.0177 | 0.5260 \u00b1 0.0176\nBIRDS | 0.3366 | 0.3448 \u00b1 0.0110 | 0.2971 \u00b1 0.0267 | 0.2178 | 0.2341 \u00b1 0.0077 | 0.2051 \u00b1 0.0215\nEMOTIONS | 0.5440 | 0.6450 \u00b1 0.0204 | 0.6440 \u00b1 0.0164 | 0.3982 | 0.4912 \u00b1 0.0072 | 0.4900 \u00b1 0.0133\nCAL500 | 0.1293 | 0.2687 \u00b1 0.0035 | 0.3226 \u00b1 0.0068 | 0.0880 | 0.1834 \u00b1 0.0024 | 0.2146 \u00b1 0.0036\nTable 3: Comparison of plugin-estimator methods on multilabel F1 and Jaccard metrics. Reported\nvalues correspond to the macro-averaged metric computed on test data (with standard deviation,\nover 10 random validation sets for tuning thresholds). 
Macro-Thres is consistent for macro-averaged\nmetrics, and is competitive in three out of four datasets. Though not consistent for macro-averaged\nmetrics, Algorithm 1 achieves the best performance in three out of four datasets.\n\n6 Conclusions and Future Work\n\nWe have proposed a framework for the construction and analysis of multilabel classification metrics\nand corresponding population optimal classifiers. Our main result is that for a large family of averaged\nperformance metrics, the EUM optimal multilabel classifier can be explicitly characterized by\nthresholding of label-wise marginal instance-conditional probabilities, with weak label dependence\nvia a shared threshold. We have also proposed efficient and consistent estimators for maximizing\nsuch multilabel performance metrics in practice. Our results are a step forward in the direction of\nextending the state-of-the-art understanding of learning with respect to general metrics in binary and\nmulticlass settings. Our work opens up many interesting research directions, including the potential\nfor further generalization of our results beyond averaged metrics, and generalized results for DTA\npopulation optimal classification, which is currently only well-understood for the F-measure.\nAcknowledgments: We acknowledge the support of NSF via CCF-1117055, CCF-1320746 and IIS-1320894,\nand NIH via R01 GM117594-01 as part of the Joint DMS/NIGMS Initiative to Support\nResearch at the Interface of the Biological and Mathematical Sciences.\n\nReferences\n[1] Weiwei Cheng, Eyke H\u00fcllermeier, and Krzysztof J Dembczynski. Bayes optimal multilabel\nclassification via probabilistic classifier chains. In Proceedings of the 27th International Conference\non Machine Learning (ICML-10), pages 279\u2013286, 2010.\n\n[2] Krzysztof Dembczynski, Wojciech Kotlowski, and Eyke H\u00fcllermeier. 
Consistent multilabel ranking through univariate losses. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1319\u20131326, 2012.\n\n[3] Krzysztof Dembczy\u0144ski, Willem Waegeman, Weiwei Cheng, and Eyke H\u00fcllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5\u201345, 2012.\n\n[4] Krzysztof Dembczynski, Arkadiusz Jachnik, Wojciech Kotlowski, Willem Waegeman, and Eyke H\u00fcllermeier. Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization. In Proceedings of the 30th International Conference on Machine Learning, pages 1130\u20131138, 2013.\n\n[5] Krzysztof J Dembczynski, Willem Waegeman, Weiwei Cheng, and Eyke H\u00fcllermeier. An exact algorithm for F-measure maximization. In Advances in Neural Information Processing Systems, pages 1404\u20131412, 2011.\n\n[6] Luc Devroye. A probabilistic theory of pattern recognition, volume 31. Springer, 1996.\n\n[7] Wei Gao and Zhi-Hua Zhou. On the consistency of multi-label learning. Artificial Intelligence, 199:22\u201344, 2013.\n\n[8] Ashish Kapoor, Raajay Viswanathan, and Prateek Jain. Multilabel classification using bayesian compressed sensing. In Advances in Neural Information Processing Systems, pages 2645\u20132653, 2012.\n\n[9] Oluwasanmi O Koyejo, Nagarajan Natarajan, Pradeep K Ravikumar, and Inderjit S Dhillon. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems, pages 2744\u20132752, 2014.\n\n[10] Harikrishna Narasimhan, Rohit Vaish, and Shivani Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pages 1493\u20131501, 2014.\n\n[11] Harikrishna Narasimhan, Harish Ramaswamy, Aadirupa Saha, and Shivani Agarwal.
Consistent multiclass algorithms for complex performance measures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2398\u20132407, 2015.\n\n[12] James Petterson and Tib\u00e9rio S Caetano. Submodular multi-label learning. In Advances in Neural Information Processing Systems, pages 1512\u20131520, 2011.\n\n[13] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine learning, 85(3):333\u2013359, 2011.\n\n[14] Mark D Reid and Robert C Williamson. Composite binary losses. The Journal of Machine Learning Research, 9999:2387\u20132422, 2010.\n\n[15] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data mining and knowledge discovery handbook, pages 667\u2013685. Springer, 2010.\n\n[16] Willem Waegeman, Krzysztof Dembczynski, Arkadiusz Jachnik, Weiwei Cheng, and Eyke H\u00fcllermeier. On the bayes-optimality of f-measure maximizers. Journal of Machine Learning Research, 15:3333\u20133388, 2014.\n\n[17] Nan Ye, Kian Ming A Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measures: a tale of two approaches. In Proceedings of the International Conference on Machine Learning, 2012.\n\n[18] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning, pages 593\u2013601, 2014.\n", "award": [], "sourceid": 1837, "authors": [{"given_name": "Oluwasanmi", "family_name": "Koyejo", "institution": "Stanford University"}, {"given_name": "Nagarajan", "family_name": "Natarajan", "institution": "UT Austin"}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": "University of Texas at Austin"}, {"given_name": "Inderjit", "family_name": "Dhillon", "institution": "University of Texas at Austin"}]}