{"title": "Reverse Multi-Label Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1912, "page_last": 1920, "abstract": "Multi-label classification is the task of predicting potentially multiple labels for a given instance. This is common in several applications such as image annotation, document classification and gene function prediction. In this paper we present a formulation for this problem based on reverse prediction: we predict sets of instances given the labels. By viewing the problem from this perspective, the most popular quality measures for assessing the performance of multi-label classification admit relaxations that can be efficiently optimised. We optimise these relaxations with standard algorithms and compare our results with several state-of-the-art methods, showing excellent performance.", "full_text": "Reverse Multi-Label Learning

James Petterson
NICTA, Australian National University
Canberra, ACT, Australia
james.petterson@nicta.com.au

Tiberio Caetano
NICTA, Australian National University
Canberra, ACT, Australia
tiberio.caetano@nicta.com.au

Abstract

Multi-label classification is the task of predicting potentially multiple labels for a given instance. This is common in several applications such as image annotation, document classification and gene function prediction. In this paper we present a formulation for this problem based on reverse prediction: we predict sets of instances given the labels. By viewing the problem from this perspective, the most popular quality measures for assessing the performance of multi-label classification admit relaxations that can be efficiently optimised. 
We optimise these relaxations with standard algorithms and compare our results with several state-of-the-art methods, showing excellent performance.

1 Introduction

Recently, multi-label classification (MLC) has been drawing increasing attention from the machine learning community (e.g., [1, 2, 3, 4]). Unlike in the case of multi-class learning, in MLC each instance may belong to multiple classes simultaneously. This reflects the situation in many real-world problems: in document classification, one document can cover multiple subjects; in biology, a gene can be associated with a set of functional classes [5]; in image annotation, one image can have several tags [6].
As diverse as the applications, however, are the evaluation measures used to assess the performance of different methods. That is understandable, since different applications have different goals. In e-discovery applications [7] it is mandatory that all relevant documents are retrieved, so recall is the most relevant measure. In web search, on the other hand, precision is also important, so the F1-score, which is the harmonic mean of precision and recall, might be more appropriate.
In this paper we present a method for MLC which is able to optimise appropriate surrogates for a variety of performance measures. This means that the objective function being optimised by the method is tailored to the performance measure on which we want to do well in our specific application. This is in contrast particularly with probabilistic approaches, which typically aim for maximisation of likelihood scores rather than the performance measure used to assess the quality of the results. 
In addition, the method is based on well-understood facts from the domain of structured output learning, which gives us theoretical guarantees regarding the accuracy of the results obtained. Finally, source code is made available by us.
An interesting aspect of the method is that we are only able to optimise the desired performance measures because we formulate the prediction problem in a reverse manner, in the spirit of [8]. We pose the prediction problem as predicting sets of instances given the labels. When this insight is fit into max-margin structured output methods, we obtain surrogate losses for the most widely used performance measures for multi-label classification. We perform experiments against state-of-the-art methods in five publicly available benchmark datasets for MLC, and the proposed approach is the best performing overall.

1.1 Related Work
The literature on this topic is vast and we cannot possibly do justice to it here, since a comprehensive review is clearly impractical. Instead, we focus particularly on some state-of-the-art approaches that have been tested on publicly available benchmark datasets for MLC, which facilitates a fair comparison against our method. A straightforward way to deal with multiple labels is to solve a binary classification problem for each one of them, treating them independently. This approach is known as the Binary Method (BM) [9]. Classifier Chains (CC) [4] extends that by building a chain of binary classifiers, one for each possible label, but with each classifier augmented by all prior relevance predictions. Since the order of the classifiers in the chain is arbitrary, the authors also propose an ensemble method – Ensemble of Classifier Chains (ECC) – where several random chains are combined with a voting scheme. 
Probabilistic Classifier Chains (PCC) [1] extends CC to the probabilistic setting, with EPCC [1] being its corresponding ensemble method. Another way of working with multiple labels is to consider each possible set of labels as a class, thus encoding the problem as single-label classification. The problem with that is the exponentially large number of classes. RAndom K-labELsets (RAKEL) [10] deals with that by proposing an ensemble of classifiers, each one taking a small random subset of the labels and learning a single-label classifier for the prediction of each element in the power set of this subset. Other proposed ensemble methods are Ensemble of Binary Method (EBM) [4], which applies a simple voting scheme to a set of BM classifiers, and Ensemble of Pruned Sets (EPS) [11], which combines a set of Pruned Sets (PS) classifiers. PS is essentially a problem transformation method that maps sets of labels to single labels while pruning away infrequently occurring sets. Canonical Correlation Analysis (CCA) [3] exploits label relatedness by using a probabilistic interpretation of CCA as a dimensionality reduction technique and applying it to learn useful predictive features for multi-label learning. Meta Stacking (MS) [12] also exploits label relatedness by combining text features and features indicating relationships between classes in a discriminative framework.
Two papers closely related to ours from the methodological point of view, which are however not tailored particularly to the multi-label learning problem, are [13] and [14]. In [13] the author proposes a smooth but non-concave relaxation of the F-measure for binary classification problems using a logistic regression classifier, and optimisation is performed by taking the maximum across several runs of BFGS starting from random initial values. 
In [14] the author proposes a method for optimising multivariate performance measures in a general setting in which the loss function is not assumed to be additive in the instances nor in the labels. The method also consists of optimising a convex relaxation of the derived losses. The key difference of our method is that we have a specialised convex relaxation for the case in which the loss does not decompose over the instances, but does decompose over the labels.

2 The Model
Let the input x ∈ X denote a label (e.g., a tag of an image), and the output y ∈ Y denote a set of instances (e.g., a set of training images). Let N = |X| be the number of labels and V be the number of instances. An input label x is encoded as x ∈ {0, 1}^N, s.t. ∑_i x_i = 1. For example, if N = 5 the second label is denoted as x = [0 1 0 0 0]. An output set of instances y is encoded as y ∈ {0, 1}^V (Y := {0, 1}^V), and y^n_v = 1 iff instance v was annotated with label n. For example, if V = 10 and only instances 1 and 3 are annotated with label 2, then the y corresponding to x = [0 1 0 0 0] is y = [1 0 1 0 0 0 0 0 0 0]. We assume a given training set {(x^n, y^n)}_{n=1}^N, where {x^n}_{n=1}^N comprises the entirety of labels available ({x^n}_{n=1}^N = X), and {y^n}_{n=1}^N represents the sets of instances associated to those labels. The task consists of estimating a map f : X → Y which reproduces well the outputs of the training set (i.e., f(x^n) ≈ y^n) but also generalises well to new test instances.

2.1 Loss Functions
The reason for this reverse prediction is the following: most widely accepted performance measures target information retrieval (IR) applications – that is, given a label we want to find a set of relevant instances. As a consequence, the measures are averaged over the set of possible labels. 
This is the case for, in particular, Macro-precision, Macro-recall, Macro-F_β¹ and Hamming loss [10]:

Macro-precision = (1/N) ∑_{n=1}^N p(y^n, ȳ^n),    Macro-recall = (1/N) ∑_{n=1}^N r(y^n, ȳ^n),

Macro-F_β = (1 + β²) (1/N) ∑_{n=1}^N p(y^n, ȳ^n) r(y^n, ȳ^n) / (β² p(y^n, ȳ^n) + r(y^n, ȳ^n)),    Hamming loss = (1/N) ∑_{n=1}^N h(y^n, ȳ^n),

where

p(y, ȳ) = yᵀȳ / ȳᵀȳ,    r(y, ȳ) = yᵀȳ / yᵀy,    h(y, ȳ) = (yᵀ1 + ȳᵀ1 − 2yᵀȳ) / V.

¹Macro-F1 is the particular case of this when β equals 1. Macro-precision and macro-recall are particular cases of Macro-F_β for β → 0 and β → ∞, respectively.

Here, ȳ^n is our prediction for input label n, and y^n the corresponding ground-truth. Since these measures average over the labels, in order to optimise them we need to average over the labels as well, and this happens naturally in a setting in which the empirical risk is additive on the labels.²
Instead of maximising a performance measure we frame the problem as minimising a loss function associated to the performance measure. We assume a known loss function Δ: Y × Y → R⁺ which assigns a non-negative number to every possible pair of outputs. This loss function represents how much we want to penalise a prediction ȳ when the correct prediction is y, i.e., it has the opposite semantics of a performance measure. As already mentioned, we will be able to deal with a variety of loss functions in this framework, but for concreteness of exposition we will focus on a loss derived from the Macro-F_β score defined above, whose particular case for β equal to 1 (F1) is arguably the most popular performance measure for multi-label classification. 
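As a concrete illustration of the definitions above, all four per-label quantities can be computed directly from binary indicator vectors. The following is a minimal NumPy sketch (the function names and the toy vectors are ours, not from the paper):

```python
import numpy as np

def precision(y, ybar):
    return y @ ybar / ybar.sum()          # p(y, ybar) = y'ybar / ybar'ybar

def recall(y, ybar):
    return y @ ybar / y.sum()             # r(y, ybar) = y'ybar / y'y

def f_beta(y, ybar, beta=1.0):
    # (1 + b^2) y'ybar / (b^2 y'y + ybar'ybar); for binary y, y'y = y.sum()
    return (1 + beta**2) * (y @ ybar) / (beta**2 * y.sum() + ybar.sum())

def hamming(y, ybar):
    return (y.sum() + ybar.sum() - 2 * (y @ ybar)) / len(y)

# Label 2 annotates instances 1 and 3 (the paper's V = 10 example);
# a hypothetical prediction marks instances 1 and 4 instead.
y    = np.array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0])
ybar = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0])
print(precision(y, ybar), recall(y, ybar), f_beta(y, ybar), hamming(y, ybar))
# -> 0.5 0.5 0.5 0.2
```

Averaging each quantity over all N labels then gives the corresponding macro-measure.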
In our notation, the F_β score of a given prediction is

F_β(y, ȳ) = (1 + β²) yᵀȳ / (β² yᵀy + ȳᵀȳ),   (1)

and since F_β is a score of alignment between y and ȳ, one possible choice for the loss is Δ(y, ȳ) = 1 − F_β(y, ȳ), which is the one we focus on in this paper,

Δ(y, ȳ) = 1 − (1 + β²) yᵀȳ / (β² yᵀy + ȳᵀȳ).   (2)

2.2 Features and Parameterization
Our next assumption is that the prediction for a given input x returns the maximiser(s) of a linear score of the model parameter vector θ, i.e., a prediction is given by ȳ such that³

ȳ ∈ argmax_{y∈Y} ⟨φ(x, y), θ⟩.   (3)

Here we assume that φ(x, y) is linearly composed of features of the instances encoded in each y_v, i.e., φ(x, y) = ∑_{v=1}^V y_v (ψ_v ⊗ x). The vector ψ_v is the feature representation for instance v. The v-th term of φ(x, y) is the zero vector whenever y_v = 0, i.e., when instance v does not have label x. The feature map φ(x, y) has a total of DN dimensions, where D is the dimensionality of our instance features (ψ_v) and N is the number of labels. Therefore DN is the dimensionality of our parameter θ to be learned.

2.3 Optimisation Problem
We are now ready to formulate our estimator. We assume an initial, ‘ideal’ estimator taking the form

θ* = argmin_θ { (1/N) ∑_{n=1}^N Δ(ȳ^n(x^n; θ), y^n) + (λ/2)‖θ‖² }.   (4)

In other words, we want to find a model that minimises the average prediction loss in the training set plus a quadratic regulariser that penalises complex solutions (the parameter λ determines the trade-off between data fitting and good generalisation). 
Estimators of this type are known as regularised risk minimisers [15].

²The Hamming loss also averages over the instances, so it can be optimised in the ‘normal’ (not reverse) direction as well.
³⟨A, B⟩ denotes the inner product of the vectorized versions of A and B.

3 Optimisation

3.1 Convex Relaxation
The optimisation problem (4) is non-convex. Even more critically, the loss is a piecewise constant function of θ.⁴ A similar problem occurs when one aims at optimising a 0/1 loss in binary classification; in that case, a typical workaround consists of minimising a surrogate convex loss function which upper bounds the 0/1 loss, for example the hinge loss, which gives rise to the support vector machine. Here we use an analogous approach, notably popularised in [16], which optimises a convex upper bound on the structured loss of (4). The resulting optimisation problem is

[θ*, ξ*] = argmin_{θ,ξ} { (1/N) ∑_{n=1}^N ξ_n + (λ/2)‖θ‖² }   (5)
s.t. ⟨φ(x^n, y^n), θ⟩ − ⟨φ(x^n, y), θ⟩ ≥ Δ(y, y^n) − ξ_n,  ξ_n ≥ 0,  ∀n, y ∈ Y.   (6)

It is easy to see that ξ*_n upper bounds Δ(ȳ^n_*, y^n) (and therefore the objective in (5) upper bounds that of (4) for the optimal solution). Here, ȳ^n_* := argmax_y ⟨φ(x^n, y), θ*⟩. First note that since the constraints (6) hold for all y, they also hold for ȳ^n_*. 
Second, the left-hand side of the inequality for y = ȳ^n_* must be non-positive from the definition of ȳ in equation (3). It then follows that ξ*_n ≥ Δ(ȳ^n_*, y^n).
The constraints (6) basically enforce a loss-sensitive margin: θ is learned so that mispredictions y that incur some loss end up with a score ⟨φ(x^n, y), θ⟩ that is smaller than the score ⟨φ(x^n, y^n), θ⟩ of the correct prediction y^n by a margin equal to that loss (minus slack ξ_n). The formulation is a generalisation of support vector machines to the case in which there is an exponential number of classes y. It is in this sense that our approach is somewhat related in spirit to [10], as mentioned in the Introduction. However, as described below, here we can use a method for selecting a polynomial number of constraints which provably approximates the original problem well.
The optimisation problem (5) has N|Y| = N·2^V constraints. Naturally, this number is too large to allow for a practical solution of the quadratic program. Here we resort to a constraint-generation strategy, which consists of starting with no constraints and iteratively adding the most violated constraint for the current solution of the optimisation problem. Such an approach is assured to find an ε-close approximation of the solution of (5) after including only O(ε⁻²) constraints [16]. 
The key problem that needs to be solved at each iteration is constraint generation, i.e., to find the maximiser of the violation margin ξ_n,

y*^n ∈ argmax_{y∈Y} [Δ(y, y^n) + ⟨φ(x^n, y), θ⟩].   (7)

The difficulty in solving the above optimisation problem depends on the choice of φ(x, y) and Δ. Next we investigate how this problem can be solved for our particular choices of these quantities.

3.2 Constraint generation
Using eq. (2) and φ(x, y) = ∑_{v=1}^V y_v (ψ_v ⊗ x), eq. (7) becomes

y*^n ∈ argmax_{y∈Y} ⟨y, z^n⟩,   (8)

where

z^n = Ψθ_n − (1 + β²) y^n / (‖y‖² + β²‖y^n‖²),   (9)

and
• Ψ is a V × D matrix with row v corresponding to ψ_v;
• θ_n is the nth column of matrix θ.

⁴There is a countable number of loss values but an uncountable number of parameters, so there are large equivalence classes of parameters that correspond to precisely the same loss.

Algorithm 1 Reverse Multi-Label Learning
1: Input: training set {(x^n, y^n)}_{n=1}^N, λ, β. Output: θ
2: Initialize i = 1, θ¹ = 0
3: repeat
4:   for n = 1 to N do
5:     Compute y*^n (naïve: Algorithm 2; improved: see Appendix)
6:   end for
7:   Compute gradient g_i (equation (12)) and objective o_i (equation (11))
8:   θ^{i+1} := argmin_θ (λ/2)‖θ‖² + max(0, max_{j≤i} ⟨g_j, θ⟩ + o_j); i ← i + 1
9: until converged (see [18])
10: return θ

Algorithm 2 Naïve Constraint Generation
1: Input: (x^n, y^n), Ψ, θ, β, V. Output: y*^n
2: MAX = −∞
3: for k = 1 to V do
4:   z^n = Ψθ_n − (1 + β²) y^n / (k + β²‖y^n‖²)
5:   y* = argmax_{y∈Y_k} ⟨y, z^n⟩ (i.e., find the top k entries of z^n, in O(V) time)
6:   CURRENT = max_{y∈Y_k} ⟨y, z^n⟩
7:   if CURRENT > MAX then
8:     MAX = CURRENT
9:     y*^n = y*
10:  end if
11: end for
12: return y*^n

We now investigate how to solve (8) for a fixed θ. For the purpose of clarity, here we describe a simple, naïve algorithm; in the appendix we present a more involved but much faster algorithm. A simple algorithm can be obtained by first noticing that z^n depends on y only through the number of its nonzero elements. Consider the set of all y with precisely k nonzero elements, i.e., Y_k := {y : ‖y‖² = k}. Then the objective in (8), if the maximisation is instead restricted to the domain Y_k, is effectively linear in y, since z^n in this case is a constant w.r.t. y. Therefore we can solve separately for each Y_k by finding the top k entries of z^n. Finding the top k elements of a list of size V can be done in O(V) time [17], so we have an O(V²) algorithm overall (for every k from 1 to V, solve argmax_{y∈Y_k} ⟨y, z^n⟩ in O(V) expected time). Algorithm 1 describes the optimisation in detail, as solved by BMRM [18], and Algorithm 2 shows the naïve constraint-generation routine. The BMRM solver requires both the value of the objective function for the slack corresponding to the most violated constraint and its gradient. 
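The naïve constraint generation is easy to express in code. The sketch below is ours (not the authors' C++ implementation); it assumes dense NumPy arrays, with illustrative names like psi and theta_n, and uses argpartition for the top-k step. It also shows the test-time rule, where the loss term of z vanishes and prediction reduces to thresholding Ψθ_n at zero:

```python
import numpy as np

def constraint_generation(y_n, psi, theta_n, beta):
    """Naive Algorithm-2 sketch: maximise <y, z> over each cardinality class Y_k.

    y_n     : (V,) binary ground-truth vector for label n
    psi     : (V, D) instance feature matrix (rows psi_v)
    theta_n : (D,) n-th column of the parameter matrix
    """
    V = len(y_n)
    scores = psi @ theta_n                      # first term of z^n
    best_val, best_y = -np.inf, None
    for k in range(1, V + 1):                   # ||y||^2 = k fixed => z is constant
        z = scores - (1 + beta**2) * y_n / (k + beta**2 * y_n.sum())
        top_k = np.argpartition(z, -k)[-k:]     # top-k entries, O(V) on average
        y = np.zeros(V)
        y[top_k] = 1.0
        val = y @ z
        if val > best_val:
            best_val, best_y = val, y
    return best_y

def predict(psi, theta_n):
    # At test time the loss term is absent, so the prediction simply
    # indicates the positive entries of z = psi @ theta_n.
    return (psi @ theta_n > 0).astype(float)
```

The argpartition call realises the O(V) top-k selection the text cites from [17], and the outer loop over k gives the O(V²) total cost.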
The value of the slack variable corresponding to y*^n is

ξ*_n = Δ(y*^n, y^n) + ⟨φ(x^n, y*^n), θ⟩ − ⟨φ(x^n, y^n), θ⟩,   (10)

thus the objective function from (5) becomes

(1/N) ∑_n [Δ(y*^n, y^n) + ⟨φ(x^n, y*^n), θ⟩ − ⟨φ(x^n, y^n), θ⟩] + (λ/2)‖θ‖²,   (11)

whose gradient (with respect to θ) is

λθ − (1/N) ∑_n (φ(x^n, y^n) − φ(x^n, y*^n)).   (12)

We need both expressions (11) and (12) in Algorithm 1.

3.3 Prediction at Test Time
The problem to be solved at test time (eq. (3)) has the same form as the problem of constraint generation (eq. (7)), the only difference being that z^n = Ψθ_n (i.e., the second term in eq. (9), due to the loss, is not present). Since z^n is then a constant vector, the solution of (3) is the vector that indicates the positive entries of z^n, which can be efficiently found in O(V). Therefore inference at prediction time is very fast.

Table 1: Evaluation scores and corresponding losses Δ(y, ȳ)
macro-F_β:        1 − (1 + β²)(yᵀȳ) / (β²yᵀy + ȳᵀȳ)
macro-precision:  1 − yᵀȳ / ȳᵀȳ
macro-recall:     1 − yᵀȳ / yᵀy
Hamming loss:     (yᵀ1 + ȳᵀ1 − 2yᵀȳ) / V

Table 2: Datasets. #train/#test denotes the number of observations used for training and testing respectively; N is the number of labels and D the dimensionality of the features.
dataset    domain   #train  #test  N   D
yeast      biology  1500    917    14  103
scene      image    1211    1196   6   294
medical    text     645     333    45  1449
enron      text     1123    579    53  1001
emotions   music    391     202    6   72

3.4 Other scores
Up to now we have focused on optimising Macro-F_β, which already gives us several scores, in particular Macro-F1, macro-recall and macro-precision. 
We can, however, optimise other scores, in particular the popular Hamming loss – Table 1 lists the corresponding losses, which we then plug into eq. (4).
Note that for Hamming loss and macro-recall the denominator is constant, and therefore it is not necessary to solve (8) multiple times as described earlier, which makes constraint generation as fast as test-time prediction (see subsection 3.3).

4 Experimental Results

In this section we evaluate our method on several real-world datasets, for both macro-F_β and Hamming loss. These scores were chosen because macro-F_β is a generalisation of the most relevant scores, and the Hamming loss is a generic, popular score in the multi-label classification literature.
Datasets
We used 5 publicly available⁵ multi-label datasets: yeast, scene, medical, enron and emotions. We selected these datasets because they cover a variety of application domains – biology, image, text and music – and there are published results of competing methods on them for some of the popular evaluation measures for MLC (macro-F1 and Hamming loss). Table 2 describes them in more detail.
Model selection
Our model requires only one parameter: λ, the trade-off between data fitting and good generalisation. For each experiment we selected it with 5-fold cross-validation using only the training data.
Implementation
Our implementation is in C++, using the Bundle Methods for Risk Minimization (BMRM) of [18] as a base. Source code is available⁶ under the Mozilla Public License.⁷

⁵http://mulan.sourceforge.net/datasets.html
⁶http://users.cecs.anu.edu.au/~jpetterson/
⁷http://www.mozilla.org/MPL/MPL-1.1.html

Comparison to published results on Macro-F1
In our first set of experiments we compared our model to published results on the Macro-F1 score. We strived to make our comparison as broad as possible, but we limited ourselves to methods with published results on public datasets, where the experimental setting was described in enough detail to allow us to make a fair comparison.
We therefore compared our model to Canonical Correlation Analysis [3] (CCA), Binary Method [9] (BM), Classifier Chains [4] (CC), Subset Mapping [19] (SM), Meta Stacking [12] (MS), Ensembles of Binary Method [4] (EBM), Ensembles of Classifier Chains [4] (ECC), Ensembles of Pruned Sets [11] (EPS) and Random K Label Subsets [10] (RAKEL).
Table 3 summarizes our results, along with those of the competing methods, which were taken from compilations by [3] and [4]. We can see that our model has the best performance in yeast, medical and enron. In scene it doesn't perform as well – we suspect this is related to the label cardinality of this dataset: almost all instances have just one label, making it essentially equivalent to a multiclass dataset.
Comparison to published results on Hamming Loss
To illustrate the flexibility of our model we also evaluated it on the Hamming loss. Here, we compared our model to classifier chains [4] (CC), probabilistic classifier chains [1] (PCC), ensembles of classifier chains [4] (ECC) and ensembles of probabilistic classifier chains [1] (EPCC). These are the methods for which we could find Hamming loss results on publicly available data.
Table 4 summarizes our results, along with those of the competing methods, which were taken from a compilation by [1]. 
As can be seen, our model has the best performance on both datasets.
Results on F_β
One strength of our method is that it can be optimised for the specific measure we are interested in. In Macro-F_β, for example, β is a trade-off between precision and recall: when β → 0 we recover precision, and when β → ∞ we get recall. Unlike with other methods, given a desired precision/recall trade-off encoded in a choice of β, we can optimise our model such that it gets the best performance on Macro-F_β. To show this we ran our method on all five datasets, but this time with different choices of β, ranging from 10⁻² to 10². In this case, however, we could not find published results to compare to, so we used Mulan⁸, an open-source library for learning from multi-label datasets, to train three models: BM [9], RAKEL [10] and MLKNN [20]. BM was chosen as a simple baseline, and RAKEL and MLKNN are the two state-of-the-art methods available in the package.
MLKNN has two parameters: the number of neighbors k and a smoothing parameter s controlling the strength of the uniform prior. We kept both fixed to 10 and 1.0, respectively, as was done in [20]. RAKEL has three parameters: the number of models m, the size of the labelset k and the threshold t. Since a complete search over the parameter space would be impractical, we adopted the library's defaults for t and m (respectively 0.5 and 2N) and set k to N/2 as suggested by [4]. For BM we kept the library's defaults.
In Figure 1 we plot the results. We can see that BM tends to prioritize recall (right side of the plot), while ML-KNN and RAKEL give more emphasis to precision (left side). Our method, however, does well on both sides, as it is trained separately for each value of β. In both scene and yeast it dominates the right side while still being competitive on the left side. 
And in the other three datasets – medical, enron and emotions – it practically dominates over the entire range of β.

5 Conclusion and Future Work
We presented a new approach to multi-label learning which consists of predicting sets of instances from the labels. This apparently unintuitive approach is in fact natural since, once the problem is viewed from this perspective, many popular performance measures admit convex relaxations that can be directly and efficiently optimised with existing methods. The method requires only one parameter, as opposed to most existing methods, which have several. The method leverages existing tools from structured output learning, which gives us certain theoretical guarantees. A simple version of constraint generation is presented for small problems, but we also developed a scalable, fast version for dealing with large datasets. We presented a detailed experimental comparison against several state-of-the-art methods, and overall our performance is notably superior.
A fundamental limitation of our current approach is that it does not handle dependencies among labels. It is, however, possible to include such dependencies by assuming, for example, a bivariate feature map on the labels rather than a univariate one. This, however, complicates the algorithmics, and is left as a subject for future research.
Acknowledgements
We thank Miro Dudík as well as the anonymous reviewers for insightful observations that helped to improve the paper. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

⁸http://mulan.sourceforge.net/

Table 3: Macro-F1 results. Bold face indicates the best performance. 
We don't have results for CCA in the Medical and Enron datasets.

Dataset  Ours   CCA    CC     ECC    EBM    EPS    RAKEL  BM     SM     MS
Yeast    0.440  0.346  0.346  0.362  0.420  0.364  0.413  0.326  0.327  0.331
Scene    0.671  0.374  0.696  0.763  0.742  0.729  0.750  0.685  0.666  0.694
Medical  0.420  -      0.377  0.324  0.386  0.382  0.377  0.364  0.321  0.370
Enron    0.243  -      0.198  0.201  0.155  0.201  0.206  0.197  0.144  0.198

Table 4: Hamming loss results. Bold face indicates the best performance.

Dataset   Ours    CC      PCC     ECC     EPCC
Scene     0.1271  0.1780  0.1780  0.1503  0.1498
Emotions  0.2252  0.2448  0.2417  0.2428  0.2372

[Figure 1: five panels (scene, yeast, medical, enron, emotions), each plotting macro-F_β against log(β) for ML-KNN, RaKEL, BM and our method.]

Figure 1: Macro-F_β results on five datasets, with β ranging from 10⁻² to 10² (i.e., log₁₀ β 
ranging from −2 to 2). The center point (log β = 0) corresponds to macro-F1. β controls a trade-off between Macro-precision (left side) and Macro-recall (right side).

References
[1] Krzysztof Dembczynski, Weiwei Cheng, and Eyke Hüllermeier. Bayes Optimal Multilabel Classification via Probabilistic Classifier Chains. In Proc. Intl. Conf. Machine Learning, 2010.
[2] Xinhua Zhang, T. Graepel, and Ralf Herbrich. Bayesian Online Learning for Multi-label and Multi-variate Performance Measures. In Proc. Intl. Conf. on Artificial Intelligence and Statistics, volume 9, pages 956–963, 2010.
[3] Piyush Rai and Hal Daume. Multi-Label Prediction via Sparse Infinite CCA. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1518–1526. 2009.
[4] Jesse Read, Bernhard Pfahringer, Geoffrey Holmes, and Eibe Frank. Classifier chains for multi-label classification. In Wray L. Buntine, Marko Grobelnik, Dunja Mladenic, and John Shawe-Taylor, editors, ECML/PKDD (2), volume 5782 of Lecture Notes in Computer Science, pages 254–269. Springer, 2009.
[5] André Elisseeff and Jason Weston. A kernel method for multi-labelled classification. In Annual ACM Conference on Research and Development in Information Retrieval, pages 274–281, 2005.
[6] Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, and Cordelia Schmid. TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Auto-Annotation. In Proc. Intl. Conf. Computer Vision, 2009.
[7] Douglas W. Oard and Jason R. Baron. Overview of the TREC 2008 Legal Track.
[8] Linli Xu, Martha White, and Dale Schuurmans. Optimal reverse prediction. Proc. Intl. Conf. Machine Learning, pages 1–8, 2009.
[9] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis P. Vlahavas. 
Mining Multi-label Data. Springer, 2009.
[10] Grigorios Tsoumakas and Ioannis P. Vlahavas. Random k-labelsets: An ensemble method for multilabel classification. In Proceedings of the 18th European Conference on Machine Learning (ECML 2007), pages 406–417, Warsaw, Poland, 2007.
[11] Jesse Read, Bernhard Pfahringer, and Geoff Holmes. Multi-label classification using ensembles of pruned sets. In ICDM '08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 995–1000, Washington, DC, USA, 2008. IEEE Computer Society.
[12] Shantanu Godbole and Sunita Sarawagi. Discriminative methods for multi-labeled classification. In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 22–30. Springer, 2004.
[13] Martin Jansche. Maximum expected F-measure training of logistic regression models. HLT, pages 692–699, 2005.
[14] T. Joachims. A support vector method for multivariate performance measures. In Proc. Intl. Conf. Machine Learning, pages 377–384, San Francisco, California, 2005. Morgan Kaufmann Publishers.
[15] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
[16] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453–1484, 2005.
[17] D. E. Knuth. The Art of Computer Programming: Fundamental Algorithms, volume 1. Addison-Wesley, Reading, Massachusetts, second edition, 1998.
[18] Choon Hui Teo, S. V. N. Vishwanathan, Alex J. Smola, and Quoc V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311–365, 2010.
[19] Robert E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
[20] Min-Ling Zhang and Zhi-Hua Zhou. 
ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, July 2007.\n", "award": [], "sourceid": 662, "authors": [{"given_name": "James", "family_name": "Petterson", "institution": null}, {"given_name": "Tib\u00e9rio", "family_name": "Caetano", "institution": null}]}