{"title": "Maximizing Subset Accuracy with Recurrent Neural Networks in Multi-label Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 5413, "page_last": 5423, "abstract": "Multi-label classification is the task of predicting a set of labels for a given input instance. Classifier chains are a state-of-the-art method for tackling such problems; they essentially convert the problem into a sequential prediction task in which the labels are first ordered in an arbitrary fashion and a sequence of binary values is predicted for these labels. In this paper, we replace classifier chains with recurrent neural networks, a sequence-to-sequence prediction approach that has recently been applied successfully to sequential prediction tasks in many domains. The key advantage of this approach is that it allows focusing on the prediction of the positive labels only, a much smaller set than the full set of possible labels. Moreover, parameter sharing across all classifiers makes it possible to better exploit information from previous decisions. As both classifier chains and recurrent neural networks depend on a fixed ordering of the labels, which is typically not part of a multi-label problem specification, we also compare different ways of ordering the label set and give some recommendations on suitable ordering strategies.", "full_text": "Maximizing Subset Accuracy with Recurrent Neural Networks in Multi-label Classification\n\nJinseok Nam1, Eneldo Loza Mencía1, Hyunwoo J. Kim2, and Johannes Fürnkranz1\n\n1Knowledge Engineering Group, TU Darmstadt\n2Department of Computer Sciences, University of Wisconsin-Madison\n\nAbstract\n\nMulti-label classification is the task of predicting a set of labels for a given input instance. 
Classifier chains are a state-of-the-art method for tackling such problems; they essentially convert the problem into a sequential prediction task in which the labels are first ordered in an arbitrary fashion and a sequence of binary values is predicted for these labels. In this paper, we replace classifier chains with recurrent neural networks, a sequence-to-sequence prediction approach that has recently been applied successfully to sequential prediction tasks in many domains. The key advantage of this approach is that it allows focusing on the prediction of the positive labels only, a much smaller set than the full set of possible labels. Moreover, parameter sharing across all classifiers makes it possible to better exploit information from previous decisions. As both classifier chains and recurrent neural networks depend on a fixed ordering of the labels, which is typically not part of a multi-label problem specification, we also compare different ways of ordering the label set and give some recommendations on suitable ordering strategies.\n\n1 Introduction\n\nThere is a growing need for scalable multi-label classification (MLC) systems, which, e.g., allow assigning multiple topic terms to a document or identifying objects in an image. While the simple binary relevance (BR) method approaches this problem by treating multiple targets independently, current research in MLC has focused on designing algorithms that exploit the underlying label structures. More formally, MLC is the task of learning a function f that maps inputs to subsets of a label set $\mathcal{L} = \{1, 2, \dots, L\}$. Consider a set of N samples $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, each of which consists of an input $x \in \mathcal{X}$ and its target $y \in \mathcal{Y}$; the $(x_n, y_n)$ are assumed to be drawn i.i.d. from an unknown distribution $P(X, Y)$ over a sample space $\mathcal{X} \times \mathcal{Y}$. We let $T_n = |y_n|$ denote the size of the label set associated with $x_n$ and $C = \frac{1}{N}\sum_{n=1}^{N} T_n$ the cardinality of $\mathcal{D}$, which is usually much smaller than L. Often, it is convenient to view y not as a subset of $\mathcal{L}$ but as a binary vector of size L, i.e., $y \in \{0, 1\}^L$. Given a function f parameterized by $\theta$ that returns predicted outputs $\hat{y}$ for inputs x, i.e., $\hat{y} \leftarrow f(x; \theta)$, and a loss function $\ell : (y, \hat{y}) \to \mathbb{R}$ which measures the discrepancy between y and $\hat{y}$, the goal is to find an optimal parametrization $f^*$ that minimizes the expected loss on an unknown sample drawn from $P(X, Y)$, i.e., $f^* = \arg\min_f \mathbb{E}_X \left[ \mathbb{E}_{Y|X} [\ell(Y, f(X; \theta))] \right]$. While this expected risk minimization over $P(X, Y)$ is intractable, for a given observation x it can be simplified to $f^*(x) = \arg\min_f \mathbb{E}_{Y|X} [\ell(Y, f(x; \theta))]$. A natural choice for the loss function is the subset 0/1 loss defined as $\ell_{0/1}(y, f(x; \theta)) = I[y \neq \hat{y}]$, which is a generalization of the 0/1 loss in binary classification to multi-label problems. It can be interpreted as an objective to find the mode of the joint probability of label sets y given instances x: $\mathbb{E}_{Y|X} [\ell_{0/1}(Y, \hat{y})] = 1 - P(Y = \hat{y} \mid X = x)$. Conversely, $1 - \ell_{0/1}(y, f(x; \theta))$ is often referred to as subset accuracy in the literature.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n2 Subset Accuracy Maximization in Multi-label Classification\n\nFor maximizing subset accuracy, there are two principled ways of reducing an MLC problem to multiple subproblems. The simplest method, label powerset (LP), defines the set of all possible label combinations $\mathcal{S}_L = \{\{1\}, \{2\}, \dots, \{1, 2, \dots, L\}\}$, from which a new class label is assigned to each label subset consisting of positive labels in $\mathcal{D}$. 
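The subset 0/1 loss and subset accuracy described above are straightforward to compute over binary label vectors. The following is a minimal Python sketch; the function names are illustrative and not from the paper.

```python
import numpy as np

def subset_zero_one_loss(y_true, y_pred):
    """Subset 0/1 loss: 1 if the predicted binary label vector differs
    from the true one in any position, 0 if they match exactly."""
    return int(not np.array_equal(np.asarray(y_true), np.asarray(y_pred)))

def subset_accuracy(Y_true, Y_pred):
    """Complement of the mean subset 0/1 loss over a set of instances:
    the fraction of instances whose label set is predicted exactly."""
    losses = [subset_zero_one_loss(y, y_hat) for y, y_hat in zip(Y_true, Y_pred)]
    return 1.0 - float(np.mean(losses))

# Two of three instances match exactly, so subset accuracy is 2/3.
Y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
Y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
print(subset_accuracy(Y_true, Y_pred))
```

Note that this all-or-nothing criterion gives no credit for partially correct label vectors, which is exactly why it pushes a learner toward the mode of the joint label distribution rather than the per-label marginals.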
LP then addresses MLC as a multi-class classification problem with $\min(N, 2^L)$ possible classes:\n\n$$P(y_1, y_2, \dots, y_L \mid x) \xrightarrow{\text{LP}} P(y_{\text{LP}} = k \mid x) \quad (1)$$\n\nwhere $k \in \{1, 2, \dots, \min(N, 2^L)\}$. While LP is appealing because most well-studied multi-class classification methods can be used, training LP models becomes intractable for large-scale problems with an increasing number of labels in $\mathcal{S}_L$. Even if the number of labels L is small enough, the problem is still prone to data scarcity because each label subset in LP will in general have only a few training instances. An effective solution to these problems is to build an ensemble of LP models learning from randomly constructed small label subset spaces [29].\n\nAn alternative approach is to learn the joint probability of the labels directly, which is prohibitively expensive due to the $2^L$ label configurations. To address this problem, Dembczyński et al. [3] have proposed probabilistic classifier chains (PCC), which decompose the joint probability into L conditional probabilities:\n\n$$P(y_1, y_2, \dots, y_L \mid x) = \prod_{i=1}^{L} P(y_i \mid y_1, \dots, y_{i-1}, x)$$
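Both reductions can be illustrated with a short sketch: a toy LP encoding that maps each observed label subset to a multi-class label, and the chain-rule product underlying PCC. All function names and the toy conditional models below are hypothetical, chosen purely for illustration.

```python
def label_powerset_fit(Y):
    """LP reduction: map each distinct binary label vector (label subset)
    seen in the training data to a new class id. For N instances and L
    labels, at most min(N, 2**L) classes can occur."""
    classes = {}
    encoded = []
    for y in Y:
        key = tuple(y)
        if key not in classes:
            classes[key] = len(classes)  # assign the next unused class id
        encoded.append(classes[key])
    return classes, encoded

def pcc_joint(conditionals, y):
    """PCC chain rule: the joint probability of a label vector y is the
    product of per-label conditionals P(y_i | y_1..y_{i-1}, x); here the
    dependence on x is folded into the toy conditional functions."""
    p = 1.0
    for i, y_i in enumerate(y):
        p *= conditionals[i](y[:i], y_i)
    return p

# Three distinct label subsets among four instances -> three LP classes.
Y = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1]]
classes, y_lp = label_powerset_fit(Y)
print(y_lp)  # [0, 1, 0, 2]

# Toy conditionals for L = 2: P(y1 = 1) = 0.7, and y2 copies y1 with
# probability 0.9, so P(y = (1, 1) | x) = 0.7 * 0.9.
cond = [
    lambda prev, y_i: 0.7 if y_i == 1 else 0.3,
    lambda prev, y_i: 0.9 if y_i == prev[0] else 0.1,
]
print(pcc_joint(cond, (1, 1)))
```

The sketch makes the trade-off concrete: LP collapses each subset into one opaque class (so rare subsets get few training examples), while PCC keeps L tractable conditional models at the cost of fixing a label order.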