{"title": "On the Optimality of Classifier Chain for Multi-label Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 712, "page_last": 720, "abstract": "To capture the interdependencies between labels in multi-label classification problems, classifier chain (CC) tries to take the multiple labels of each instance into account under a deterministic high-order Markov Chain model. Since its performance is sensitive to the choice of label order, the key issue is how to determine the optimal label order for CC. In this work, we first generalize the CC model over a random label order. Then, we present a theoretical analysis of the generalization error for the proposed generalized model. Based on our results, we propose a dynamic programming based classifier chain (CC-DP) algorithm to search the globally optimal label order for CC and a greedy classifier chain (CC-Greedy) algorithm to find a locally optimal CC. Comprehensive experiments on a number of real-world multi-label data sets from various domains demonstrate that our proposed CC-DP algorithm outperforms state-of-the-art approaches and the CC-Greedy algorithm achieves comparable prediction performance with CC-DP.", "full_text": "On the Optimality of Classi\ufb01er Chain for\n\nMulti-label Classi\ufb01cation\n\nWeiwei Liu\n\nIvor W. Tsang\u2217\n\nCentre for Quantum Computation and Intelligent Systems\n\nUniversity of Technology, Sydney\n\nliuweiwei863@gmail.com, ivor.tsang@uts.edu.au\n\nAbstract\n\nTo capture the interdependencies between labels in multi-label classi\ufb01cation prob-\nlems, classi\ufb01er chain (CC) tries to take the multiple labels of each instance into\naccount under a deterministic high-order Markov Chain model. Since its perfor-\nmance is sensitive to the choice of label order, the key issue is how to determine\nthe optimal label order for CC. In this work, we \ufb01rst generalize the CC model over\na random label order. 
Then, we present a theoretical analysis of the generalization error for the proposed generalized model. Based on our results, we propose a dynamic programming based classifier chain (CC-DP) algorithm to search the globally optimal label order for CC and a greedy classifier chain (CC-Greedy) algorithm to find a locally optimal CC. Comprehensive experiments on a number of real-world multi-label data sets from various domains demonstrate that our proposed CC-DP algorithm outperforms state-of-the-art approaches and the CC-Greedy algorithm achieves comparable prediction performance with CC-DP.

1 Introduction

Multi-label classification, where each instance can belong to multiple labels simultaneously, has significantly attracted the attention of researchers as a result of its various applications, ranging from document classification and gene function prediction, to automatic image annotation. For example, a document can be associated with a range of topics, such as Sports, Finance and Education [1]; a gene belongs to the functions of protein synthesis, metabolism and transcription [2]; an image may have both beach and tree tags [3].
One popular strategy for multi-label classification is to reduce the original problem into many binary classification problems. Many works have followed this strategy. For example, binary relevance (BR) [4] is a simple approach for multi-label learning which independently trains a binary classifier for each label. Recently, Dembczynski et al. [5] have shown that methods of multi-label learning which explicitly capture label dependency will usually achieve better prediction performance. Therefore, modeling the label dependency is one of the major challenges in multi-label classification problems. Many multi-label learning models [5, 6, 7, 8, 9, 10, 11, 12] have been developed to capture label dependency. 
Amongst them, the classifier chain (CC) model is one of the most popular methods due to its simplicity and promising experimental results [6].
CC works as follows: one classifier is trained for each label. For the (i + 1)th label, each instance is augmented with the 1st, 2nd, ..., ith labels as the input to train the (i + 1)th classifier. Given a new instance to be classified, CC firstly predicts the value of the first label, then takes this instance together with the predicted value as the input to predict the value of the next label. CC proceeds in this way until the last label is predicted. However, here is the question: Does the label order affect the performance of CC? Apparently yes, because different classifier chains involve different classifiers trained on different training sets.

∗ Corresponding author

Thus, to reduce the influence of the label order, Read et al. [6] proposed the ensembled classifier chain (ECC) to average the multi-label predictions of CC over a set of random chain orderings. Since the performance of CC is sensitive to the choice of label order, there is another important question: Is there any globally optimal classifier chain which can achieve the optimal prediction performance for CC? If yes, how can the globally optimal classifier chain be found?
To answer the last two questions, we first generalize the CC model over a random label order. We then present a theoretical analysis of the generalization error for the proposed generalized model. Our results show that the upper bound of the generalization error depends on the sum of the reciprocals of the squared margins over the labels. Thus, we can answer the second question: the globally optimal CC exists only when the minimization of the upper bound is achieved over this CC. To find the globally optimal CC, we can search over q! 
different label orders¹, where q denotes the number of labels, which is computationally infeasible for a large q. In this paper, we propose the dynamic programming based classifier chain (CC-DP) algorithm to simplify the search algorithm, which requires O(q³nd) time complexity. Furthermore, to speed up the training process, a greedy classifier chain (CC-Greedy) algorithm is proposed to find a locally optimal CC, where the time complexity of the CC-Greedy algorithm is O(q²nd).
Notations: Assume x_t ∈ R^d is a real vector representing an input or instance (feature) for t ∈ {1, ..., n}. n denotes the number of training samples. Y_t ⊆ {λ_1, λ_2, ..., λ_q} is the corresponding output (label). y_t ∈ {0, 1}^q is used to represent the label set Y_t, where y_t(j) = 1 if and only if λ_j ∈ Y_t.

2 Related work and preliminaries

To capture label dependency, Hsu et al. [13] first use the compressed sensing technique to handle the multi-label classification problem. They project the original label space into a low dimensional label space. A regression model is trained on each transformed label. Recovering multi-labels from the regression output usually involves solving a quadratic programming problem [13], and many works have been developed in this way [7, 14, 15]. Such methods mainly aim to use different projection methods to transform the original label space into another effective label space.
Another important approach attempts to exploit the different orders (first-order, second-order and high-order) of label correlations [16]. Following this way, some works also try to provide a probabilistic interpretation for label correlations. 
For example, Guo and Gu [8] model the label correlations using a conditional dependency network; PCC [5] exploits a high-order Markov Chain model to capture the correlations between the labels and provide an accurate probabilistic interpretation of CC. Other works [6, 9, 10] focus on modeling the label correlations in a deterministic way, and CC is one of the most popular methods among them. This work will mainly focus on the deterministic high-order classifier chain.

2.1 Classifier chain

Similar to BR, the classifier chain (CC) model [6] trains q binary classifiers h_j (j ∈ {1, ..., q}). Classifiers are linked along a chain where each classifier h_j deals with the binary classification problem for label λ_j. The augmented vector {x_t, y_t(1), ..., y_t(j)}_{t=1}^n is used as the input for training classifier h_{j+1}. Given a new testing instance x, classifier h_1 in the chain is responsible for predicting the value of y(1) using input x. Then, h_2 predicts the value of y(2) taking x plus the predicted value of y(1) as an input. Following in this way, h_{j+1} predicts y(j + 1) using the predicted values of y(1), ..., y(j) as additional input information. CC passes label information between classifiers, allowing CC to exploit the label dependence and thus overcome the label independence problem of BR. Essentially, it builds a deterministic high-order Markov Chain model to capture the label correlations.

¹! represents the factorial notation.

2.2 Ensembled classifier chain

Different classifier chains involve different classifiers learned on different training sets and thus the order of the chain itself clearly affects the prediction performance. To solve the issue of selecting a chain order for CC, Read et al. 
[6] proposed the extension of CC, called ensembled classifier chain (ECC), to average the multi-label predictions of CC over a set of random chain orderings. ECC first randomly reorders the labels {λ_1, λ_2, ..., λ_q} many times. Then, CC is applied to the reordered labels each time and the performance of CC is averaged over those runs to obtain the final prediction performance.

3 Proposed model and generalization error analysis

3.1 Generalized classifier chain

We generalize the CC model over a random label order, called the generalized classifier chain (GCC) model. Assume the labels {λ_1, λ_2, ..., λ_q} are randomly reordered as {ζ_1, ζ_2, ..., ζ_q}, where ζ_j = λ_k means label λ_k moves to position j from k. In the GCC model, classifiers are also linked along a chain where each classifier h_j deals with the binary classification problem for label ζ_j (λ_k). GCC follows the same training and testing procedures as CC, while the only difference is the label order. In the GCC model, for input x_t, y_t(j) = 1 if and only if ζ_j ∈ Y_t.

3.2 Generalization error analysis

In this section, we analyze the generalization error bound of the multi-label classification problem using GCC based on the techniques developed for the generalization performance of classifiers with a large margin [17] and perceptron decision trees [18].
Let X represent the input space. Both s and s̄ are m samples drawn independently according to an unknown distribution D. We denote logarithms to base 2 by log. If S is a set, |S| denotes its cardinality. ‖ · ‖ means the l2 norm. We train a support vector machine (SVM) for each label ζ_j. Let {x_t}_{t=1}^n be the features and {y_t(ζ_j)}_{t=1}^n the labels; the output parameter of the SVM is defined as [w_j, b_j] = SVM({x_t, y_t(ζ_1), ..., y_t(ζ_{j−1})}_{t=1}^n, {y_t(ζ_j)}_{t=1}^n). 
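To make the chained training and prediction procedure concrete, the loop can be sketched in code. This is a minimal illustration only, not the paper's implementation: a plain perceptron (the hypothetical helpers `train_perceptron` and `predict` below) stands in for the per-label SVM, and the label order is simply the given one.

```python
def predict(w, x):
    """Binary prediction of a linear model w (last entry of w is the bias)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x + [1.0])) > 0 else 0

def train_perceptron(X, y, epochs=20, lr=0.1):
    """Tiny stand-in for the per-label SVM: a perceptron with a bias term."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for x, t in zip(X, y):
            pred = predict(w, x)
            for i, xi in enumerate(x + [1.0]):   # update only on mistakes
                w[i] += lr * (t - pred) * xi
    return w

def train_chain(X, Y):
    """Classifier j is trained with the true values of labels 1..j-1
    appended to the feature vector (the augmented input)."""
    q = len(Y[0])
    chain = []
    for j in range(q):
        X_aug = [x + [float(y[k]) for k in range(j)] for x, y in zip(X, Y)]
        chain.append(train_perceptron(X_aug, [y[j] for y in Y]))
    return chain

def predict_chain(chain, x):
    """Predict labels one by one, feeding earlier predictions forward."""
    preds = []
    for w in chain:
        preds.append(predict(w, x + [float(p) for p in preds]))
    return preds
```

For instance, on a toy data set where the second label always equals the first, the second classifier in the chain can exploit its augmented input and simply copy the first prediction forward.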
The margin for label ζ_j is defined as:

γ_j = 1/‖w_j‖_2    (1)

We begin with the definition of the fat shattering dimension.
Definition 1 ([19]). Let H be a set of real valued functions. We say that a set of points P is γ-shattered by H relative to r = (r_p)_{p∈P} if there are real numbers r_p indexed by p ∈ P such that for all binary vectors b indexed by P, there is a function f_b ∈ H satisfying

f_b(p) ≥ r_p + γ if b_p = 1, and f_b(p) ≤ r_p − γ otherwise.

The fat shattering dimension fat(γ) of the set H is a function from the positive real numbers to the integers which maps a value γ to the size of the largest γ-shattered set, if this is finite, or infinity otherwise.
Assume H is the real valued function class and h ∈ H. l(y, h(x)) denotes the loss function. The expected error of h is defined as er_D[h] = E_{(x,y)∼D}[l(y, h(x))], where (x, y) is drawn from the unknown distribution D. Here we select the 0-1 loss function, so er_D[h] = P_{(x,y)∼D}(h(x) ≠ y). The empirical error is defined as er_s[h] = (1/n) ∑_{t=1}^{n} [y_t ≠ h(x_t)].²
Suppose N(ε, H, s) is the ε-covering number of H with respect to the l∞ pseudo-metric measuring the maximum discrepancy on the sample s. The notion of the covering number can be referred to in the Supplementary Materials. We introduce the following general corollary regarding the bound of the covering number:
Corollary 1 ([17]). Let H be a class of functions X → [a, b] and D a distribution over X. Choose 0 < ε < 1 and let d = fat(ε/4) ≤ em. Then

E(N(ε, H, s)) ≤ 2 (4m(b − a)²/ε²)^{d log(2em(b−a)/(dε))}    (2)

where the expectation E is over samples s ∈ X^m drawn according to D^m.

²The expression [y_t ≠ h(x_t)] evaluates to 1 if y_t ≠ h(x_t) is true and to 0 otherwise.

We study the generalization error bound of the specified GCC with the specified number of labels and margins. Let G be the set of classifiers of GCC, G = {h_1, h_2, ..., h_q}. er_s[G] denotes the fraction of the number of errors that GCC makes on s. Define x̂ ∈ X × {0, 1}, ĥ_j(x̂) = h_j(x)(1 − y(j)) − h_j(x)y(j). If an instance x ∈ X is correctly classified by h_j, then ĥ_j(x̂) < 0. Moreover, we introduce the following proposition:
Proposition 1. If an instance x ∈ X is misclassified by a GCC model, then ∃h_j ∈ G, ĥ_j(x̂) ≥ 0.
Lemma 1. Given a specified GCC model with q labels and with margins γ_1, γ_2, ..., γ_q for each label satisfying k_i = fat(γ_i/8), where fat is continuous from the right: if GCC has correctly classified m multi-labeled examples s generated independently according to the unknown (but fixed) distribution D and s̄ is a set of another m multi-labeled examples, then we can bound the following probability to be less than δ: P^{2m}{s s̄ : ∃ a GCC model that correctly classifies s, fraction of s̄ misclassified > ε(m, q, δ)} < δ, where ε(m, q, δ) = (1/m)(Q log(32m) + log(2^q/δ)) and Q = ∑_{i=1}^{q} k_i log(8em/k_i).
Proof (of Lemma 1). Suppose G is a GCC model with q labels and with margins γ_1, γ_2, ..., γ_q; the probability event in Lemma 1 can be described as

A = {s s̄ : ∃G, k_i = fat(γ_i/8), er_s[G] = 0, er_s̄[G] > ε}.

Let ŝ and s̄̂ denote two different sets of m examples, which are drawn i.i.d. from the distribution D × {0, 1}. 
Applying the definition of x̂, ĥ and Proposition 1, the event can also be written as

A = {ŝ s̄̂ : ∃G, γ̂_i = γ_i/2, k_i = fat(γ̂_i/4), er_s[G] = 0, r_i = max_t ĥ_i(x̂_t), 2γ̂_i = −r_i, |{ŷ ∈ s̄̂ : ∃h_i ∈ G, ĥ_i(ŷ) ≥ 2γ̂_i + r_i}| > mε}.

Here, −max_t ĥ_i(x̂_t) means the minimal value of |h_i(x)|, which represents the margin for label ζ_i, so 2γ̂_i = −r_i. Let γ_{k_i} = min{γ′ : fat(γ′/4) ≤ k_i}, so γ_{k_i} ≤ γ̂_i. We define the following function:

π(ĥ) = 0 if ĥ ≥ 0;  π(ĥ) = −2γ_{k_i} if ĥ ≤ −2γ_{k_i};  π(ĥ) = ĥ otherwise;

so π(ĥ) ∈ [−2γ_{k_i}, 0]. Let π(Ĝ) = {π(ĥ) : h ∈ G}.
Let B^{k_i}_{ŝ s̄̂} represent the minimal γ_{k_i}-cover set of π(Ĝ) in the pseudo-metric d_{ŝ s̄̂}. We have that for any h_i ∈ G, there exists f̃ ∈ B^{k_i}_{ŝ s̄̂} with |π(ĥ_i(ẑ)) − π(f̃(ẑ))| < γ_{k_i} for all ẑ ∈ ŝ s̄̂. For all x̂ ∈ ŝ, by the definition of r_i, ĥ_i(x̂) ≤ r_i = −2γ̂_i, and since γ_{k_i} ≤ γ̂_i, we have ĥ_i(x̂) ≤ −2γ_{k_i} and π(ĥ_i(x̂)) = −2γ_{k_i}, so π(f̃(x̂)) < −2γ_{k_i} + γ_{k_i} = −γ_{k_i}. However, there are at least mε points ŷ ∈ s̄̂ such that ĥ_i(ŷ) ≥ 0, so π(f̃(ŷ)) > −γ_{k_i} > max_t π(f̃(x̂_t)). Since π only reduces separation between output values, we conclude that the inequality f̃(ŷ) > max_t f̃(x̂_t) holds. Moreover, the mε points in s̄̂ with the largest f̃ values must remain for the inequality to hold. By the permutation argument, at most 2^{−mε} of the sequences obtained by swapping corresponding points satisfy the conditions for fixed f̃.
As for any h_i ∈ G there exists f̃ ∈ B^{k_i}_{ŝ s̄̂}, there are |B^{k_i}_{ŝ s̄̂}| possibilities of f̃ that satisfy the inequality for k_i. Note that |B^{k_i}_{ŝ s̄̂}| is a positive integer which is usually bigger than 1, and by the union bound, we get the following inequality:

P(A) ≤ (E(|B^{k_1}_{ŝ s̄̂}|) + ··· + E(|B^{k_q}_{ŝ s̄̂}|)) 2^{−mε} ≤ (E(|B^{k_1}_{ŝ s̄̂}|) × ··· × E(|B^{k_q}_{ŝ s̄̂}|)) 2^{−mε}

Since every set of points γ-shattered by π(Ĝ) can be γ-shattered by Ĝ, fat_{π(Ĝ)}(γ) ≤ fat_{Ĝ}(γ), where Ĝ = {ĥ : h ∈ G}. Hence, by Corollary 1 (setting [a, b] to [−2γ_{k_i}, 0], ε to γ_{k_i} and m to 2m),

E(|B^{k_i}_{ŝ s̄̂}|) = E(N(γ_{k_i}, π(Ĝ), ŝ s̄̂)) ≤ 2(32m)^{d log(8em/d)}

where d = fat_{π(Ĝ)}(γ_{k_i}/4) ≤ fat_{Ĝ}(γ_{k_i}/4) ≤ k_i. Thus E(|B^{k_i}_{ŝ s̄̂}|) ≤ 2(32m)^{k_i log(8em/k_i)}, and we obtain

P(A) ≤ (E(|B^{k_1}_{ŝ s̄̂}|) × ··· × E(|B^{k_q}_{ŝ s̄̂}|)) 2^{−mε} ≤ ∏_{i=1}^{q} 2(32m)^{k_i log(8em/k_i)} · 2^{−mε} = 2^q (32m)^Q 2^{−mε}

where Q = ∑_{i=1}^{q} k_i log(8em/k_i). And so P(A) < δ provided

ε(m, q, δ) ≥ (1/m)(Q log(32m) + log(2^q/δ))

as required.
Lemma 1 applies to a particular GCC model with a specified number of labels and a specified margin for each label. In practice, we will observe the margins after running the GCC model. Thus, we must bound the probabilities uniformly over all of the possible margins that can arise to obtain a practical bound. The generalization error bound of the multi-label classification problem using GCC is shown as follows:
Theorem 1. 
Suppose a random m multi-labeled sample can be correctly classified using a GCC model, and suppose this GCC model contains q classifiers with margins γ_1, γ_2, ..., γ_q for each label. Then we can bound the generalization error with probability greater than 1 − δ to be less than

(130R²/m) (Q′ log(8em) log(32m) + log(2(2m)^q/δ))

where Q′ = ∑_{i=1}^{q} 1/(γ_i)² and R is the radius of a ball containing the support of the distribution.
Before proving Theorem 1, we state one key symmetrization lemma and Theorem 2.
Lemma 2 (Symmetrization). Let H be the real valued function class. s and s̄ are m samples both drawn independently according to the unknown distribution D. If mε² ≥ 2, then

P_s(sup_{h∈H} |er_D[h] − er_s[h]| ≥ ε) ≤ 2P_{s s̄}(sup_{h∈H} |er_s̄[h] − er_s[h]| ≥ ε/2)    (3)

The proof details of this lemma can be found in the Supplementary Material.
Theorem 2 ([20]). Let H be restricted to points in a ball of M dimensions of radius R about the origin. Then

fat_H(γ) ≤ min{R²/γ², M + 1}    (4)

Proof (of Theorem 1). We must bound the probabilities over different margins. We first use Lemma 2 to bound the probability of error in terms of the probability of the discrepancy between the performance on two halves of a double sample. Then we combine this result with Lemma 1. We must consider all possible patterns of the k_i's for the labels ζ_i. The largest value of k_i is m. Thus, for fixed q, we can bound the number of possibilities by m^q. Hence, there are m^q applications of Lemma 1. Let c_i = {γ_1, γ_2, ..., γ_q} denote the i-th combination of margins varied in {1, ..., m}^q. G denotes a set of GCC models. 
The generalization error of G can be represented as er_D[G], and er_s[G] is 0, where G ∈ G. The uniform convergence bound of the generalization error is

P_s(sup_{G∈G} |er_D[G] − er_s[G]| ≥ ε)

Applying Lemma 2,

P_s(sup_{G∈G} |er_D[G] − er_s[G]| ≥ ε) ≤ 2P_{s s̄}(sup_{G∈G} |er_s̄[G] − er_s[G]| ≥ ε/2)

Let J_{c_i} = {s s̄ : ∃ a GCC model G with q labels and with margins c_i : k_i = fat(γ_i/8), er_s[G] = 0, er_s̄[G] ≥ ε/2}. Clearly,

P_{s s̄}(sup_{G∈G} |er_s̄[G] − er_s[G]| ≥ ε/2) ≤ P^{2m}(⋃_{i=1}^{m^q} J_{c_i})

As k_i still satisfies k_i = fat(γ_i/8), Lemma 1 can still be applied to each case of P^{2m}(J_{c_i}). Let δ_k = δ/m^q. Applying Lemma 1 (replacing δ by δ_k/2), we get:

P^{2m}(J_{c_i}) < δ_k/2

where ε(m, k, δ_k/2) ≥ (2/m)(Q log(32m) + log(2 × 2^q/δ_k)) and Q = ∑_{i=1}^{q} k_i log(4em/k_i). By the union bound, it suffices to show that P^{2m}(⋃_{i=1}^{m^q} J_{c_i}) ≤ ∑_{i=1}^{m^q} P^{2m}(J_{c_i}) < δ_k/2 × m^q = δ/2. Applying Lemma 2,

P_s(sup_{G∈G} |er_D[G] − er_s[G]| ≥ ε) ≤ 2P_{s s̄}(sup_{G∈G} |er_s̄[G] − er_s[G]| ≥ ε/2) ≤ 2P^{2m}(⋃_{i=1}^{m^q} J_{c_i}) < δ

Thus, P_s(sup_{G∈G} |er_D[G] − er_s[G]| ≤ ε) ≥ 1 − δ. Let R be the radius of a ball containing the support of the distribution. Applying Theorem 2, we get k_i = fat(γ_i/8) ≤ 65R²/(γ_i)². Note that we have replaced the constant 8² = 64 by 65 in order to ensure the continuity from the right required for the application of Lemma 1. We have upper bounded log(8em/k_i) by log(8em). Thus,

er_D[G] ≤ (2/m)(Q log(32m) + log(2(2m)^q/δ)) ≤ (130R²/m)(Q′ log(8em) log(32m) + log(2(2m)^q/δ))

where Q′ = ∑_{i=1}^{q} 1/(γ_i)².
Given the training data size and the number of labels, Theorem 1 reveals one important factor in reducing the generalization error bound for the GCC model: the minimization of the sum of the reciprocals of the squared margins over the labels. Thus, we obtain the following corollary:
Corollary 2 (Globally Optimal Classifier Chain). Suppose a random m multi-labeled sample with q labels can be correctly classified using a GCC model; this GCC model is the globally optimal classifier chain if and only if the minimization of Q′ in Theorem 1 is achieved over this classifier chain.
Given the number of labels q, there are q! different label orders. It is very expensive to find the globally optimal CC, which can minimize Q′, by searching over all of the label orders. Next, we discuss two simple algorithms.

4 Optimal classifier chain algorithm

In this section, we propose two simple algorithms for finding the optimal CC based on our result in Section 3. To clearly state the algorithms, we redefine the margins with label order information. Given the label set M = {λ_1, λ_2, ..., λ_q}, suppose a GCC model contains q classifiers. Let o_i (1 ≤ o_i ≤ q) denote the order of λ_i in the GCC model; γ_i^{o_i} represents the margin for label λ_i, with the previous o_i − 1 labels as the augmented input. If o_i = 1, then γ_i^1 represents the margin for label λ_i, without augmented input. 
Then Q′ is redefined as Q′ = ∑_{i=1}^{q} 1/(γ_i^{o_i})².

4.1 Dynamic programming algorithm

To simplify the search algorithm mentioned before, we propose the CC-DP algorithm to find the globally optimal CC. Noting that

Q′ = ∑_{i=1}^{q} 1/(γ_i^{o_i})² = [1/(γ_q^{o_q})² + ··· + 1/(γ_{k+1}^{o_{k+1}})² + ∑_{j=1}^{k} 1/(γ_j^{o_j})²],

we explore the idea of DP to iteratively optimize Q′ over a subset of M with the length of 1, 2, ..., q. Finally, we can obtain the optimal Q′ over M. Assume i ∈ {1, ..., q}. Let V(i, η) be the optimal Q′ over a subset of M with the length of η (1 ≤ η ≤ q), where the label order ends with label λ_i. Suppose M_i^η represents the corresponding label set for V(i, η). When η = q, V(i, q) is the optimal Q′ over M, where the label order ends with label λ_i. The DP equation is written as:

V(i, η + 1) = min_{j ≠ i, λ_i ∉ M_j^η} { 1/(γ_i^{η+1})² + V(j, η) }    (5)
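In code, recurrence (5) can be sketched as follows. This is an illustrative sketch, not the paper's implementation: `inv_sq_margin(i, prefix)` is an assumed oracle that trains the classifier for label i with the labels in `prefix` as augmented input and returns 1/γ² (in the paper, each such call would be one linear SVM training in O(nd), giving the stated O(q³nd) and O(q²nd) totals). The greedy variant of Section 4.2 is included for comparison.

```python
def cc_dp(q, inv_sq_margin):
    """Bottom-up DP over chain length: V[i] is the best objective of a
    chain of the current length ending with label i, M[i] its label set."""
    V = {i: inv_sq_margin(i, ()) for i in range(q)}   # chains of length 1
    M = {i: (i,) for i in range(q)}
    for _ in range(1, q):                             # grow length by one
        V_new, M_new = {}, {}
        for i in range(q):
            best_val, best_j = None, None
            for j in V:
                if j == i or i in M[j]:               # label i already used
                    continue
                cand = inv_sq_margin(i, M[j]) + V[j]
                if best_val is None or cand < best_val:
                    best_val, best_j = cand, j
            if best_j is not None:
                V_new[i], M_new[i] = best_val, M[best_j] + (i,)
        V, M = V_new, M_new
    end = min(V, key=V.get)                           # min_i V(i, q)
    return V[end], M[end]

def cc_greedy(q, inv_sq_margin):
    """Locally optimal chain: always append the label with the largest
    margin (smallest 1/gamma^2) given the labels chosen so far."""
    order, remaining = [], set(range(q))
    while remaining:
        best = min(remaining, key=lambda i: inv_sq_margin(i, tuple(order)))
        order.append(best)
        remaining.remove(best)
    return order
```

As in the text, the DP keeps only the best chain per (length, ending label) state, rather than enumerating all q! orders.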
Last, we iteratively solve this DP Equation, and use mini\u2208{1,\u00b7\u00b7\u00b7 ,q} V (i, q) to get\nthe optimal solution, which requires at most O(q3nd) time complexity.\nTheorem 3 (Correctness of CC-DP). Q\ncan \ufb01nd the globally optimal CC.\n\n\u2032 can be minimized by CC-DP, which means this Algorithm\n\n{ 1\n\n(\u03b32\n\nj\n\nThe proof can be referred to in the Supplementary Materials.\n\n4.2 Greedy algorithm\n\nWe propose a CC-Greedy algorithm to \ufb01nd a locally optimal CC to speed up the CC-DP algorithm.\nTo save time, we construct only one classi\ufb01er chain with the locally optimal label order. Based on\nthe training instances, we select the label from {\u03bb1, \u03bb2,\u00b7\u00b7\u00b7 , \u03bbq} as the \ufb01rst label, if the maximum\nmargin can be achieved over this label, without augmented input. The \ufb01rst label is denoted by \u03b61.\nThen we select the label from the remainder as the second label, if the maximum margin can be\nachieved over this label with \u03b61 as the augmented input. We continue in this way until the last label\nis selected. Finally, this algorithm will converge to the locally optimal CC. We present the details\nof the CC-Greedy algorithm in the Supplementary Materials, where the time complexity of this\nalgorithm is O(q2nd).\n\n5 Experiment\n\nIn this section, we perform experimental studies on a number of benchmark data sets from different\ndomains to evaluate the performance of our proposed algorithms for multi-label classi\ufb01cation. All\nthe methods are implemented in Matlab and all experiments are conducted on a workstation with a\n3.2GHZ Intel CPU and 4GB main memory running 64-bit Windows platform.\n\n5.1 Data sets and baselines\n\nWe conduct experiments on eight real-world data sets with various domains from three websites.345\nFollowing the experimental settings in [5] and [7], we preprocess the LLog, yahoo art, eurlex sm\nand eurlex ed data sets. 
Their statistics are presented in the Supplementary Materials. We compare our algorithms with several baseline methods: BR, CC, ECC, CCA [14] and MMOC [7]. To perform a fair comparison, we use the same linear classification/regression package LIBLINEAR [21] with L2-regularized square hinge loss (primal) to train the classifiers for all the methods. ECC is averaged over several CC predictions with random order, and the ensemble size in ECC is set to 10 according to [5, 6]. In our experiment, the running time of PCC and EPCC [5] on most data sets, like slashdot and yahoo art, takes more than one week. From the results in [5], ECC is comparable with EPCC and outperforms PCC, so we do not consider PCC and EPCC here. CCA and MMOC are two state-of-the-art encoding-decoding [13] methods. We could not obtain the results of CCA and MMOC on the yahoo art 10, eurlex sm 10 and eurlex ed 10 data sets within one week. Following [22], we consider the Example-F1, Macro-F1 and Micro-F1 measures to evaluate the prediction performance of all methods. We perform 5-fold cross-validation on each data set and report the mean and standard error of each evaluation measurement. The running time comparison is reported in the Supplementary Materials.

³http://mulan.sourceforge.net
⁴http://meka.sourceforge.net/#datasets
⁵http://cse.seu.edu.cn/people/zhangml/Resources.htm#data

Table 1: Results of Example-F1 on the various data sets (mean ± standard deviation). The best results are in bold. 
Numbers in square brackets indicate the rank.\n\nData set\n\nBR\n\nCC\n\nECC\n\nCCA\n\n0.6109 (cid:6) 0.024[4]\n0.5850(cid:6) 0.033[7]\n0.6076 (cid:6) 0.019[6]\n0.6096(cid:6) 0.018[5]\nyeast\n0.5947(cid:6) 0.015[4] 0.5947 (cid:6) 0.009[4]\n0.5247 (cid:6) 0.025[7]\n0.5991(cid:6) 0.021[1]\nimage\n0.5260 (cid:6) 0.021[3]\n0.5246(cid:6) 0.028[4] 0.5123(cid:6) 0.027[5]\n0.4898 (cid:6) 0.024[6]\nslashdot\n0.4848(cid:6) 0.014[4]\n0.4792 (cid:6) 0.017[7] 0.4799(cid:6) 0.011[6]\n0.4812 (cid:6) 0.024[5]\nenron\n0.3138 (cid:6) 0.022[6]\n0.3219(cid:6) 0.028[4] 0.3223(cid:6) 0.030[3]\n0.2978 (cid:6) 0.026[7]\nLLog 10\n0.5013(cid:6) 0.022[4]\n0.5070(cid:6) 0.020[3]\nyahoo art 10 0.4840 (cid:6) 0.023[5]\n-\neurlex sm 10 0.8594 (cid:6) 0.003[5]\n0.8609(cid:6) 0.004[1] 0.8606(cid:6) 0.003[3]\n-\n0.7183(cid:6) 0.013[2]\n0.7176(cid:6) 0.012[4]\neurlex ed 10 0.7170 (cid:6) 0.012[5]\n-\n3.63\n3.88\nAverage Rank\n\n5.88\n\n4.60\n\nCC-DP\n\nMMOC\n\nCC-Greedy\n0.6135(cid:6) 0.015[2]\n0.6132 (cid:6) 0.021 [3] 0.6144(cid:6) 0.021[1]\n0.5939(cid:6) 0.021[6]\n0.5960 (cid:6) 0.012[3]\n0.5976(cid:6) 0.015[2]\n0.5268(cid:6) 0.022[1]\n0.5266(cid:6) 0.022[2]\n0.4895 (cid:6) 0.022[7]\n0.4894 (cid:6) 0.016[2] 0.4880(cid:6) 0.015[3]\n0.4940 (cid:6) 0.016[1]\n0.3153 (cid:6) 0.026[5]\n0.3269(cid:6) 0.023[2]\n0.3298(cid:6) 0.025[1]\n0.5131(cid:6) 0.015[2]\n0.5135(cid:6) 0.020[1]\n-\n0.8600(cid:6) 0.004[4]\n0.8609(cid:6) 0.004[1]\n-\n0.7190(cid:6) 0.013[1]\n0.7183(cid:6) 0.013[2]\n-\n1.50\n2.63\n\n3.80\n\n5.2 Prediction performance\n\nExample-F1 results for our method and baseline approaches in respect of the different data sets\nare reported in Table 1. Other measure results are reported in the Supplementary Materials. 
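For reference, the three evaluation measures just named have standard definitions, which can be sketched as follows (a minimal illustration with 0/1 label vectors, not the paper's Matlab code): Example-F1 averages the F1 score between the predicted and true label sets per instance, Macro-F1 averages per-label F1 scores, and Micro-F1 pools the true/false positive and false negative counts over all labels.

```python
def f1(tp, fp, fn):
    """F1 score from raw counts; 0 when there is nothing to score."""
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def example_f1(Y_true, Y_pred):
    # mean F1 between predicted and true label sets, one score per instance
    scores = []
    for yt, yp in zip(Y_true, Y_pred):
        tp = sum(a and b for a, b in zip(yt, yp))
        fp = sum((not a) and b for a, b in zip(yt, yp))
        fn = sum(a and (not b) for a, b in zip(yt, yp))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

def macro_f1(Y_true, Y_pred):
    # unweighted mean of per-label F1 scores
    q = len(Y_true[0])
    total = 0.0
    for j in range(q):
        t = [y[j] for y in Y_true]
        p = [y[j] for y in Y_pred]
        tp = sum(a and b for a, b in zip(t, p))
        fp = sum((not a) and b for a, b in zip(t, p))
        fn = sum(a and (not b) for a, b in zip(t, p))
        total += f1(tp, fp, fn)
    return total / q

def micro_f1(Y_true, Y_pred):
    # F1 over counts pooled across all instances and labels
    tp = fp = fn = 0
    for yt, yp in zip(Y_true, Y_pred):
        tp += sum(a and b for a, b in zip(yt, yp))
        fp += sum((not a) and b for a, b in zip(yt, yp))
        fn += sum(a and (not b) for a, b in zip(yt, yp))
    return f1(tp, fp, fn)
```

Note that Macro-F1 weights every label equally, so rare labels influence it as much as frequent ones, whereas Micro-F1 is dominated by the frequent labels.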
From the results, we can see that: 1) BR is much inferior to the other methods in terms of Example-F1. Our experiment provides empirical evidence that label correlations exist in many real-world data sets; because BR ignores the information about the correlations between the labels, it achieves poor performance on most data sets. 2) CC improves the performance of BR; however, it underperforms ECC. This result verifies the answer to our first question stated in Section 1: the label order does affect the performance of CC, and ECC, which averages over several CC predictions with random order, improves the performance of CC. 3) CC-DP and CC-Greedy outperform CCA and MMOC. This verifies that the optimal CC achieves competitive results compared with state-of-the-art encoding-decoding approaches. 4) Our proposed CC-DP and CC-Greedy algorithms are successful on most data sets. This empirical result also verifies the answers to the last two questions stated in Section 1: the globally optimal CC exists and CC-DP can find the globally optimal CC, which achieves the best prediction performance; the CC-Greedy algorithm achieves comparable prediction performance with CC-DP, while it requires lower time complexity than CC-DP. In the experiment, our proposed algorithms are much faster than CCA and MMOC in terms of both training and testing time, and they achieve the same testing time as CC. Though the training time of our algorithms is longer than that of BR, CC and ECC, our extensive empirical studies show that our algorithms achieve superior performance to those baselines.

6 Conclusion

To improve the performance of multi-label classification, a plethora of models have been developed to capture label correlations. Amongst them, the classifier chain is one of the most popular approaches due to its simplicity and good prediction performance. 
Instead of proposing a new learning model, in this work we discuss three important questions regarding the optimal classifier chain, stated in Section 1. To answer these questions, we first propose a generalized CC model. We then provide a theoretical analysis of the generalization error for the proposed generalized model. Based on our results, we obtain the answer to the second question: the globally optimal CC exists only if the minimization of the upper bound is achieved over this CC. It is very expensive to search over the q! different label orders to find the globally optimal CC. Thus, we propose the CC-DP algorithm to simplify the search, which requires O(q^3 nd) time. To speed up CC-DP, we propose the CC-Greedy algorithm to find a locally optimal CC, whose time complexity is O(q^2 nd). Comprehensive experiments on eight real-world multi-label data sets from different domains verify our theoretical studies and the effectiveness of the proposed algorithms.

Acknowledgments

This research was supported by the Australian Research Council Future Fellowship FT130100746.

References

[1] Robert E. Schapire and Yoram Singer. BoosTexter: A Boosting-based System for Text Categorization. Machine Learning, 39(2-3):135–168, 2000.

[2] Zafer Barutçuoglu and Robert E. Schapire and Olga G. Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7):830–836, 2006.

[3] Matthew R. Boutell and Jiebo Luo and Xipeng Shen and Christopher M. Brown. Learning Multi-Label Scene Classification. Pattern Recognition, 37(9):1757–1771, 2004.

[4] Grigorios Tsoumakas and Ioannis Katakis and Ioannis P. Vlahavas. Mining Multi-label Data. In Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer US, 2010.

[5] Krzysztof Dembczynski and Weiwei Cheng and Eyke Hüllermeier.
Bayes Optimal Multilabel Classification via Probabilistic Classifier Chains. In Proceedings of the 27th International Conference on Machine Learning, pages 279–286, Haifa, Israel, 2010. Omnipress.

[6] Jesse Read and Bernhard Pfahringer and Geoffrey Holmes and Eibe Frank. Classifier Chains for Multi-label Classification. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II, pages 254–269, Berlin, Heidelberg, 2009. Springer-Verlag.

[7] Yi Zhang and Jeff G. Schneider. Maximum Margin Output Coding. In Proceedings of the 29th International Conference on Machine Learning, pages 1575–1582, New York, NY, 2012. Omnipress.

[8] Yuhong Guo and Suicheng Gu. Multi-Label Classification Using Conditional Dependency Networks. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, pages 1300–1305, Barcelona, Catalonia, Spain, 2011. AAAI Press.

[9] Sheng-Jun Huang and Zhi-Hua Zhou. Multi-Label Learning by Exploiting Label Correlations Locally. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, Toronto, Ontario, Canada, 2012. AAAI Press.

[10] Feng Kang and Rong Jin and Rahul Sukthankar. Correlated Label Propagation with Application to Multi-label Learning. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1719–1726, New York, NY, 2006. IEEE Computer Society.

[11] Weiwei Liu and Ivor W. Tsang. Large Margin Metric Learning for Multi-Label Prediction. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2800–2806, Texas, USA, 2015. AAAI Press.

[12] Mingkui Tan and Qinfeng Shi and Anton van den Hengel and Chunhua Shen and Junbin Gao and Fuyuan Hu and Zhen Zhang. Learning Graph Structure for Multi-Label Image Classification via Clique Generation.
In The IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[13] Daniel Hsu and Sham Kakade and John Langford and Tong Zhang. Multi-Label Prediction via Compressed Sensing. In Advances in Neural Information Processing Systems, pages 772–780, 2009. Curran Associates, Inc.

[14] Yi Zhang and Jeff G. Schneider. Multi-Label Output Codes using Canonical Correlation Analysis. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 873–882, Fort Lauderdale, USA, 2011. JMLR.org.

[15] Farbound Tai and Hsuan-Tien Lin. Multilabel Classification with Principal Label Space Transformation. Neural Computation, 24(9):2508–2542, 2012.

[16] Min-Ling Zhang and Kun Zhang. Multi-label learning by exploiting label dependency. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 999–1008, Washington, DC, USA, 2010. ACM.

[17] John Shawe-Taylor and Peter L. Bartlett and Robert C. Williamson and Martin Anthony. Structural Risk Minimization Over Data-Dependent Hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

[18] Kristin P. Bennett and Nello Cristianini and John Shawe-Taylor and Donghui Wu. Enlarging the Margins in Perceptron Decision Trees. Machine Learning, 41(3):295–313, 2000.

[19] Michael J. Kearns and Robert E. Schapire. Efficient Distribution-free Learning of Probabilistic Concepts. In Proceedings of the 31st Symposium on the Foundations of Computer Science, pages 382–391, Los Alamitos, CA, 1990. IEEE Computer Society Press.

[20] Peter L. Bartlett and John Shawe-Taylor. Generalization Performance of Support Vector Machines and Other Pattern Classifiers. In Advances in Kernel Methods - Support Vector Learning, pages 43–54, Cambridge, MA, USA, 1998. MIT Press.

[21] Rong-En Fan and Kai-Wei Chang and Cho-Jui Hsieh and Xiang-Rui Wang and Chih-Jen Lin.
LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[22] Qi Mao and Ivor Wai-Hung Tsang and Shenghua Gao. Objective-Guided Image Annotation. IEEE Transactions on Image Processing, 22(4):1585–1597, 2013.