{"title": "Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 153, "page_last": 161, "abstract": "In hierarchical classification, the prediction paths may be required to always end at leaf nodes. This is called mandatory leaf node prediction (MLNP) and is particularly useful when the leaf nodes have much stronger semantic meaning than the internal nodes. However, while there have been a lot of MLNP methods in hierarchical multiclass classification, performing MLNP in hierarchical multilabel classification is much more difficult. In this paper, we propose a novel MLNP algorithm that (i) considers the global hierarchy structure; and (ii) can  be used on hierarchies of both trees and DAGs. We show that one can efficiently maximize the joint posterior probability of all the node labels by a simple greedy algorithm. Moreover, this can be further extended to the minimization of the expected symmetric loss. Experiments are performed on a number of real-world data sets with tree- and DAG-structured label hierarchies. The  proposed method consistently outperforms other hierarchical and flat multilabel classification methods.", "full_text": "Mandatory Leaf Node Prediction in\nHierarchical Multilabel Classi\ufb01cation\n\nWei Bi\n\nJames T. Kwok\n\nDepartment of Computer Science and Engineering\nHong Kong University of Science and Technology\n\nClear Water Bay, Hong Kong\n\n{weibi,jamesk}@cse.ust.hk\n\nAbstract\n\nIn hierarchical classi\ufb01cation, the prediction paths may be required to always end\nat leaf nodes. This is called mandatory leaf node prediction (MLNP) and is par-\nticularly useful when the leaf nodes have much stronger semantic meaning than\nthe internal nodes. However, while there have been a lot of MLNP methods in hi-\nerarchical multiclass classi\ufb01cation, performing MLNP in hierarchical multilabel\nclassi\ufb01cation is much more dif\ufb01cult. 
In this paper, we propose a novel MLNP\nalgorithm that (i) considers the global hierarchy structure; and (ii) can be used on\nhierarchies of both trees and DAGs. We show that one can ef\ufb01ciently maximize\nthe joint posterior probability of all the node labels by a simple greedy algorithm.\nMoreover, this can be further extended to the minimization of the expected sym-\nmetric loss. Experiments are performed on a number of real-world data sets with\ntree- and DAG-structured label hierarchies. The proposed method consistently\noutperforms other hierarchical and \ufb02at multilabel classi\ufb01cation methods.\n\n1\n\nIntroduction\n\nIn many real-world classi\ufb01cation problems, the output labels are organized in a hierarchy. For\nexample, gene functions are arranged in a tree in the Functional Catalog (FunCat) or as a directed\nacyclic graph (DAG) in the Gene Ontology (GO) [1]; musical signals are organized in an audio\ntaxonomy [2]; and documents in the Wikipedia hierarchy. Hierarchical classi\ufb01cation algorithms,\nwhich utilize these hierarchical relationships between labels in making predictions, often lead to\nbetter performance than traditional non-hierarchical (\ufb02at) approaches.\nIn hierarchical classi\ufb01cation, the labels associated with each pattern can be on a path from the root\nto a leaf (full-path prediction); or stop at an internal node (partial-path prediction [3]). Following\nthe terminology in the recent survey [4], when only full-path predictions are allowed, it is called\nmandatory leaf node prediction (MLNP); whereas when partial-path predictions are also allowed,\nit is called non-mandatory leaf node prediction (NMLNP). Depending on the application and how\nthe label hierarchy is generated, either one of these prediction modes may be more relevant. For\nexample, in the taxonomies of musical signals [2] and genes [5], the leaf nodes have much stronger\nsemantic/biological meanings than the internal nodes, and MLNP is more important. 
Besides, some-\ntimes the label hierarchy is learned from the data, using methods like hierarchical clustering [6],\nBayesian network structure learning [7] and label tree methods [8, 9]. In these cases, the internal\nnodes are only arti\ufb01cial, and MLNP is again more relevant. In the recent Second Pascal Challenge\non Large-scale Hierarchical Text Classi\ufb01cation, the tasks also require MLNP.\nIn this paper, we focus on hierarchical multilabel classi\ufb01cation (HMC), which differs from hierar-\nchical multiclass classi\ufb01cation in that the labels of each pattern can fall on a union of paths in the\nhierarchy [10]. An everyday example is that a document/image/song/video may have multiple tags.\nBecause of its practical signi\ufb01cance, HMC has been extensively studied in recent years [1,3,10\u201312].\n\n1\n\n\fWhile there have been a lot of MLNP methods in hierarchical multiclass classi\ufb01cation [4], none of\nthese can be easily extended for the more dif\ufb01cult HMC setting. They all rely on training a multiclass\nclassi\ufb01er at each node, and then use a recursive strategy to predict which subtree to pursue at the next\nlower level. In hierarchical multiclass classi\ufb01cation, exactly one subtree is to be pursued; whereas\nin HMC, one has to decide at each node how many and which subtrees to pursue. Even when this\ncan be performed (e.g., by adjusting the classi\ufb01cation threshold heuristically), it is dif\ufb01cult to ensure\nthat all the prediction paths will end at leaf nodes, and so a lot of partial paths may be resulted.\nAlternatively, one may perform MLNP by \ufb01rst predicting the number of leaf labels (k) that the test\npattern has, and then pick the k leaf labels whose posterior probabilities are the largest. Prediction of\nk can be achieved by using the MetaLabeler [13], though this involves another, possibly non-trivial,\nlearning task. 
Moreover, the posterior probability computed at each leaf $l$ corresponds to a single prediction path from the root to $l$. However, the target multilabel in HMC can have multiple paths. Hence, a better approach is to compute the posterior probabilities of all subtrees/subgraphs that have $k$ leaf nodes, and then pick the one with the largest probability. However, as there are $\\binom{N}{k}$ such possible subsets (where $N$ is the number of leaves), this can be expensive when $N$ is large.\nRecently, Cerri et al. [14] proposed the HMC-label-powerset (HMC-LP), which is specially designed for MLNP in HMC. Its main idea is to reduce the hierarchical problem to a non-hierarchical problem by running the (non-hierarchical) multilabel classification method of label-powerset [15] at each level of the hierarchy. However, this significantly increases the number of \u201cmeta-labels\u201d, making it unsuitable for large hierarchies. Moreover, as it processes the hierarchy level-by-level, it cannot be applied to DAGs, where \u201clevels\u201d are not well-defined.\nIn this paper, we propose an efficient algorithm for MLNP in both tree-structured and DAG-structured hierarchical multilabel classification. The target multilabel is obtained by maximizing the posterior probability among all feasible multilabels. By adopting a weak \u201cnested approximation\u201d assumption, we show that the resultant optimization problem can be efficiently solved by a greedy algorithm. Empirical results also demonstrate that this \u201cnested approximation\u201d assumption holds in general. The rest of this paper is organized as follows. Section 2 describes the proposed framework for MLNP on tree-structured hierarchies, which is then extended to DAG-structured hierarchies in Section 3. Experimental results are presented in Section 4, and the last section gives some concluding remarks.\n\n2 Maximum a Posteriori MLNP on Label Trees\nIn this section, we assume that the label hierarchy is a tree $T$. With a slight abuse of notation, we will also use $T$ to denote the set of all the tree nodes, which are indexed $0$ (for the root), $1, 2, \\ldots, N$. Let the set of leaf nodes in $T$ be $L$. For a subset $A \\subseteq T$, its complement is denoted by $A^c = T \\setminus A$. For a node $i$, denote its parent by $\\mathrm{pa}(i)$, and its set of children by $\\mathrm{child}(i)$. Moreover, given a vector $y$, $y_A$ is the subvector of $y$ with indices from $A$.\nIn HMC, we are given a set of training examples $\\{(x, y)\\}$, where $x$ is the input and $y = [y_0, \\ldots, y_N]' \\in \\{0, 1\\}^{N+1}$ is the multilabel denoting memberships of $x$ to each of the nodes. Equivalently, $y$ can be represented by a set $\\Omega \\subseteq T$, such that $y_i = 1$ if $i \\in \\Omega$, and $0$ otherwise. For $y$ (or $\\Omega$) to respect the tree structure, we require that $y_i = 1 \\Rightarrow y_{\\mathrm{pa}(i)} = 1$ for any non-root node $i \\in T$.\nIn this paper, we assume that for any group of siblings $\\{i_1, i_2, \\ldots, i_m\\}$, their labels are conditionally independent given the label of their parent $\\mathrm{pa}(i_1)$ and $x$, i.e., $p(y_{i_1}, y_{i_2}, \\ldots, y_{i_m} \\mid y_{\\mathrm{pa}(i_1)}, x) = \\prod_{j=1}^{m} p(y_{i_j} \\mid y_{\\mathrm{pa}(i_1)}, x)$. This simplification is standard in Bayesian networks and also commonly used in HMC [16, 17]. By repeated application of the probability product rule, we have\n\n$$p(y_0, \\ldots, y_N \\mid x) = p(y_0 \\mid x) \\prod_{i \\in T \\setminus \\{0\\}} p(y_i \\mid y_{\\mathrm{pa}(i)}, x). \\qquad (1)$$\n\n2.1 Training\nWith the simplification in (1), we only need to train estimators for $p(y_i = 1 \\mid y_{\\mathrm{pa}(i)} = 1, x)$, $i \\in T \\setminus \\{0\\}$. The algorithms to be proposed are independent of the way these probability estimators are learned.\n
In the experiments, we train a multitask lasso model for each group of sibling nodes, using those training examples for which the shared parent is labeled positive.\n\n2.2 Prediction\nFor maximum a posteriori MLNP of a test pattern $x$, we want to find the multilabel $\\Omega^*$ that (i) maximizes the posterior probability in (1); and (ii) respects $T$. Suppose that it is also known that $x$ has $k$ leaf labels. The prediction task is then:\n\n$$\\begin{aligned} \\Omega^* = \\arg\\max_{\\Omega} \\quad & p(y_{\\Omega} = 1, y_{\\Omega^c} = 0 \\mid x) && (2) \\\\ \\text{s.t.} \\quad & y_0 = 1, \\ k \\text{ of the leaves in } L \\text{ are labeled } 1, \\\\ & \\Omega \\text{ contains no partial path}, \\\\ & \\text{all } y_i\\text{'s respect the label hierarchy}. && (3) \\end{aligned}$$\n\nNote that $p(y_{\\Omega} = 1, y_{\\Omega^c} = 0 \\mid x)$ considers all the node labels in the hierarchy simultaneously. In contrast, as discussed in Section 1, existing MLNP methods in hierarchical multiclass/multilabel classification only consider the hierarchy information locally at each node.\nAssociate an indicator function $\\psi : T \\to \\{0, 1\\}^{N+1}$ with $\\Omega$, such that $\\psi_i \\equiv \\psi(i) = 1$ if $i \\in \\Omega$, and $0$ otherwise. The following Proposition shows that (2) can be written as an integer linear program.\nProposition 1. For a label tree, problem (2) can be rewritten as\n\n$$\\begin{aligned} \\max_{\\psi} \\quad & \\sum_{i \\in T} w_i \\psi_i && (4) \\\\ \\text{s.t.} \\quad & \\sum_{i \\in L} \\psi_i = k, \\ \\psi_0 = 1, \\ \\psi_i \\in \\{0, 1\\} \\ \\forall i \\in T, \\\\ & \\sum_{j \\in \\mathrm{child}(i)} \\psi_j \\geq 1 \\ \\forall i \\in L^c : \\psi_i = 1, \\\\ & \\psi_i \\leq \\psi_{\\mathrm{pa}(i)} \\ \\forall i \\in T \\setminus \\{0\\}, && (5) \\end{aligned}$$\n\nwhere\n\n$$w_i = \\begin{cases} \\sum_{l \\in \\mathrm{child}(0)} \\log(1 - p_l) & i = 0, \\\\ \\log p_i - \\log(1 - p_i) & i \\in L, \\\\ \\log p_i - \\log(1 - p_i) + \\sum_{l \\in \\mathrm{child}(i)} \\log(1 - p_l) & i \\in L^c \\setminus \\{0\\}, \\end{cases} \\qquad (6)$$\n\nand $p_i \\equiv p(y_i = 1 \\mid y_{\\mathrm{pa}(i)} = 1, x)$.\nProblem (4) has $\\binom{|L|}{k}$ candidate solutions, which can be expensive to solve when $T$ is large. In the following, we will extend the nested approximation property (NAP), first introduced in [18] for model-based compressed sensing, to constrain the optimal solution.\nDefinition 1 ($k$-leaf-sparse). A multilabel $y$ is $k$-leaf-sparse if $k$ of the leaf nodes are labeled one.\nDefinition 2 (Nested Approximation Property (NAP)). For a pattern $x$, let its optimal $k$-leaf-sparse multilabel be $\\Omega_k$. The NAP is satisfied if $\\{i : i \\in \\Omega_k\\} \\subset \\{i : i \\in \\Omega_{k'}\\}$ for all $k < k'$.\nNote that NAP is often implicitly assumed in many HMC algorithms. For example, consider the common approach that trains a binary classifier at each node and recursively predicts from the root to the subtrees. When the classification threshold at each node is high, prediction stops early; whereas when the threshold is lowered, prediction can go further down the hierarchy. Hence, nodes that are labeled positive at a high threshold will always be labeled positive at a lower threshold, implying NAP. Another example is the CSSA algorithm in [11]. Since it is greedy, a larger solution (with more labels predicted positive) always includes the smaller solutions.\nAlgorithm 1 shows the proposed algorithm, which will be called MAS (MAndatory leaf node prediction on Structures). Similar to [11], Algorithm 1 is also greedy and based on keeping track of the supernodes. However, the definition of a supernode and its updating are different. Each node $i \\in T$ is associated with the weight $w_i$ in (6). Initially, only the root is selected ($\\psi_0 = 1$). For each leaf $l$ in $L$, we create a supernode, which is a subset in $T$ containing all the nodes on the path from $l$ to the root. Given $|L|$ leaves in $T$, there are initially $|L|$ supernodes. Moreover, all of them are unassigned (i.e., each contains an unselected leaf node). Each supernode $S$ has a supernode value (SNV), which is defined as $\\mathrm{SNV}(S) = \\sum_{i \\in S} w_i$.\n\nAlgorithm 1 MAS (Mandatory leaf node prediction on structures).\n1: Initialization: Initialize every node (except the root) with $\\psi_i \\leftarrow 0$; $\\Omega \\leftarrow \\{0\\}$; Create a supernode from each leaf with its ancestors.\n2: for iteration = 1 to $k$ do\n3:   select the unassigned supernode $S^*$ with the largest SNV;\n4:   assign all unselected nodes in $S^*$ with $\\psi_i \\leftarrow 1$;\n5:   $\\Omega \\leftarrow \\Omega \\cup S^*$;\n6:   for each unassigned supernode $S$ do\n7:     update the SNV of $S$ (using Algorithm 2 for trees and Algorithm 3 for DAGs);\n8:   end for\n9: end for\n\nIn each iteration, the supernode $S^*$ with the largest SNV is selected among all the unassigned supernodes. $S^*$ is then assigned, with the $\\psi_i$'s of all its constituent nodes set to 1, and $\\Omega$ is updated accordingly. For each remaining unassigned supernode $S$, we update its SNV to be the value that it will take if $S$ is merged with $\\Omega$, i.e., $\\mathrm{SNV}(S) \\leftarrow \\sum_{i \\in S \\cup \\Omega} w_i = \\sum_{i \\in S \\setminus \\Omega} w_i + \\mathrm{SNV}(\\Omega)$. Since each unassigned $S$ contains exactly one leaf and we have a tree structure, this update can be implemented efficiently in $O(h^2)$ time, where $h$ is the height of the tree (Algorithm 2).\n\nAlgorithm 2 Updating the SNV of an unassigned tree supernode $S$, containing the leaf $l$.\n1: node $\\leftarrow l$;\n2: $\\mathrm{SNV}(S) \\leftarrow \\mathrm{SNV}(\\Omega)$;\n3: repeat\n4:   $\\mathrm{SNV}(S) \\leftarrow \\mathrm{SNV}(S) + w_{\\mathrm{node}}$;\n5:   node $\\leftarrow \\mathrm{pa}(\\mathrm{node})$;\n6: until node $\\in \\Omega$.\n\nAlgorithm 3 Updating the SNV of an unassigned DAG supernode $S$, containing the leaf $l$.\n1: insert $l$ to $T$;\n2: $\\mathrm{SNV}(S) \\leftarrow \\mathrm{SNV}(\\Omega)$;\n3: repeat\n4:   node $\\leftarrow$ find-max($T$);\n5:   delete node from $T$;\n6:   $\\mathrm{SNV}(S) \\leftarrow \\mathrm{SNV}(S) + w_{\\mathrm{node}}$;\n7:   insert nodes in $\\mathrm{Pa}(\\mathrm{node}) \\setminus (\\Omega \\cup T)$ to $T$;\n8: until $T = \\emptyset$.\n\nThe following Proposition shows that MAS finds the best $k$-leaf-sparse prediction.\nProposition 2. Algorithm 1 obtains an optimal $\\psi$ solution of (4) under the NAP assumption.\nFinally, we study the time complexity of Algorithm 1. Step 3 takes $O(|L|)$ time; steps 4 and 5 take $O(h)$ time; and updating all the remaining unassigned supernodes takes $O(h^2 |L|)$ time. Therefore, each iteration takes $O(h^2 |L|)$ time, and the total time to obtain an optimal $k$-leaf-sparse solution is $O(h^2 k |L|)$. In contrast, a brute-force search will take $\\binom{|L|}{k}$ time.\n\n2.2.1 Unknown Number of Labels\n\nIn practice, the value of $k$ may not be known. The straightforward approach is to run Algorithm 1 with $k = 1, \\ldots, |L|$, and find the $\\Omega_k \\in \\{\\Omega_1, \\ldots, \\Omega_{|L|}\\}$ that maximizes the posterior probability in (1). However, recall that $\\Omega_k \\subset \\Omega_{k+1}$ under the NAP assumption. Hence, we can simply set $k = |L|$, and $\\Omega_i$ is immediately obtained as the $\\Omega$ in iteration $i$. The total time complexity is $O(h^2 |L|^2)$.\n
In contrast, a brute-force search takes $O(2^{|L|})$ time when $k$ is unknown.\n\n2.3 MLNP that Minimizes Risk\n\nWhile maximizing the posterior probability minimizes the 0-1 loss, another loss function that has been popularly used in hierarchical classification is the H-loss [12]. However, along each prediction path, H-loss only penalizes the first classification mistake closest to the root. On the other hand, we are more interested in the leaf nodes in MLNP. Hence, we will adopt the symmetric loss instead, which is defined as $\\ell(\\Omega, \\mathring{\\Omega}) = |\\Omega \\setminus \\mathring{\\Omega}| + |\\mathring{\\Omega} \\setminus \\Omega|$, where $\\mathring{\\Omega}$ is the true multilabel for the given $x$, and $\\Omega$ is the prediction. However, this weights mistakes in any part of the hierarchy equally; whereas in HMC, a mistake that occurs at a higher level of the hierarchy is usually considered more crucial.\nLet $I(\\cdot)$ be the indicator function that returns 1 when the argument holds, and 0 otherwise. We thus incorporate the hierarchy structure into $\\ell(\\Omega, \\mathring{\\Omega})$ by extending it as $\\sum_i c_i I(i \\in \\Omega \\setminus \\mathring{\\Omega}) + c_i I(i \\in \\mathring{\\Omega} \\setminus \\Omega)$, where $c_0 = 1$, $c_i = c_{\\mathrm{pa}(i)} / \\mathrm{nsibl}(i)$ as in [3], and $\\mathrm{nsibl}(i)$ is the number of siblings of $i$ (including $i$ itself). Finally, one can also allow different relative importance ($\\alpha \\geq 0$) for the false positives and negatives, and generalize $\\ell(\\Omega, \\mathring{\\Omega})$ further as\n\n$$\\ell(\\Omega, \\mathring{\\Omega}) = \\sum_i c_i^+ I(i \\in \\Omega \\setminus \\mathring{\\Omega}) + c_i^- I(i \\in \\mathring{\\Omega} \\setminus \\Omega), \\qquad (7)$$\n\nwhere $c_i^+ = \\frac{2 c_i}{1 + \\alpha}$ and $c_i^- = \\frac{2 \\alpha c_i}{1 + \\alpha}$.\nGiven a loss function $\\ell(\\cdot, \\cdot)$, from Bayesian decision theory, the optimal multilabel $\\Omega^*$ is the one that minimizes the expected loss: $\\Omega^* = \\arg\\min_{\\Omega} \\sum_{\\mathring{\\Omega}} \\ell(\\Omega, \\mathring{\\Omega}) \\, p(y_{\\mathring{\\Omega}} = 1, y_{\\mathring{\\Omega}^c} = 0 \\mid x)$. The proposed formulation can be easily extended for this. The following Proposition shows that it leads to a problem very similar to (4). Extension to a DAG-structured label hierarchy is analogous.\nProposition 3. With a label tree and the loss function in (7), the optimal $\\Omega^*$ that minimizes the expected loss can be obtained by solving (4), but with $w_i = (c_i^+ + c_i^-) \\, p(y_i = 1 \\mid x) - c_i^+$.\n\n3 Maximum a Posteriori MLNP on Label DAGs\nWhen the label hierarchy is a DAG $G$, on using the same conditional independence simplification in Section 2, we have\n\n$$p(y_0, y_1, \\ldots, y_N \\mid x) = p(y_0 \\mid x) \\prod_{i \\in G \\setminus \\{0\\}} p(y_i \\mid y_{\\mathrm{Pa}(i)}, x), \\qquad (8)$$\n\nwhere $\\mathrm{Pa}(i)$ is the set of parents of node $i$. The prediction task involves the same optimization problem as in (2). However, there are now two interpretations on how the labels should respect the DAG in (3) [1, 11]. The first one requires that if a node is labeled positive, all its parents must also be positive. In bioinformatics, this is also called the true path rule that governs the DAG-structured GO taxonomy on gene functions. The alternative is that a node can be labeled positive if at least one of its parents is positive. Here, we adopt the first interpretation which is more common.\nA direct maximization of $p(y_0, y_1, \\ldots, y_N \\mid x)$ by (8) is NP-hard [19]. Moreover, the size of each probability table $p(y_i \\mid y_{\\mathrm{Pa}(i)}, x)$ in (8) grows exponentially with $|\\mathrm{Pa}(i)|$. Hence, it can be both impractical and inaccurate when $G$ is large and the sample size is limited. In the following, we assume\n\n$$p(y_0, y_1, \\ldots, y_N \\mid x) = \\frac{1}{n(x)} \\, p(y_0 \\mid x) \\prod_{i \\in G \\setminus \\{0\\}} \\prod_{j \\in \\mathrm{Pa}(i)} p(y_i \\mid y_j, x), \\qquad (9)$$\n\nwhere $n(x)$ is a normalization term.\n
This follows from the approach of composite likelihood (or pseudolikelihood) [20] which replaces a difficult probability density function by a set of marginal or conditional events that are easier to evaluate. In particular, (9) corresponds to the so-called pairwise conditional likelihood that has been used in longitudinal studies and bioinformatics [21]. Composite likelihood has been successfully used in different applications such as genetics, spatial statistics and image analysis. The connection between composite likelihood and various (flat) multilabel classification models was also recently discussed in [21]. Moreover, by using (9), the $2^{|\\mathrm{Pa}(i)|}$ numbers in the probability table $p(y_i \\mid y_{\\mathrm{Pa}(i)}, x)$ are replaced by the $|\\mathrm{Pa}(i)|$ numbers in $\\{p(y_i \\mid y_j, x)\\}_{j \\in \\mathrm{Pa}(i)}$, and thus the estimates obtained are much more reliable. The following Proposition shows that maximizing (9) can be reduced to a problem similar to (4).\nProposition 4. With the assumption (9), problem (2) for the label DAG can be rewritten as\n\n$$\\begin{aligned} \\max_{\\psi} \\quad & \\sum_{i \\in G} w_i \\psi_i && (10) \\\\ \\text{s.t.} \\quad & \\sum_{i \\in L} \\psi_i = k, \\ \\psi_0 = 1, \\ \\psi_i \\in \\{0, 1\\} \\ \\forall i \\in G, \\\\ & \\sum_{j \\in \\mathrm{child}(i)} \\psi_j \\geq 1 \\ \\forall i \\in L^c : \\psi_i = 1, \\\\ & \\psi_i \\leq \\psi_j \\ \\forall j \\in \\mathrm{Pa}(i), \\ \\forall i \\in G \\setminus \\{0\\}, && (11) \\end{aligned}$$\n\nwhere\n\n$$w_i = \\begin{cases} \\sum_{l \\in \\mathrm{child}(0)} \\log(1 - p_{l0}) & i = 0, \\\\ \\sum_{j \\in \\mathrm{Pa}(i)} (\\log p_{ij} - \\log(1 - p_{ij})) & i \\in L, \\\\ \\sum_{j \\in \\mathrm{Pa}(i)} (\\log p_{ij} - \\log(1 - p_{ij})) + \\sum_{l \\in \\mathrm{child}(i)} \\log(1 - p_{li}) & i \\in L^c \\setminus \\{0\\}, \\end{cases}$$\n\nand $p_{ij} \\equiv p(y_i = 1 \\mid y_j = 1, x)$ for $j \\in \\mathrm{Pa}(i)$.\nProblem (10) is similar to problem (4), except in the definition of $w_i$ and that the hierarchy constraint (11) is more general than (5). When the DAG is indeed a tree, (10) reduces to (4), and Proposition 4 reduces to Proposition 1.\n
When $k$ is unknown, the same procedure in Section 2.2.1 applies.\nIn the proof of Proposition 2, we do not constrain the number of parents for each node. Hence, (10) can be solved efficiently as before, except for two modifications: (i) Each initial supernode now contains a leaf and its ancestors along all paths to the root. (ii) Since $\\mathrm{Pa}(i)$ is a set and the hierarchy is a DAG, updating the SNV gets more complicated. In Algorithm 3, $T$ is a self-balancing binary search tree (BST) that keeps track of the nodes in $S \\setminus \\Omega$ using their topological order$^{1}$. To facilitate the checking of whether a node is in $\\Omega$ (step 7), $\\Omega$ also stores its nodes in a self-balancing BST. Recall that for a self-balancing BST, the operations of insert, delete, find-max and finding an element all take $O(\\log V)$ time, where $V \\leq N$ is the number of nodes in the BST. Hence, updating the SNV of one supernode by Algorithm 3 takes $O(N \\log N)$ time. As $O(|L|)$ supernodes need to be updated in each iteration of Algorithm 1, this step (which is the most expensive step in Algorithm 1) takes $O(|L| \\cdot N \\log N)$ time. The total time for Algorithm 1 is $O(k \\cdot |L| \\cdot N \\log N)$.\n\n4 Experiments\n\nIn this section, experiments are performed on a number of benchmark multilabel data sets$^{2}$, with both tree- and DAG-structured label hierarchies (Table 1). As pre-processing, we remove examples that contain partial label paths and nodes with fewer than 10 positive examples. At each parent node, we then train a multitask lasso model with logistic loss using the MALSAR package [22].\n\n4.1 Classification Performance\n\nThe proposed MAS algorithm is compared with HMC-LP [14], the only existing algorithm that can perform MLNP on trees (but not on DAGs). We also compare with the combined use of MetaLabeler [13] and NMLNP methods as described in Section 1. These NMLNP methods include (i) HBR, which is modified from the hierarchical classifier H-SVM [3], by replacing its base learner SVM with the multitask lasso as for MAS; (ii) CLUS-HMC [1]; and (iii) flat BR [23], which is a popular MLNP method but does not use the hierarchy information. For performance evaluation, we use the hierarchical F-measure (HF) which has been commonly used in hierarchical classification [4]. Results based on 5-fold cross-validation are shown in Table 1. As can be seen, MAS is among the best on almost all data sets.\nNext, we compare the methods using the loss in (7), where the relative importance for false positives vs negatives ($\\alpha$) is set to be the ratio of the numbers of negative and positive training labels. Results are shown in Table 2. As can be seen, the risk-minimizing version (MASR) can always obtain the smallest loss. We also vary $\\alpha$ in the range $\\{\\frac{1}{10}, \\frac{1}{9}, \\ldots, \\frac{1}{2}, 1, 2, \\ldots, 9, 10\\}$. As can be seen from Figure 1, MASR consistently outperforms the other methods, sometimes by a significant margin.\nFinally, Figure 2 illustrates some example query images and their misclassifications by MAS, MASR and BR on the caltech101 data set. As can be seen, even when MAS/MASR misclassifies the image, the hierarchy often helps to keep the prediction close to the true label.\n\n4.2 Validating the NAP Assumption\n\nIn this section, we verify the validity of the NAP assumption. For each test pattern, we use brute-force search to find its best $k$-leaf-sparse prediction, and check if it includes the best $(k-1)$-leaf-sparse prediction. As brute-force search is very expensive, experiments are only performed on four\n\n1 We number the sorted order such that nodes nearer to the root are assigned smaller values.\n
Note that the\n\ntopological sort only needs to be performed once as part of pre-processing.\n\n2Downloaded from http://mulan.sourceforge.net/datasets.html and http://dtai.\n\ncs.kuleuven.be/clus/hmcdatasets/\n\n6\n\n\fTable 1: HF values obtained by the various methods on all data sets. The best results and those\nthat are not statistically worse (according to paired t-test with p-value less than 0.05) are in bold.\nHMC-LP and CLUS-HMC cannot be run on the caltech101 data, which is large and dense.\n\ndata set\n\nrcv1v2 subset1\nrcv1v2 subset2\nrcv1v2 subset3\nrcv1v2 subset4\nrcv1v2 subset5\n\ndelicious\n\nenron\nwipo\n\ncaltech-101\nseq (funcat)\npheno (funcat)\nstruc (funcat)\nhom (funcat)\n\ncellcycle (funcat)\nchurch (funcat)\nderisi (funcat)\neisen (funcat)\ngasch1 (funcat)\ngasch2 (funcat)\n\nspo (funcat)\nexpr (funcat)\n\nseq (GO)\npheno (GO)\nstruc (GO)\nhom (GO)\n\ncellcycle (GO)\nchurch (GO)\nderisi (GO)\neisen (GO)\ngasch1 (GO)\ngasch2 (GO)\n\nspo (GO)\nexpr (GO)\n\n#pattern\n\n4422\n4485\n4513\n4569\n4452\n768\n1607\n569\n9144\n1115\n330\n1065\n1124\n1080\n1104\n995\n768\n1038\n1076\n1053\n1109\n518\n227\n505\n507\n484\n511\n492\n404\n512\n508\n494\n504\n\n#leaf\n42\n43\n46\n44\n45\n49\n24\n21\n102\n36\n14\n33\n35\n33\n35\n33\n29\n32\n33\n32\n32\n32\n19\n33\n29\n29\n28\n31\n28\n32\n32\n32\n35\n\navg #leaf\n\nper\n\npattern\n\n1.3\n1.3\n1.3\n1.3\n1.4\n5.4\n2.6\n1\n1\n1.8\n1.6\n1.8\n1.8\n1.9\n1.8\n1.8\n1.8\n1.8\n1.8\n1.8\n1.8\n3.6\n3.5\n3.5\n3.2\n3.1\n3.2\n3.4\n3.4\n3.4\n3.3\n3.3\n3.5\n\n(hierarchical)\n\n(with MetaLabeler)\n\n-\n\n0.63\n0.64\n0.63\n0.64\n0.63\n0.57\n0.68\n0.71\n\n-\n\n0.22\n0.21\n0.20\n0.21\n0.21\n0.23\n0.72\n0.42\n\n0.15\n0.12\n0.03\n0.21\n0.12\n0.05\n0.08\n0.10\n0.11\n0.05\n0.10\n0.12\n\nMAS HMC-LP HBR 
CLUS-HMC\n0.85\n0.85\n0.85\n0.86\n0.84\n0.53\n0.75\n0.83\n0.82\n0.26\n0.25\n0.23\n0.35\n0.20\n0.17\n0.18\n0.28\n0.25\n0.24\n0.18\n0.28\n0.52\n0.57\n0.51\n0.65\n0.49\n0.57\n0.56\n0.48\n0.64\n0.55\n0.50\n0.49\n\n0.83\n0.84\n0.83\n0.84\n0.83\n0.28\n0.74\n0.83\n0.82\n0.25\n0.25\n0.25\n0.36\n0.21\n0.18\n0.18\n0.29\n0.23\n0.22\n0.18\n0.25\n0.58\n0.53\n0.48\n0.60\n0.49\n0.50\n0.49\n0.54\n0.56\n0.50\n0.47\n0.57\n\n0.26\n0.20\n0.21\n0.27\n0.19\n0.20\n0.21\n0.28\n0.29\n0.25\n0.23\n0.25\n0.59\n0.49\n0.55\n0.59\n0.51\n0.53\n0.53\n0.57\n0.57\n0.51\n0.49\n0.55\n\n-\n-\n-\n-\n-\n-\n-\n\n-\n-\n-\n-\n\n(\ufb02at)\n\nBR\n0.83\n0.84\n0.83\n0.84\n0.83\n0.54\n0.74\n0.83\n0.70\n0.23\n0.23\n0.24\n0.36\n0.19\n0.17\n0.18\n0.27\n0.22\n0.25\n0.18\n0.27\n0.61\n0.55\n0.53\n0.63\n0.51\n0.54\n0.54\n0.57\n0.58\n0.53\n0.51\n0.60\n\nsmaller data sets for k = 2, . . . , 10. Figure 3 shows the percentage of test patterns satisfying the\nNAP assumption at different values of k. As can be seen, the NAP holds almost 100% of the time.\n5 Conclusion\n\nIn this paper, we proposed a novel hierarchical multilabel classi\ufb01cation (HMC) algorithm for manda-\ntory leaf node prediction. Unlike many hierarchical multilabel/multiclass classi\ufb01cation algorithms,\nit utilizes the global hierarchy information by \ufb01nding the multilabel that has the largest posterior\nprobability over all the node labels. By adopting a weak \u201cnested approximation\u201d assumption, which\nis already implicitly assumed in many HMC algorithms, we showed that this can be ef\ufb01ciently\noptimized by a simple greedy algorithm. Moreover, it can be extended to minimize the risk asso-\nciated with the (hierarchically weighted) symmetric loss. 
Experiments performed on a number of\nreal-world data sets demonstrate that the proposed algorithms are computationally simple and more\naccurate than existing HMC and \ufb02at multilabel classi\ufb01cation methods.\nAcknowledgment\n\nThis research has been partially supported by the Research Grants Council of the Hong Kong Special\nAdministrative Region under grant 614012.\n\n7\n\n\f(a) rcv1subset1\n\n(c) struc(funcat)\nFigure 1: Hierarchically weighted symmetric loss values (7) for different \u03b1\u2019s.\n\n(b) enron\n\nFigure 2: Example misclassi\ufb01cations on the caltech101 data set.\n\nTable 2: Hierarchically weighted symmetric loss values (7) on the tree-structured data sets.\n\ndata set\n\nrcv1v2 subset1\nrcv1v2 subset2\nrcv1v2 subset3\nrcv1v2 subset4\nrcv1v2 subset5\n\ndelicious\n\nenron\nwipo\n\ncaltech-101\nseq (funcat)\npheno (funcat)\nstruc (funcat)\nhom (funcat)\n\ncellcycle (funcat)\nchurch (funcat)\nderisi (funcat)\neisen (funcat)\ngasch1 (funcat)\ngasch2 (funcat)\n\nspo (funcat)\nexpr (funcat)\n\n(used with MetaLabeler)\n\n-\n\n0.46\n0.45\n0.45\n0.44\n0.46\n0.23\n0.25\n0.34\n\nMASR MAS HMC-LP HBR CLUS-HMC\n0.05\n0.04\n0.04\n0.04\n0.05\n0.23\n0.31\n0.07\n0.00\n0.24\n0.39\n0.29\n0.32\n0.24\n0.26\n0.26\n0.30\n0.24\n0.30\n0.31\n0.24\n\n0.10\n0.09\n0.09\n0.10\n0.10\n0.19\n0.36\n0.09\n0.01\n0.26\n0.38\n0.39\n0.36\n0.29\n0.30\n0.30\n0.36\n0.27\n0.27\n0.29\n0.26\n\n0.12\n0.11\n0.11\n0.11\n0.11\n0.14\n0.35\n0.09\n0.01\n0.38\n0.38\n0.89\n0.36\n0.29\n0.30\n0.30\n0.38\n0.29\n0.29\n0.30\n0.28\n\n0.41\n0.61\n0.42\n0.37\n0.41\n0.42\n0.45\n0.39\n0.43\n0.42\n0.42\n0.41\n\n0.20\n0.19\n0.20\n0.19\n0.20\n0.13\n0.41\n0.16\n\n-\n\n0.38\n0.55\n0.41\n0.34\n0.38\n0.41\n0.43\n0.36\n0.39\n0.39\n0.40\n0.39\n\nBR\n0.13\n0.12\n0.12\n0.11\n0.12\n0.14\n0.35\n0.09\n0.01\n0.41\n0.41\n0.40\n0.32\n0.30\n0.31\n0.30\n0.38\n0.29\n0.29\n0.30\n0.28\n\n(a) pheno(funcat).\n\n(b) pheno(GO).\n\n(c) eisen(funcat).\n\n(d) eisen(GO).\n\nFigure 3: Percentage of patterns satisfying the NAP 
assumption at different values of k.\n\n8\n\n1/10 1/5 1 5 10 0.040.050.060.070.080.090.10.110.12(cid:95)Average Testing Loss MASRMASHBR1/10 1/5 1 5 10 0.20.220.240.260.280.30.320.340.360.380.4(cid:95)Average Testing Loss MASRMASHBR1/10 1/5 1 5 10 0.20.250.30.350.4(cid:95)Average Testing Loss MASRMASHBRanimate inanimate animal water crayfish lobster insect butterfly root Query MASR MAS BR music wind accordion animate inanimate animal water crocodile dolphin air ibis root Query MASR MAS BR transportation air airplane animate plant flower sunflower water lily insect butterfly root Query MASR MAS BR human face 24681090919293949596979899100kinstances satisfying NAP(%)24681090919293949596979899100kinstances satisfying NAP(%)24681090919293949596979899100kinstances satisfying NAP(%)24681090919293949596979899100kinstances satisfying NAP(%)\fReferences\n[1] C. Vens, J. Struyf, L. Schietgat, S. Dvzeroski, and H. Blockeel. Decision trees for hierarchical multi-label\n\nclassi\ufb01cation. Machine Learning, 73:185\u2013214, 2008.\n\n[2] J.J. Burred and A. Lerch. A hierarchical approach to automatic musical genre classi\ufb01cation. In Proceed-\n\nings of the 6th International Conference on Digital Audio Effects, 2003.\n\n[3] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni.\n\nIncremental algorithms for hierarchical classi\ufb01cation.\n\nJournal of Machine Learning Research, 7:31\u201354, 2006.\n\n[4] C.N. Silla and A.A. Freitas. A survey of hierarchical classi\ufb01cation across different application domains.\n\nData Mining and Knowledge Discovery, 22(1-2):31\u201372, 2011.\n\n[5] Z. Barutcuoglu and O.G. Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinfor-\n\nmatics, 22:830\u2013836, 2006.\n\n[6] K. Punera, S. Rajan, and J. Ghosh. Automatically learning document taxonomies for hierarchical clas-\nsi\ufb01cation. In Proceedings of the 14th International Conference on World Wide Web, pages 1010\u20131011,\n2005.\n\n[7] M.-L. Zhang and K. Zhang. 
Multi-label learning by exploiting label dependency. In Proceedings of the 16th International Conference on Knowledge Discovery and Data Mining, pages 999\u20131008, 2010.\n\n[8] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In Advances in Neural Information Processing Systems 23, pages 163\u2013171, 2010.\n\n[9] J. Deng, S. Satheesh, A.C. Berg, and L. Fei-Fei. Fast and balanced: Efficient label tree learning for large scale object recognition. In Advances in Neural Information Processing Systems 24, pages 567\u2013575, 2011.\n\n[10] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor. Kernel-based learning of hierarchical multilabel classification models. Journal of Machine Learning Research, 7:1601\u20131626, 2006.\n\n[11] W. Bi and J.T. Kwok. Multi-label classification on tree- and DAG-structured hierarchies. In Proceedings of the 28th International Conference on Machine Learning, pages 17\u201324, 2011.\n\n[12] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Hierarchical classification: Combining Bayes with SVM. In Proceedings of the 23rd International Conference on Machine Learning, pages 177\u2013184, 2006.\n\n[13] L. Tang, S. Rajan, and V.K. Narayanan. Large scale multi-label classification via metalabeler. In Proceedings of the 18th International Conference on World Wide Web, pages 211\u2013220, 2009.\n\n[14] R. Cerri, A.C.P.L.F. de Carvalho, and A.A. Freitas. Adapting non-hierarchical multilabel classification methods for hierarchical multilabel classification. Intelligent Data Analysis, 15:861\u2013887, 2011.\n\n[15] G. Tsoumakas and I. Vlahavas. Random k-labelsets: An ensemble method for multilabel classification. In Proceedings of the 18th European Conference on Machine Learning, pages 406\u2013417, Warsaw, Poland, 2007.\n\n[16] N. Cesa-Bianchi, C. Gentile, A. Tironi, and L. Zaniboni. 
Incremental algorithms for hierarchical classification. In Advances in Neural Information Processing Systems 17, pages 233\u2013240, 2005.\n\n[17] J.H. Zaragoza, L.E. Sucar, and E.F. Morales. Bayesian chain classifiers for multidimensional classification. In Twenty-Second International Joint Conference on Artificial Intelligence, pages 2192\u20132197, 2011.\n\n[18] R.G. Baraniuk, V. Cevher, M.F. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Transactions on Information Theory, 56:1982\u20132001, 2010.\n\n[19] S.E. Shimony. Finding maps for belief networks is NP-hard. Artificial Intelligence, 68:399\u2013410, 1994.\n\n[20] C. Varin, N. Reid, and D. Firth. An overview of composite likelihood methods. Statistica Sinica, 21:5\u201342, 2011.\n\n[21] Y. Zhang and J. Schneider. A composite likelihood view for multi-label classification. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 1407\u20131415, 2012.\n\n[22] J. Zhou, J. Chen, and J. Ye. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University, 2012.\n\n[23] G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667\u2013685. Springer, 2010.\n", "award": [], "sourceid": 100, "authors": [{"given_name": "Wei", "family_name": "Bi", "institution": null}, {"given_name": "James", "family_name": "Kwok", "institution": null}]}