{"title": "Robust Bloom Filters for Large MultiLabel Classification Tasks", "book": "Advances in Neural Information Processing Systems", "page_first": 1851, "page_last": 1859, "abstract": "This paper presents an approach to multilabel classification (MLC) with a large number of labels. Our approach is a reduction to binary classification in which label sets are represented by low dimensional binary vectors. This representation follows the principle of Bloom filters, a space-efficient data structure originally designed for approximate membership testing. We show that a naive application of Bloom filters in MLC is not robust to individual binary classifiers' errors. We then present an approach that exploits a specific feature of real-world datasets when the number of labels is large: many labels (almost) never appear together. Our approach is provably robust, has sublinear training and inference complexity with respect to the number of labels, and compares favorably to state-of-the-art algorithms on two large scale multilabel datasets.", "full_text": "Robust Bloom Filters for Large Multilabel Classification Tasks\n\nMoustapha Ciss\u00e9\nLIP6, UPMC\nSorbonne Universit\u00e9\nParis, France\nfirst.last@lip6.fr\n\nNicolas Usunier\nUT Compi\u00e8gne, CNRS\nHeudiasyc UMR 7253\nCompi\u00e8gne, France\nnusunier@utc.fr\n\nThierry Artieres, Patrick Gallinari\nLIP6, UPMC\nSorbonne Universit\u00e9\nParis, France\nfirst.last@lip6.fr\n\nAbstract\n\nThis paper presents an approach to multilabel classification (MLC) with a large\nnumber of labels. Our approach is a reduction to binary classification in which\nlabel sets are represented by low dimensional binary vectors. This representation\nfollows the principle of Bloom filters, a space-efficient data structure originally\ndesigned for approximate membership testing. We show that a naive application\nof Bloom filters in MLC is not robust to individual binary classifiers\u2019 errors.
We\nthen present an approach that exploits a specific feature of real-world datasets\nwhen the number of labels is large: many labels (almost) never appear together.\nOur approach is provably robust, has sublinear training and inference complexity\nwith respect to the number of labels, and compares favorably to state-of-the-art\nalgorithms on two large scale multilabel datasets.\n\n1 Introduction\n\nMultilabel classification (MLC) is a classification task where each input may be associated with several\nclass labels, and the goal is to predict the label set given the input. This label set may, for instance,\ncorrespond to the different topics covered by a text document, or to the different objects that appear\nin an image. The standard approach to MLC is the one-vs-all reduction, also called Binary Relevance\n(BR) [16], in which one binary classifier is trained for each label to predict whether the label\nshould be predicted for that input. While BR remains the standard baseline for MLC problems, a lot\nof attention has recently been given to improving on it. The first main issue that has been addressed\nis to improve prediction performance at the expense of computational complexity by learning\ncorrelations between labels [5], [8], [9] or considering MLC as an unstructured classification problem\nover label sets in order to optimize the subset 0/1 loss (a loss of 1 is incurred as soon as the method\ngets one label wrong) [16]. The second issue is to design methods that scale to a large number of\nlabels (e.g. thousands or more), potentially at the expense of prediction performance, by learning\ncompressed representations of label sets with lossy compression schemes that are efficient when\nlabel sets have small cardinality [6]. We propose here a new approach to MLC in this latter line of\nwork.
An \u201cMLC dataset\u201d refers here to a dataset with a large number of labels (at least hundreds to\nthousands), in which the target label sets are smaller than the number of labels by one or several\norders of magnitude, which is common in large-scale MLC datasets collected from the Web.\nThe major difficulty in large-scale MLC problems is that the computational complexity of training\nand inference of standard methods is at least linear in the number of labels L. In order to scale better\nwith L, our approach to MLC is to encode individual labels on K-sparse bit vectors of dimension B,\nwhere B \u226a L, and use a disjunctive encoding of label sets (i.e. bitwise-OR of the codes of the labels\nthat appear in the label set). Then, we learn one binary classifier for each of the B bits of the coding\nvector, similarly to BR (where K = 1 and B = L). By setting K > 1, one can encode individual\nlabels unambiguously on far fewer than L bits while keeping the disjunctive encoding unambiguous\nfor a large number of label sets of small cardinality. Compared to BR, our scheme learns only B\nbinary classifiers instead of L, while preserving the desirable property that the classifiers can be\ntrained independently and thus in parallel, making our approach suitable for large-scale problems.\nThe critical point of our method is a simple scheme to select the K representative bits (i.e. those\nset to 1) of each label with two desirable properties. First, the encodings of \u201crelevant\u201d label sets are\nunambiguous with the disjunctive encoding. Second, the decoding step, which recovers a label\nset from an encoding vector, is robust to prediction errors in the encoding vector: in particular, we\nprove that the number of incorrectly predicted labels is no more than twice the number of incorrectly\npredicted bits.
Our (label) encoding scheme relies on the existence of mutually exclusive clusters\nof labels in real-life MLC datasets, where labels in different clusters (almost) never appear in the\nsame label set, but labels from the same cluster can. With our encoding scheme, B becomes\nsmaller as more clusters of similar size can be found. In practice, a strict partitioning of the labels\ninto mutually exclusive clusters does not exist, but it can be fairly well approximated by removing a\nfew of the most frequent labels, which are then handled with the standard BR approach, and clustering\nthe remaining labels based on their co-occurrence matrix. That way, we can control the encoding\ndimension B and deal with the computational cost/prediction accuracy tradeoff.\nOur approach was inspired and motivated by Bloom filters [2], a well-known space-efficient\nrandomized data structure designed for approximate membership testing. Bloom filters use exactly the\nprinciple of encoding objects (in our case, labels) by K-sparse vectors and encode a set with the\ndisjunctive encoding of its members. The filter can be queried with one object and the answer is\ncorrect up to a small error probability. The data structure is randomized because the representative\nbits of each object are obtained by random hash functions; under uniform probability assumptions\nfor the encoded set and the queries, the encoding size B of the Bloom filter is close to the\ninformation theoretic limit for the desired error rate. Such \u201crandom\u201d Bloom filter encodings are our main\nbaseline, and we consider our approach as a new design of the hash functions and of the decoding\nalgorithm to make Bloom filters robust to errors in the encoding vector. Some background on\n(random) Bloom filters, as well as how to apply them for MLC, is given in the next section.
The design\nof hash functions and the decoding algorithm are then described in Section 3, where we also discuss\nthe properties of our method compared to the related works of [12, 15, 4]. Finally, in Section 4, we\npresent experimental results on two benchmark MLC datasets with a large number of classes, which\nshow that our approach obtains promising performance compared to existing approaches.\n\n2 Bloom Filters for Multilabel Classification\n\nOur approach is a reduction from MLC to binary classification, where the rules of the reduction\nfollow a scheme inspired by the encoding/decoding of sets used in Bloom filters. We first describe the\nformal framework to fix the notation and the goal of our approach, and then give some background\non Bloom filters. The main contribution of the paper is described in the next section.\nFramework Given a set of labels L of size L, MLC is the problem of learning a prediction function\nc that, for each possible input x, predicts a subset of L. Throughout the paper, the letter y\nis used for label sets, while the letter \u2113 is used for individual labels. Learning is carried out on a\ntraining set ((x1, y1), ..., (xn, yn)) of inputs for which the desired label sets are known; we assume\nthe examples are drawn i.i.d. from the data distribution D.\nA reduction from MLC to binary classification relies on an encoding function e : y \u2286 L \u21a6\n(e1(y), ..., eB(y)) \u2208 {0, 1}^B, which maps subsets of L to bit vectors of size B. Then, each of the\nB bits is learnt independently by training a sequence of binary classifiers \u02c6e = (\u02c6e1, ..., \u02c6eB), where\neach \u02c6ej is trained on ((x1, ej(y1)), ..., (xn, ej(yn))). Given a new instance x, the encoding \u02c6e(x) is\npredicted, and the final multilabel classifier c is obtained by decoding \u02c6e(x), i.e. 
\u2200x, c(x) = d(\u02c6e(x)).\nThe goal of this paper is to design the encoding and decoding functions so that two conditions are\nmet. First, the code size B should be small compared to L, in order to improve the computational\ncost of training and inference relative to BR. Second, the reduction should be robust in the sense\nthat the final performance, measured by the expected Hamming loss HL(c) between the target label\nsets y and the predictions c(x), is not much larger than HB(\u02c6e), the average error of the classifiers we\nlearn. Using \u2206 to denote the symmetric difference between sets, HL and HB are defined by:\n\nHL(c) = E_{(x,y)\u223cD}[ |c(x) \u2206 y| / L ]   and   HB(\u02c6e) = (1/B) \u2211_{j=1}^{B} E_{(x,y)\u223cD}[ 1{ej(y) \u2260 \u02c6ej(x)} ] .   (1)\n\nHash values (left):\nlabel  h1  h2  h3\n\u21131  2  3  5\n\u21132  2  4  5\n\u21133  1  2  5\n\u21134  1  5  3\n\u21135  1  2  6\n\u21136  3  5  6\n\u21137  3  4  5\n\u21138  2  5  6\n\nEncodings (bits 1 to 6):\ne({\u21131}) = 0 1 1 0 1 0\ne({\u21134}) = 1 0 1 0 1 0\ne({\u21131, \u21133, \u21134}) = e({\u21131, \u21134}) = 1 1 1 0 1 0\n\nExample (right): (x, {\u21131, \u21134}) with predicted encoding (\u02c6e1(x), ..., \u02c6e6(x)) = 1 1 0 0 1 0, so that c(x) = d(\u02c6e(x)) = {\u21133}.\n\nFigure 1: Examples of a Bloom filter for a set L = {\u21131, ..., \u21138} with 8 elements, using 3 hash\nfunctions and 6 bits. (left) The table gives the hash values for each label. (middle-left) For each\nlabel, the hash functions give the index of the bits that are set to 1 in the 6-bit boolean vector. The\nexamples of the encodings for {\u21131} and {\u21134} are given. (middle-right) Example of a false positive:\nthe representation of the subset {\u21131, \u21134} includes all the representative bits of label \u21133 so that \u21133\nwould be decoded erroneously. (right) Example of propagation of errors: a single erroneous bit in\nthe label set encoding, together with a false positive, leads to three label errors in the final prediction.\n\nBloom Filters Given the set of labels L, a Bloom filter (BF) of size B uses K hash functions from\nL to {1, ..., B}, which we denote hk : L \u2192 {1, ..., B} for k \u2208 {1, ..., K} (in a standard approach,\neach value hk(\u2113) is chosen uniformly at random in {1, ..., B}). These hash functions define the\nrepresentative bits (i.e. non-zero bits) of each label: each singleton {\u2113} for \u2113 \u2208 L is encoded by a bit\nvector of size B with at most K non-zero bits, and each hash function gives the index of one of these\nnonzero bits in the bit vector. Then, the Bloom filter encodes a subset y \u2286 L by a bit vector of size\nB, defined by the bitwise OR of the bit vectors of the elements of y. Given the encoding of a set, the\nBloom filter can be queried to test the membership of any label \u2113; the filter answers positively if all\nthe representative bits of \u2113 are set to 1, and negatively otherwise.
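As an illustration, the encoding and querying operations just described can be sketched in Python with the hash values listed in Figure 1. This is a minimal sketch under stated assumptions, not the authors' implementation: the names HASHES, encode, query and decode are hypothetical, and bit indices are 1-based as in the figure.

```python
# Hash values of Figure 1 (1-based bit indices): label -> (h1, h2, h3).
HASHES = {1: (2, 3, 5), 2: (2, 4, 5), 3: (1, 2, 5), 4: (1, 5, 3),
          5: (1, 2, 6), 6: (3, 5, 6), 7: (3, 4, 5), 8: (2, 5, 6)}
B = 6  # size of the filter in bits


def encode(label_set):
    # Bitwise OR of the encodings of the labels in the set.
    bits = [0] * B
    for label in label_set:
        for h in HASHES[label]:
            bits[h - 1] = 1
    return bits


def query(bits, label):
    # Positive answer only if all representative bits of the label are set.
    return all(bits[h - 1] == 1 for h in HASHES[label])


def decode(bits):
    # Decoding queries every label in turn.
    return {label for label in HASHES if query(bits, label)}


print(encode({1}))                         # [0, 1, 1, 0, 1, 0]
print(encode({1, 4}))                      # [1, 1, 1, 0, 1, 0]
print(query(encode({1, 4}), 3))            # True (label 3 is a false positive)
print(sorted(decode([1, 1, 0, 0, 1, 0])))  # [3]
```

The last two calls correspond to the false positive and to the propagation of errors shown in Figure 1: with a single wrong bit in the encoding of labels 1 and 4, the decoded set contains only label 3.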
A negative answer of the Bloom\nfilter is always correct; however, the bitwise OR of label set encodings leads to the possibility of\nfalse positives, because even though any two labels have different encodings, the representative bits\nof one label can be included in the union of the representative bits of two or more other labels.\nFigure 1 (left) to (middle-right) give representative examples of the encoding/querying scheme of\nBloom filters and an example of false positive.\n\nBloom Filters for MLC The encoding and decoding schemes of BFs are appealing to define\nthe encoder e and the decoder d in a reduction of MLC to binary classification (decoding consists\nin querying each label), because they are extremely simple and computationally efficient, but also\nbecause, if we assume that B \u226a L and that the random hash functions are perfect, then, given a\nrandom subset of size C \u226a L, the false positive rate of a BF encoding this set is in O((1/2)^((B/C) ln(2))) for\nthe optimal number of hash functions. This rate is, up to a constant factor, the information theoretic\nlimit [3]. Indeed, as shown in Section 4, the use of Bloom filters with random hash functions for\nMLC (denoted S-BF for Standard BF hereafter) leads to rather good results in practice.\nNonetheless, there is much room for improvement with respect to the standard approach above.\nFirst, the distribution of label sets in usual MLC datasets is far from uniform.
On the one hand, this\nleads to a substantial increase in the error rate of the BF compared to the theoretical calculation, but,\non the other hand, it is an opportunity to make sure that false positive answers only occur in cases\nthat are detectable from the observed distribution of label sets: if y is a label set and \u2113 \u2209 y is a false\npositive given e(y), \u2113 can be detected as a false positive if we know that \u2113 never (or rarely) appears\ntogether with the labels in y. Second and more importantly, the decoding approach of BFs is far from\nrobust to errors in the predicted representation. Indeed, BFs are able to encode subsets on B \u226a L\nbits because each bit is representative for several labels. In the context of MLC, the consequence is\nthat any single bit incorrectly predicted may include in (or exclude from) the predicted label set all\nthe labels for which it is representative. Figure 1 (right) gives an example of the situation, where\na single error in the predicted encoding, combined with a false positive, results in 3 errors in the final\nprediction. Our main contribution, which we detail in the next section, is to use the non-uniform\ndistribution of label sets to design the hash functions and a decoding algorithm to make sure that\nany incorrectly predicted bit has a limited impact on the predicted label set.\n\n3 From Label Clustering to Hash Functions and Robust Decoding\n\nWe present a new method that we call Robust Bloom Filters (R-BF).
It improves over random hash\nfunctions by relying on a structural feature of the label sets in MLC datasets: many labels are never\nobserved in the same target set, or co-occur with a probability that is small enough to be neglected.\nWe first formalize the structural feature we use, which is a notion of mutually exclusive clusters of\nlabels, then we describe the hash functions and the robust decoding algorithm that we propose.\n\n3.1 Label Clustering\n\nThe strict formal property on which our approach is based is the following: given P subsets\nL1, ..., LP of L, we say that (L1, ..., LP) are mutually exclusive clusters if no target set contains\nlabels from more than one of the Lp, p = 1..P, or, equivalently, if the following condition holds:\n\n\u2200p \u2208 {1, ..., P},  P_{y\u223cDY}( (y \u2229 Lp \u2260 \u2205) and (y \u2229 \u22c3_{p'\u2260p} Lp' \u2260 \u2205) ) = 0 ,   (2)\n\nwhere DY is the marginal distribution over label sets. For the disjunctive encoding of Bloom filters,\nthis assumption implies that if we design the hash functions such that the false positives for a label\nset y belong to a cluster that is mutually exclusive with (at least one) label in y, then the decoding\nstep can detect and correct them. To that end, it is sufficient to ensure that for each bit of the Bloom filter,\nall the labels for which this bit is representative belong to mutually exclusive clusters. This will lead\nus to a simple two-step decoding algorithm: cluster identification, then label set prediction in the cluster.\nIn terms of compression ratio B/L, we can directly see that the more mutually exclusive clusters, the\nmore labels can share a single bit of the Bloom filter.
Thus, more (balanced) mutually exclusive\nclusters will result in a smaller encoding size B, making our method more efficient overall.\nThis notion of mutually exclusive clusters is much stronger than our basic observation that some pairs\nof labels rarely or never co-occur with each other, and in practice it may be difficult to find a\npartition of L into mutually exclusive clusters because the co-occurrence graph of labels is connected.\nHowever, as we shall see in the experiments, after removing the few most central labels (which we\ncall hubs, and which in practice roughly correspond to the most frequent labels), the labels can be clustered\ninto (almost) mutually exclusive clusters using a standard clustering algorithm for weighted graphs.\nIn our approach, the hubs are dealt with outside the Bloom filter, with a standard binary relevance\nscheme. The prediction for the remaining labels is then constrained to predict labels from at most\none of the clusters. From the point of view of prediction performance, we lose the possibility of\npredicting arbitrary label sets, but gain the possibility of correcting a non-negligible part of the\nincorrectly predicted bits. As we shall see in the experiments, the trade-off is very favorable. We would\nlike to note at this point that dealing with the hubs or the most frequent labels with binary relevance\nmay not particularly be a drawback of our approach: the distribution of the occurrence probabilities of the labels is\nlong-tailed, and the first few labels may be sufficiently important to deserve a special treatment.\nWhat really needs to be compressed is the large set of labels that occur rarely.\nTo find the label clustering, we first build the co-occurrence graph and remove the hubs using the\ndegree centrality measure.
The remaining labels are then clustered using the Louvain algorithm [1]; to\ncontrol the number of clusters, a maximum size is fixed and larger clusters are recursively clustered\nuntil they reach the desired size. Finally, to obtain (almost) balanced clusters, the smallest clusters\nare merged. Both the number of hubs and the cluster size are parameters of the algorithm, and, in\nSection 4, we show how to choose them before training at negligible computational cost.\n\n3.2 Hash functions and decoding\nFrom now on, we assume that we have access to a partition of L into mutually exclusive clusters (in\npractice, this corresponds to the labels that remain after removal of the hubs).\n\nHash functions Given the parameter K, constructing K-sparse encodings follows two conditions:\n\n1. two labels from the same cluster cannot share any representative bit;\n2. two labels from different clusters can share at most K \u2212 1 representative bits.\n\nbit index  representative for labels\n1  {1, 2, 3, 4, 5}\n2  {1, 6, 7, 8, 9}\n3  {2, 6, 10, 11, 12}\n4  {3, 7, 10, 13, 14}\n5  {4, 8, 11, 13, 15}\n6  {5, 9, 12, 14, 15}\n7  {16, 17, 18, 19, 20}\n8  {16, 21, 22, 23, 24}\n9  {17, 21, 25, 26, 27}\n10  {18, 22, 25, 28, 29}\n11  {19, 23, 26, 28, 30}\n12  {20, 24, 27, 29, 30}\n\ncluster index  labels in cluster\n1  {1, 16}\n2  {2, 17}\n3  {3, 18}\n4  {4, 19}\n5  {5, 20}\n6  {6, 21}\n7  {7, 22}\n8  {8, 23}\n9  {9, 24}\n10  {10, 25}\n11  {11, 26}\n12  {12, 27}\n13  {13, 28}\n14  {14, 29}\n15  {15, 30}\n\nFigure 2: Representative bits for 30 labels partitioned into P = 15 mutually exclusive label clusters\nof size R = 2, using K = 2 representative bits per label and batches of Q = 6 bits. The second table\ngives the label clustering. The injective mapping between labels and subsets of bits is defined\nby g : \u2113 \u21a6 {g1(\u2113) = (1 + \u2113)/6, g2(\u2113) = 1 + \u2113 mod 6} for \u2113 \u2208 {1, ..., 15} and, for \u2113 \u2208 {16, ..., 30},\nit is defined by \u2113 \u21a6 {6 + g1(\u2113 \u2212 15), 6 + g2(\u2113 \u2212 15)}.\n\nFinding an encoding that satisfies the conditions above is not difficult if we consider, for each label,\nthe set of its representative bits. In the rest of the paragraph, we say that a bit of the Bloom filter \u201cis\nused for the encoding of a label\u201d when this bit may be a representative bit of the label. If the bit \u201cis\nnot used for the encoding of a label\u201d, then it cannot be a representative bit of the label.\nLet us consider the P mutually exclusive label clusters, and denote by R the size of the largest\ncluster. To satisfy Condition 1., we find an encoding on B = R.Q bits for Q \u2265 K and P \u2264 (Q choose K)\nas follows. For a given r \u2208 {1, ..., R}, the r-th batch of Q successive bits (i.e. the bits of index\n(r \u2212 1)Q + 1, (r \u2212 1)Q + 2, ..., rQ) is used only for the encoding of the r-th label of each cluster.\nThat way, each batch of Q bits is used for the encoding of a single label per cluster (enforcing the\nfirst condition) but can be used for the encoding of P labels overall. For Condition 2., we notice\nthat given a batch of Q bits, there are (Q choose K) different subsets of K bits. We then injectively map the (at\nmost) P labels to the subsets of size K to define the K representative bits of these labels. In the end,\nwith a Bloom filter of size B = R.Q, we have K-sparse encodings that satisfy the two conditions\nabove for L \u2264 R.(Q choose K) labels partitioned into P \u2264 (Q choose K) mutually exclusive clusters of size at most R.\nFigure 2 gives an example of such an encoding. In the end, the scheme is most efficient (in terms of\nthe compression ratio B/L) when the clusters are perfectly balanced and when P is exactly equal\nto (Q choose K) for some Q. For instance, for K = 2 that we use in our experiments, if P = Q(Q\u22121)/2 for\nsome integer Q, and if the clusters are almost perfectly balanced, then B/L \u2248 \u221a(2/P). The ratio\nbecomes more and more favorable as both Q increases and K increases up to Q/2, but the number\nof different clusters P must also be large. Thus, the method should be most efficient on datasets\nwith a very large number of labels, assuming that P increases with L in practice.\n\nDecoding and Robustness We now present the decoding algorithm, followed by a theoretical\nguarantee that each incorrectly predicted bit in the Bloom filter cannot imply more than 2 incorrectly\npredicted labels.\nGiven an example x and its predicted encoding \u02c6e(x), the predicted label set d(\u02c6e(x)) is computed\nwith the following two-step process, in which we say that a bit is \u201crepresentative of one cluster\u201d if it\nis a representative bit of one label in the cluster:\n\na. (Cluster Identification) For each cluster Lp, compute its cluster score sp defined as the\nnumber of its representative bits that are set to 1 in \u02c6e(x). Choose L\u02c6p for \u02c6p \u2208 arg max_{p\u2208{1,...,P}} sp;\n\nb. (Label Set Prediction) For each label \u2113 \u2208 L\u02c6p, let s'_\u2113 be the number of representative bits of\n\u2113 set to 1 in \u02c6e(x); add \u2113 to d(\u02c6e(x)) with probability s'_\u2113 / K.\n\nIn case of ties in the cluster identification, the tie-breaking rule can be arbitrary.
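A minimal Python sketch of the construction of the representative bits and of the two-step decoding is given here under stated assumptions. The function names (build_codes, encode, decode) are hypothetical, bit indices are 0-based, cluster p is assumed to contain labels p and p + 15 in the 30-label example of Figure 2, the p-th cluster is mapped to the p-th K-subset of a batch in lexicographic order, and ties in step a. are broken by taking the first maximal cluster.

```python
import random
from itertools import combinations


def build_codes(clusters, K, Q):
    # The r-th label of every cluster is encoded in the r-th batch of Q bits
    # (Condition 1); the p-th cluster always uses the p-th K-subset of a
    # batch, so labels of different clusters share at most K - 1 bits
    # (Condition 2).
    subsets = list(combinations(range(Q), K))
    if len(clusters) > len(subsets):
        raise ValueError('need P <= (Q choose K) clusters')
    codes = {}
    for p, cluster in enumerate(clusters):
        for r, label in enumerate(cluster):
            codes[label] = frozenset(r * Q + j for j in subsets[p])
    return codes


def encode(label_set, codes, B):
    # Disjunctive (bitwise OR) encoding of a label set.
    bits = [0] * B
    for label in label_set:
        for j in codes[label]:
            bits[j] = 1
    return bits


def decode(bits, clusters, codes, K, rng):
    # Step a: cluster score = number of representative bits set to 1.
    def score(cluster):
        return sum(bits[j] for label in cluster for j in codes[label])
    best = max(clusters, key=score)  # first maximal cluster on ties
    # Step b: add each label of the chosen cluster with probability s'/K.
    predicted = set()
    for label in best:
        s = sum(bits[j] for j in codes[label])
        if rng.random() < s / K:
            predicted.add(label)
    return predicted


clusters = [(p, p + 15) for p in range(1, 16)]  # assumed clustering
codes = build_codes(clusters, K=2, Q=6)         # B = R.Q = 12 bits
bits = encode({3, 18}, codes, B=12)
print(sorted(decode(bits, clusters, codes, 2, random.Random(0))))  # [3, 18]
```

With an exact encoding, every label of the target set has all K of its representative bits set, so it is added with probability 1, and the other labels of the chosen cluster have none of their bits set. The tie-breaking rule used in this sketch is only one possible choice.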
For instance,\nin our experiments, we use logistic regression as base learners for binary classifiers, so we have\naccess to posterior probabilities of being 1 for each bit of the Bloom filter. In case of ties in the\ncluster identification, we restrict our attention to the clusters that maximize the cluster score, and we\nrecompute their cluster scores using the posterior probabilities instead of the binary decision. The\ncluster which maximizes the new cluster score is chosen. The choice of a randomized prediction for\nthe labels prevents a single incorrectly predicted bit from resulting in too many incorrectly predicted labels.\nThe robustness of the encoding/decoding scheme is proved below:\nTheorem 1 Let L be the label set, and let (L1, ..., LP) be a partition of L satisfying (2). Assume that\nthe encoding function satisfies Conditions 1. and 2., and that decoding is performed in the two-step\nprocess a.-b. Then, using the definitions of HL and HB of (1), we have:\n\nHL(d \u25e6 \u02c6e) \u2264 (2B/L) HB(\u02c6e)\n\nfor a K-sparse encoding, where the expectation in HL is also taken over the randomized predictions.\n\nSketch of proof Let (x, y) be an example. We compare the expected number of incorrectly predicted\nlabels H^L(y, d(\u02c6e(x))) = E[ |d(\u02c6e(x)) \u2206 y| ] (expectation taken over the randomized prediction) and\nthe number of incorrectly predicted bits H^B(\u02c6e(x), e(y)) = \u2211_{j=1}^{B} 1{\u02c6ej(x) \u2260 ej(y)}. Let us denote by\np\u2217 the index of the cluster in which y is included, and \u02c6p the index of the cluster chosen in step a. We\nconsider the two following cases:\n\u02c6p = p\u2217: if the cluster is correctly identified then each incorrectly predicted bit that is representative\nfor the cluster costs 1/K in H^L(y, d(\u02c6e(x))). All other bits do not matter. We thus have\nH^L(y, d(\u02c6e(x))) \u2264 (1/K) H^B(\u02c6e(x), e(y)).\n\u02c6p \u2260 p\u2217: if the cluster is not correctly identified, then H^L(y, d(\u02c6e(x))) is the sum of (1) the number\nof labels that should be predicted but are not (|y|), and (2) the labels that are in the predicted\nlabel set but that should not. To bound the ratio H^L(y, d(\u02c6e(x))) / H^B(\u02c6e(x), e(y)), we first notice that there are\nat least as many representative bits predicted as 1 for L\u02c6p as for Lp\u2217. Since each label\nof L\u02c6p shares at most K \u2212 1 representative bits with a label of Lp\u2217, there are at least |y|\nincorrect bits. Moreover, the maximum contribution to labels predicted in the incorrect\ncluster by correctly predicted bits is at most ((K \u2212 1)/K) |y|. Each additional contribution of 1/K\nin H^L(y, d(\u02c6e(x))) comes from a bit that is incorrectly predicted to 1 instead of 0 (and is\nrepresentative for L\u02c6p). Let us denote by k the number of such contributions. Then, the least\nfavorable ratio H^L(y, d(\u02c6e(x))) / H^B(\u02c6e(x), e(y)) is smaller than\nmax_{k\u22650} [ k/K + |y| (1 + (K \u2212 1)/K) ] / max(|y|, k) = [ |y|/K + |y| (1 + (K \u2212 1)/K) ] / |y| = 2.\nTaking the expectation over (x, y) completes the proof (B/L comes from normalization factors). \u25a1\n\n3.3 Comparison to Related Works\n\nThe use of correlations between labels has a long history in MLC [11], [8], [14], but correlations\nare most often used to improve prediction performance at the expense of computational complexity\nthrough increasingly complex models, rather than to improve computational complexity using strong\nnegative correlations as we do here.\nThe work most closely related to ours is that of Hsu et al. [12], where the authors propose\nan approach based on compressed sensing to obtain low-dimension encodings of label sets.\nTheir approach has the advantage of a theoretical guarantee in terms of regret (rather than error\nas we do), without strong structural assumptions on the label sets; the complexity of learning\nscales in O(C ln(L)) where C is the number of labels in label sets. For our approach, since\n(Q choose Q/2) \u223c 4^(Q/2) / \u221a(8\u03c0Q) as Q \u2192 \u221e, it could be possible to obtain a logarithmic rate under the rather strong\nassumption that the number of clusters P increases linearly with L. As we shall see in our experiments,\nhowever, even with a rather large number of labels (e.g. 1 000), the asymptotic logarithmic\nrate is far from being achieved for all methods. In practice, the main drawback of their method is\nthat they need to know the size of the label set to predict. This is an extremely strong requirement\nwhen classification decisions are needed (less strong when only a ranking of the labels is needed),\nin contrast to our method which is inherently designed for classification.\nAnother related work is that of [4], which is based on SVD for dimensionality reduction rather than\ncompressed sensing. Their method can exploit correlations between labels, and take classification\ndecisions. However, their approach is purely heuristic, and no theoretical guarantee is given.\n\nFigure 3: (left) Unrecoverable Hamming loss (UHL) due to label clustering of the R-BF as a function\nof the code size B on RCV-Industries (similar behavior on the Wikipedia1k dataset). The optimal\ncurve represents the best UHL over different settings (number of hubs, max cluster size) for a given\ncode size. (right) Hamming loss vs code size on RCV-Industries for different methods.\n[Plots not reproduced. Left panel: uncoverable loss (\u00d710^4) vs B, with curves for hubs = 0, hubs = 20 and Optimal. Right panel: test Hamming loss (\u00d710^3) vs B, with curves for SBF, PLST, CS-OMP and R-BF.]\n\n4 Experiments\n\nWe performed experiments on two large-scale real world datasets: RCV-Industries, which is a subset\nof the RCV1 dataset [13] that considers the industry categories only (we used the first testing set file\nfrom the RCV1 site instead of the original training set since it is larger), and Wikipedia1k, which is\na subsample of the Wikipedia dataset release of the 2012 large scale hierarchical text classification\nchallenge [17]. On both datasets, the labels are originally organized in a hierarchy, but we\ntransformed them into plain MLC datasets by keeping only leaf labels. For RCV-Industries, we obtain\n303 labels for 72,334 examples. The average cardinality of label sets is 1.73 with a maximum of\n30; 20% of the examples have label sets of cardinality \u2265 2. For Wikipedia1k, we kept the 1,000\nmost represented leaf labels, which leads to 110,530 examples with an average label set cardinality\nof 1.11 (max. 5). 10% of the examples have label sets of cardinality \u2265 2.\nWe compared our methods, the standard (i.e. 
with random hash functions) BF (S-BF) and the Robust\nBF (R-BF) presented in Section 3, to binary relevance (BR) and to three MLC algorithms designed\nfor MLC problems with a large number of labels: a pruned version of BR proposed in [7] (called\nBR-Dekel from now on), the compressed sensing approach (CS) of [12] and the principal label\nspace transformation (PLST) [4]. BR-Dekel consists in removing from the prediction all the labels\nwhose probability of a true positive (PTP) on the validation set is smaller than the probability of\na false positive (PFP). To control the code size B in BR-Dekel, we rank the labels based on the\nratio PTP/PFP and keep the top B labels. In that case, the inference complexity is similar to BF\nmodels, but the training complexity is still linear in L. For CS, following [4], we used orthogonal\nmatching pursuit (CS-OMP) for decoding and selected the number of labels to predict in the range\n{1, 2, . . . , 30}, on the validation set. For S-BF, the number of (random) hash functions K is also\nchosen on the validation set among {1, 2, . . . , 10}. For R-BF, we use K = 2 hash functions.\nThe code size B can be freely set for all methods except for Robust BF, where different settings of\nthe maximum cluster size and the number of hubs may lead to the same code size. Since the use\nof a label clustering in R-BF leads to unrecoverable errors even if the classifiers perform perfectly\nwell (because labels of different clusters cannot be predicted together), we chose the max cluster size\namong {10, 20, . . . , 50} and the number of hubs (among {0, 10, 20, 30, . . . , 100} for RCV-Industries\nand {0, 50, 100, . . . , 300} for Wikipedia1k) that minimize the resulting unrecoverable Hamming loss\n(UHL), computed on the training set.
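One plausible way to compute the UHL of a candidate setting is sketched here in Python. The exact definition is not spelled out in the text, so the following is an assumption: for each label set, the non-hub labels that fall outside its best-covered cluster are counted as unrecoverable, and the total is normalized by the number of examples times the number of labels L. The names unrecoverable_hamming_loss and cluster_of are hypothetical.

```python
from collections import Counter


def unrecoverable_hamming_loss(label_sets, cluster_of, num_labels):
    # cluster_of maps each non-hub label to its cluster id. Hubs are absent
    # from the mapping because they are handled by binary relevance and can
    # always be predicted.
    total = 0
    for y in label_sets:
        counts = Counter(cluster_of[label] for label in y if label in cluster_of)
        if counts:
            # Only labels from a single cluster can be predicted together.
            total += sum(counts.values()) - max(counts.values())
    return total / (len(label_sets) * num_labels)


# Toy data: labels 1 and 2 share a cluster, label 3 is alone, label 4 is a hub.
toy_sets = [{1, 2}, {1, 3}, {4}]
toy_clusters = {1: 0, 2: 0, 3: 1}
print(unrecoverable_hamming_loss(toy_sets, toy_clusters, num_labels=4))
```

In this toy example only the second label set spans two clusters, so exactly one label is unrecoverable over 3 examples and 4 labels. Evaluating such a function over a grid of settings only needs the label sets, not the trained classifiers.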
Figure 3 (left) shows that the UHL decreases as the number of hubs increases, since the method then becomes closer to BR, but at the same time the overall code size B increases because it is the sum of the filter\u2019s size and the number of hubs. Nonetheless, we can observe on the figure that the UHL rapidly reaches a very low value, confirming that the label clustering assumption is reasonable in practice.\n\nAll the methods involve training binary classifiers or regression functions. On both datasets, we used linear functions with L2 regularization (the global regularization factor in PLST and CS-OMP, as well as the regularization factor of each binary classifier in BF and BR approaches, were chosen on the validation set among {0, 0.1, ..., 10^-5}), and unit-norm normalized TF-IDF features. We used the Liblinear [10] implementation of logistic regression as base binary classifier.\n\nTable 1: Test Hamming loss (HL, in %), micro (m-F1) and macro (M-F1) F1-scores. B is the code size. 
The results of the significance test for a p-value less than 5% are denoted \u2020 to indicate the best performing method using the same B and \u2217 to indicate the best performing method overall.\n\nRCV-Industries\nClassifier  B     HL         m-F1      M-F1\nBR          303   0.200\u2217     72.43\u2217    47.82\u2217\nBR-Dekel    150   0.308      46.98     30.14\nBR-Dekel    200   0.233      65.78     40.09\nS-BF        150   0.223      67.45     40.29\nS-BF        200   0.217      68.32     40.95\nR-BF        150   0.210\u2020     71.31\u2020    43.44\nR-BF        200   0.205\u2020     71.86\u2020    44.57\nCS-OMP      150   0.246      67.59     45.22\u2020\nCS-OMP      200   0.245      67.71     45.82\u2020\nPLST        150   0.226      68.87     32.36\nPLST        200   0.221      70.35     40.78\n\nWikipedia1k\nClassifier  B     HL         m-F1      M-F1\nBR          1000  0.0711     55.96     34.7\nBR-Dekel    250   0.0984     22.18     12.16\nBR-Dekel    500   0.0868     38.33     24.52\nS-BF        250   0.0742     53.02     31.41\nS-BF        500   0.0734     53.90     32.57\nR-BF        240   0.0728\u2020    55.85     34.65\nR-BF        500   0.0705\u2020\u2217   57.31     36.85\nCS-OMP      250   0.0886     57.96\u2020    41.84\u2020\nCS-OMP      500   0.0875     58.46\u2020\u2217   42.52\u2020\u2217\nPLST        250   0.0854     42.45     09.53\nPLST        500   0.0828     45.95     16.73\n\nResults. Table 1 gives the test performances of all the methods on both datasets for different code sizes. We are mostly interested in the Hamming loss but we also provide the micro and macro F-measure. The results are averaged over 10 random splits of train/validation/test of the datasets, respectively containing 50%/25%/25% of the data. The standard deviations of the values are negligible (smaller than 10^-3 times the value of the performance measure). Our BF methods seem to clearly outperform all other methods and R-BF yields significant improvements over S-BF. On Wikipedia1k, with 500 classifiers, the Hamming loss (in %) of S-BF is 0.0734 while it is only 0.0705 for R-BF. 
This performance is similar to that of BR (0.0711), which uses twice as many classifiers. The simple pruning strategy BR-Dekel is the worst baseline on both datasets, confirming that considering all classes is necessary on these datasets. CS-OMP reaches a much higher Hamming loss (about 23% worse than BR on both datasets when using 50% fewer classifiers). CS-OMP achieves the best performance on the macro-F measure though. This is because the size of the predicted label sets is fixed for CS, which increases recall but leads to poor precision. We used OMP as decoding procedure for CS since it seemed to perform better than Lasso and Correlation decoding (CD) [12] (for instance, on Wikipedia1k with a code size of 500, OMP achieves a Hamming loss of 0.0875 while the Hamming loss is 0.0894 for Lasso and 0.1005 for CD). PLST improves over CS-OMP but its performances are lower than those of S-BF (about 3.5% on RCV-Industries and 13% on Wikipedia1k when using 50% fewer classifiers than BR). The macro F-measure indicates that PLST likely suffers from class imbalance (only the most frequent labels are predicted), probably because the label set matrix on which SVD is performed is dominated by the most frequent labels. Figure 3 (right) gives the general picture of the Hamming loss of the methods on a larger range of code sizes. Overall, R-BF has the best performances except for very small code sizes, where the UHL becomes too high.\n\nRuntime analysis. Experiments were performed on a computer with 24 Intel Xeon 2.6 GHz CPUs. For all methods, the overall training time is dominated by the time to train the binary classifiers or regressors, which depends linearly on the code size. At test time, the cost is also dominated by the classifiers\u2019 predictions, and the decoding algorithm of R-BF is the fastest. 
For instance, on Wikipedia1k, training one binary classifier takes 12.35s on average, and inference with one classifier (for the whole test dataset) takes 3.18s. Thus, BR requires about 206 minutes (1000 \u00d7 12.35s) for training and 53 minutes for testing on the whole test set. With B = 500, R-BF requires about half that time, including the selection of the number of hubs and the max. cluster size at training time, which is small (computing the UHL of an R-BF configuration takes 9.85s, including the label clustering step, and we try fewer than 50 of them). For the same B, encoding for CS takes 6.24s and the SVD in PLST takes 81.03s, while decoding takes 24.39s at test time for CS and 7.86s for PLST.\n\nAcknowledgments\n\nThis work was partially supported by the French ANR as part of the project Class-Y (ANR-10-BLAN-02) and carried out in the framework of the Labex MS2T (ANR-11-IDEX-0004-02).\n\nReferences\n\n[1] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 10, 2008.\n\n[2] B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422\u2013426, 1970.\n\n[3] L. Carter, R. Floyd, J. Gill, G. Markowsky, and M. Wegman. Exact and approximate membership testers. In Proceedings of the tenth annual ACM symposium on Theory of computing, STOC \u201978, pages 59\u201365, New York, NY, USA, 1978. ACM.\n\n[4] Y.-N. Chen and H.-T. Lin. Feature-aware label space dimension reduction for multi-label classification. In NIPS, pages 1538\u20131546, 2012.\n\n[5] W. Cheng and E. H\u00fcllermeier. Combining instance-based learning and logistic regression for multilabel classification. Machine Learning, 76(2-3):211\u2013225, 2009.\n\n[6] K. Christensen, A. Roginsky, and M. Jimeno. A new analysis of the false positive rate of a Bloom filter. Inf. Process. 
Lett., 110(21):944\u2013949, Oct. 2010.\n\n[7] O. Dekel and O. Shamir. Multiclass-multilabel classification with more classes than examples. volume 9, pages 137\u2013144, 2010.\n\n[8] K. Dembczynski, W. Cheng, and E. H\u00fcllermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In ICML, pages 279\u2013286, 2010.\n\n[9] K. Dembczynski, W. Waegeman, W. Cheng, and E. H\u00fcllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5\u201345, 2012.\n\n[10] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. J. Mach. Learn. Res., 9:1871\u20131874, June 2008.\n\n[11] B. Hariharan, S. V. N. Vishwanathan, and M. Varma. Large Scale Max-Margin Multi-Label Classification with Prior Knowledge about Densely Correlated Labels. In Proceedings of International Conference on Machine Learning, 2010.\n\n[12] D. Hsu, S. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In NIPS, pages 772\u2013780, 2009.\n\n[13] RCV1. RCV1 Dataset, http://www.daviddlewis.com/resources/testcollections/rcv1/.\n\n[14] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II, ECML PKDD \u201909, pages 254\u2013269, Berlin, Heidelberg, 2009. Springer-Verlag.\n\n[15] F. Tai and H.-T. Lin. Multilabel classification with principal label space transformation. Neural Computation, 24(9):2508\u20132542, 2012.\n\n[16] G. Tsoumakas, I. Katakis, and I. Vlahavas. A Review of Multi-Label Classification Methods. In Proceedings of the 2nd ADBIS Workshop on Data Mining and Knowledge Discovery (ADMKD 2006), pages 99\u2013109, 2006.\n\n[17] Wikipedia. 
Wikipedia Dataset, http://lshtc.iit.demokritos.gr/.\n", "award": [], "sourceid": 933, "authors": [{"given_name": "Moustapha", "family_name": "Cisse", "institution": "LIP6/UPMC"}, {"given_name": "Nicolas", "family_name": "Usunier", "institution": "Universit\u00e9 de Technologie de Compi\u00e8gne (UTC)"}, {"given_name": "Thierry", "family_name": "Arti\u00e8res", "institution": "LIP6/UPMC"}, {"given_name": "Patrick", "family_name": "Gallinari", "institution": "LIP6/UPMC"}]}