{"title": "A no-regret generalization of hierarchical softmax to extreme multi-label classification", "book": "Advances in Neural Information Processing Systems", "page_first": 6355, "page_last": 6366, "abstract": "Extreme multi-label classification (XMLC) is a problem of tagging an instance with a small subset of relevant labels chosen from an extremely large pool of possible labels. Large label spaces can be efficiently handled by organizing labels as a tree, like in the hierarchical softmax (HSM) approach commonly used for multi-class problems. In this paper, we investigate probabilistic label trees (PLTs) that have been recently devised for tackling XMLC problems. We show that PLTs are a no-regret multi-label generalization of HSM when precision@$k$ is used as a model evaluation metric. Critically, we prove that pick-one-label heuristic---a reduction technique from multi-label to multi-class that is routinely used along with HSM---is not consistent in general. We also show that our implementation of PLTs, referred to as extremeText (XT), obtains significantly better results than HSM with the pick-one-label heuristic and XML-CNN, a deep network specifically designed for XMLC problems. Moreover, XT is competitive to many state-of-the-art approaches in terms of statistical performance, model size and prediction time which makes it amenable to deploy in an online system.", "full_text": "A no-regret generalization of hierarchical softmax to\n\nextreme multi-label classi\ufb01cation\n\nMarek Wydmuch\n\nInstitute of Computing Science\n\nPoznan University of Technology, Poland\n\nmwydmuch@cs.put.poznan.pl\n\nKalina Jasinska\n\nInstitute of Computing Science\n\nPoznan University of Technology, Poland\n\nkjasinska@cs.put.poznan.pl\n\nMikhail Kuznetsov\nYahoo! Research\nNew York, USA\n\nkuznetsov@oath.com\n\nR\u00f3bert Busa-Fekete\n\nYahoo! 
Research\nNew York, USA\n\nbusafekete@oath.com\n\nKrzysztof Dembczy\u0144ski\n\nInstitute of Computing Science\n\nPoznan University of Technology, Poland\n\nkdembczynski@cs.put.poznan.pl\n\nAbstract\n\nExtreme multi-label classification (XMLC) is a problem of tagging an instance with\na small subset of relevant labels chosen from an extremely large pool of possible\nlabels. Large label spaces can be efficiently handled by organizing labels as a tree,\nlike in the hierarchical softmax (HSM) approach commonly used for multi-class\nproblems. In this paper, we investigate probabilistic label trees (PLTs) that have\nbeen recently devised for tackling XMLC problems. We show that PLTs are a\nno-regret multi-label generalization of HSM when precision@k is used as a model\nevaluation metric. Critically, we prove that the pick-one-label heuristic, a reduction\ntechnique from multi-label to multi-class that is routinely used along with HSM, is\nnot consistent in general. We also show that our implementation of PLTs, referred\nto as EXTREMETEXT (XT), obtains significantly better results than HSM with the\npick-one-label heuristic and XML-CNN, a deep network specifically designed for\nXMLC problems. Moreover, XT is competitive with many state-of-the-art approaches\nin terms of statistical performance, model size, and prediction time, which makes it\nsuitable for deployment in an online system.\n\n1 Introduction\n\nIn several machine learning applications, the label space can be enormous, containing even millions\nof different classes.
Learning problems of this scale are often referred to as extreme classification.\nTo name a few examples of such problems, consider image and video annotation for multimedia\nsearch (Deng et al., 2011), tagging of text documents for categorization of Wikipedia articles (Dekel\n& Shamir, 2010), recommendation of bid words for online ads (Prabhu & Varma, 2014), or prediction\nof the next word in a sentence (Mikolov et al., 2013).\nTo tackle extreme classification problems in an efficient way, one can organize the labels into a\ntree. A prominent example of such a label tree model is hierarchical softmax (HSM) (Morin &\nBengio, 2005), often used with neural networks to speed up computations in multi-class classification\nwith large output spaces. For example, it is commonly applied in natural language processing\nproblems such as language modeling (Mikolov et al., 2013). To adapt HSM to extreme multi-label\nclassification (XMLC), several very popular tools, such as FASTTEXT (Joulin et al., 2016) and\nLEARNED TREE (Jernite et al., 2017), apply the pick-one-label heuristic. As the name suggests, this\nheuristic randomly picks one of the labels from a multi-label training example and treats the example\nas a multi-class one.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this work, we exhaustively investigate the multi-label extensions of HSM. First, we show that the\npick-one-label strategy does not lead to a proper generalization of HSM for the multi-label setting. More\nprecisely, we prove that, using the pick-one-label reduction, one cannot expect any multi-class learner\nto achieve zero regret in terms of marginal probability estimation and maximization of precision@k.\nAs a remedy to this issue, we revisit probabilistic label trees (PLTs) (Jasinska et al.,\n2016) that have been recently introduced for solving XMLC problems.
We show that PLTs are a\ntheoretically motivated generalization of HSM to multi-label classification, that is, 1) PLTs and\nHSM are identical in the multi-class case, and 2) a PLT model can get zero regret (i.e., it is consistent)\nin terms of marginal probability estimation and precision@k in the multi-label setting.\nBesides our theoretical findings, we provide an efficient implementation of PLTs, referred to as XT,\nthat we build upon FASTTEXT. The comprehensive empirical evaluation shows that it gets significantly\nbetter results than the original FASTTEXT, LEARNED TREE, and XML-CNN, a deep network\nspecifically designed for XMLC problems. XT also achieves results competitive with other state-of-the-art\napproaches, being very efficient in model size and prediction time, particularly in the online setting.\nThis paper is organized as follows. First, we discuss the related work and situate our approach in\nthat context. In Section 3 we formally state the XMLC problem and present some useful theoretical\ninsights. Next, we briefly introduce the HSM approach, and in Section 5 we show theoretical results\nconcerning the pick-one-label heuristic. Section 6 formally introduces PLTs and presents the main\ntheoretical results concerning them and their relation to HSM. Section 7 provides implementation\ndetails of PLTs. The experimental results are presented in Section 8. Finally, we make concluding\nremarks.\n\n2 Related work\n\nHistorically, problems with a large number of labels were usually solved by nearest neighbor or\ndecision tree methods. Some of today\u2019s algorithms are still based on these classical approaches,\nsignificantly extending them by a number of new tricks. If the label space is of moderate size (e.g.,\na few thousand labels), then an independent model can be trained for each class. This is the\nso-called 1-VS-ALL approach.
Unfortunately, it scales linearly with the number of labels, which\nis too costly for many applications. The extreme classification algorithms try to improve over this\napproach by following different paradigms such as sparsity of labels (Yen et al., 2017; Babbar &\nSch\u00f6lkopf, 2017), low-rank approximation (Mineiro & Karampatziakis, 2015; Yu et al., 2014; Bhatia\net al., 2015), tree-based search (Prabhu & Varma, 2014; Choromanska & Langford, 2015), or label\nfiltering (Vijayanarasimhan et al., 2014; Shrivastava & Li, 2015; Niculescu-Mizil & Abbasnejad,\n2017).\nIn this paper we focus on tree-based algorithms, therefore we discuss them here in more detail. There\nare two distinct types of these algorithms: decision trees and label trees. The former type follows\nthe idea of classical decision trees. However, the direct use of the classic algorithms can be very\ncostly (Agrawal et al., 2013). Therefore, the FASTXML algorithm (Prabhu & Varma, 2014) tackles\nthe problem in a slightly different way. It uses sparse linear classifiers in internal tree nodes to split\nthe feature space. Each linear classifier is trained on two classes that are formed in a random way\nfirst and then reshaped by optimizing the normalized discounted cumulative gain. To improve the\noverall accuracy, FASTXML uses an ensemble of trees. This algorithm, like many other decision tree\nmethods, works in a batch mode. Choromanska & Langford (2015) have succeeded in introducing a\nfully online decision tree algorithm that also uses linear classifiers in internal nodes of the tree.\nIn label trees each label corresponds to one and only one path from the root to a leaf. Besides PLTs\nand HSM, there exist several other instances of this approach, for example, filter trees (Beygelzimer\net al., 2009b; Li & Lin, 2014) or label embedding trees (Bengio et al., 2010).
It is also worth\nunderlining that algorithms similar to HSM have been introduced independently in many different\nresearch fields, such as nested dichotomies (Fox, 1997) in statistics, conditional probability estimation\ntrees (Beygelzimer et al., 2009a) in multi-class classification, multi-stage classifiers (Kurzynski, 1988)\nin pattern recognition, and probabilistic classifier chains (Dembczynski et al., 2010) in multi-label\nclassification under the subset 0/1 loss. All these methods have been jointly analyzed in (Dembczynski\net al., 2016).\nA still open problem in label tree approaches is the tree structure learning. FASTTEXT (Joulin\net al., 2016) uses HSM with a Huffman tree built on the label frequencies. Jernite et al. (2017)\nhave introduced a new algorithm, called LEARNED TREE, which combines HSM with a specific\nhierarchical clustering that reassigns labels to paths in the tree in a semi-online manner. Prabhu et al.\n(2018) follow another approach in which a model similar to PLTs is trained in a batch mode and a\ntree is built by recursively applying balanced k-means over the label profiles. In Section 7 we discuss\nthis approach in more detail.\nThe HSM model is often used as an output layer in neural networks. The FASTTEXT implementation\ncan also be viewed as a shallow architecture with one hidden layer that represents instances as\naveraged feature (i.e., word) vectors. Another neural network-based model designed for XMLC\nhas been introduced in (Liu et al., 2017). This model, referred to as XML-CNN, uses a complex\nconvolutional deep network with a narrow last layer to make it work with large output spaces. As we\nshow in the experimental part, this quite expensive architecture gets inferior results in comparison to\nour PLTs built upon FASTTEXT.\n\n3 Problem statement\nLet $\\mathcal{X}$ denote an instance space, and let $\\mathcal{L} = \\{1, \\ldots, m\\}$ be a finite set of $m$ class labels.
We assume\nthat an instance $x \\in \\mathcal{X}$ is associated with a subset of labels $\\mathcal{L}_x \\in 2^{\\mathcal{L}}$ (the subset can be empty); this\nsubset is often called a set of relevant labels, while the complement $\\mathcal{L} \\setminus \\mathcal{L}_x$ is considered as irrelevant\nfor $x$. We assume $m$ to be a large number (e.g., $10^5$), but the size of the set of relevant labels\n$\\mathcal{L}_x$ is much smaller than $m$, i.e., $|\\mathcal{L}_x| \\ll m$. We identify a set $\\mathcal{L}_x$ of relevant labels with a binary\n(sparse) vector $y = (y_1, y_2, \\ldots, y_m)$, in which $y_j = 1 \\Leftrightarrow j \\in \\mathcal{L}_x$. By $\\mathcal{Y} = \\{0, 1\\}^m$ we denote a set\nof all possible label vectors. We assume that observations $(x, y)$ are generated independently and\nidentically according to the probability distribution $P(X = x, Y = y)$ (denoted later by $P(x, y)$)\ndefined on $\\mathcal{X} \\times \\mathcal{Y}$.\nThe problem of XMLC can be defined as finding a classifier $h(x) = (h_1(x), h_2(x), \\ldots, h_m(x))$,\nwhich in general can be defined as a mapping $\\mathcal{X} \\to \\mathbb{R}^m$, that minimizes the expected loss (or risk):\n\n$$L_\\ell(h) = \\mathbb{E}_{(x,y) \\sim P(x,y)}\\big(\\ell(y, h(x))\\big) ,$$\n\nwhere $\\ell(y, \\hat{y})$ is the (task) loss. The optimal classifier, the so-called Bayes classifier, for a given loss\nfunction $\\ell$ is:\n\n$$h^*_\\ell = \\arg\\min_h L_\\ell(h) .$$\n\nThe regret of a classifier $h$ with respect to $\\ell$ is defined as:\n\n$$\\mathrm{reg}_\\ell(h) = L_\\ell(h) - L_\\ell(h^*_\\ell) = L_\\ell(h) - L^*_\\ell .$$\n\nThe regret quantifies the suboptimality of $h$ compared to the optimal classifier $h^*_\\ell$. The goal could be\nthen defined as finding $h$ with a small regret, ideally equal to zero.\nIn the following, we aim at estimating the marginal probabilities $\\eta_j(x) = P(y_j = 1 \\mid x)$. As we will\nshow below, marginal probabilities are a key element to optimally solve extreme classification for\nmany performance measures, like Hamming loss, macro-F measure, and precision@k.
To obtain the\nmarginal probability estimates one can use the label-wise log loss as a surrogate:\n\n$$\\ell_{\\log}(y, h(x)) = \\sum_{j=1}^{m} \\ell_{\\log}(y_j, h_j(x)) = -\\sum_{j=1}^{m} \\big(y_j \\log(h_j(x)) + (1 - y_j) \\log(1 - h_j(x))\\big) .$$\n\nThen the expected label-wise log loss for a single $x$ (i.e., the so-called conditional risk) is:\n\n$$\\mathbb{E}_y \\ell_{\\log}(y, h(x)) = \\sum_{j=1}^{m} \\mathbb{E}_y \\ell_{\\log}(y_j, h_j(x)) = \\sum_{j=1}^{m} L_{\\log}(h_j(x) \\mid x) .$$\n\nTherefore, it is easy to see that the pointwise optimal prediction for the $j$-th label is given by:\n\n$$h^*_j(x) = \\arg\\min_h L_{\\log}(h_j(x) \\mid x) = \\eta_j(x) .$$\n\nAs shown in (Dembczynski et al., 2010), the Hamming loss is minimized by $h^*_j(x) = [\\![\\eta_j(x) > 0.5]\\!]$.\nFor the macro F-measure it suffices in turn to find an optimal threshold on marginal probabilities for\neach label separately as proven in (Ye et al., 2012; Narasimhan et al., 2014; Jasinska et al., 2016;\nDembczynski et al., 2017). In the following, we will show a similar result for precision@k which has\nbecome a standard measure in extreme classification (although it is also often criticized, as it favors\nthe most frequent labels).\nPrecision@k can be formally defined as:\n\n$$\\text{precision@}k(y, x, h) = \\frac{1}{k} \\sum_{j \\in \\hat{\\mathcal{Y}}_k} [\\![y_j = 1]\\!] , \\qquad (1)$$\n\nwhere $\\hat{\\mathcal{Y}}_k$ is a set of $k$ labels predicted by $h$ for $x$. To be consistent with the former discussion, let us\ndefine a loss function for precision@k as $\\ell_{p@k} = 1 - \\text{precision@}k$. The conditional risk is then:$^1$\n\n$$L_{p@k}(h \\mid x) = \\mathbb{E}_y \\ell_{p@k}(y, x, h) = 1 - \\frac{1}{k} \\sum_{j \\in \\hat{\\mathcal{Y}}_k} \\eta_j(x) .$$\n\nThe above result shows that the optimal strategy for precision@k is to predict $k$ labels with the\nhighest marginal probabilities $\\eta_j(x)$.
As the main theoretical result given in this paper is a regret\nbound for precision@k, let us define here the conditional regret for this metric:\n\n$$\\mathrm{reg}_{p@k}(h \\mid x) = \\frac{1}{k} \\sum_{i \\in \\mathcal{Y}_k} \\eta_i(x) - \\frac{1}{k} \\sum_{j \\in \\hat{\\mathcal{Y}}_k} \\eta_j(x) ,$$\n\nwhere $\\mathcal{Y}_k$ is a set containing the top $k$ labels with respect to the true marginal probabilities.\nFrom the above results, we see that estimation of marginal probabilities is crucial for XMLC problems.\nTo obtain these probabilities we can use the vanilla 1-VS-ALL approach trained with the label-wise\nlog loss. Unfortunately, 1-VS-ALL is too costly in the extreme setting. In the following sections, we\ndiscuss an alternative approach based on the label trees that estimates the marginal probabilities with\ncompetitive accuracy, but in a much more efficient way.\n\n4 Hierarchical softmax approaches\n\nHierarchical softmax (HSM) is designed for multi-class classification. Using our notation, for multi-class\nproblems we have $\\sum_{i=1}^{m} y_i = 1$, i.e., there is one and only one label assigned to an instance\n$(x, y)$. The marginal probabilities $\\eta_j(x)$ in this case sum up to 1.\nThe HSM classifier $h(x)$ takes the form of a label tree. We encode all labels from $\\mathcal{L}$ using a prefix\ncode. Any such code can be given in the form of a tree in which a path from the root to a leaf node\ncorresponds to a code word. Under the coding, each label $y_j = 1$ is uniquely represented by a code\nword $z = (z_1, \\ldots, z_l) \\in \\mathcal{C}$, where $l$ is the length of the code word and $\\mathcal{C}$ is a set of all code words.\nFor $z_i \\in \\{0, 1\\}$, the code and the label tree are binary. In general, the code alphabet can contain more\nthan two symbols. Furthermore, the $z_i$ can take values from different sets of symbols depending on the\nprevious values in the code word. In other words, the code can result in nodes of a different arity\neven in the same tree, like in (Grave et al., 2017) and (Prabhu et al., 2018).
We will briefly discuss\ndifferent tree structures in Section 7.\nA tree node can be uniquely identified by the partial code word $z^i = (z_1, \\ldots, z_i)$. We denote the\nroot node by $z^0$, which is an empty vector (without any elements). The probability of a given label is\ndetermined by a sequence of decisions made by node classifiers that predict subsequent values of the\ncode word. By using the chain rule of probability, we obtain:\n\n$$\\eta_j(x) = P(y_j = 1 \\mid x) = P(z \\mid x) = \\prod_{i=1}^{l} P(z_i \\mid z^{i-1}, x) .$$\n\nBy using logistic loss and a linear model $f_{z^i}(x)$ in each node $z^i$ for estimating $P(z_i \\mid z^{i-1}, x)$, we\nobtain the popular formulation of HSM. Let us notice that since we deal here with a multi-class\ndistribution, we have that:\n\n$$\\sum_{c} P(z_i = c \\mid z^{i-1}, x) = 1 . \\qquad (2)$$\n\n$^1$The derivation is given in Appendix A.\n\nBecause of this normalization, we can assume that a multi-class (or binary in the case of binary\ntrees) classifier is situated in all internal nodes and there are no classifiers in the leaves of the tree.\nAlternatively, we can assume that each node, except the root, is associated with a binary classifier that\nestimates $P(z_i = c \\mid z^{i-1}, x)$, but then the additional normalization (2) has to be performed. This\nalternative formulation is important for the multi-label extension of HSM discussed in Section 6. In\neither way, learning of the node classifiers can be performed simultaneously as independent tasks.\nNote that the estimate $\\hat{\\eta}_j(x)$ of the probability of label $j$ can be easily obtained by traversing the tree\nalong the path indicated by the code of the label. Unfortunately, the task of predicting the top $k$ labels\nis more involved as it requires searching over the tree.
Popular solutions are beam search (Kumar\net al., 2013; Prabhu et al., 2018), uniform-cost search (Joulin et al., 2016), and its approximate\nvariant (Dembczynski et al., 2012, 2016).\n\n5 Suboptimality of HSM for multi-label classification\n\nTo deal with multi-label problems, some popular tools, such as FASTTEXT (Joulin et al., 2016) and its\nextension LEARNED TREE (Jernite et al., 2017), apply HSM with the pick-one-label heuristic which\nrandomly picks one of the positive labels from a given training instance. The resulting instance is then\ntreated as a multi-class instance. During prediction, the heuristic returns a multi-class distribution and\nthe $k$ most probable labels. We show below that this specific reduction of the multi-label problem to\nmulti-class classification is not consistent in general.\nSince the probability of picking a label $j$ from $y$ is equal to $y_j / \\sum_{j'=1}^{m} y_{j'}$, the pick-one-label heuristic\nmaps the multi-label distribution to a multi-class distribution in the following way:\n\n$$\\eta'_j(x) = P'(y_j = 1 \\mid x) = \\sum_{y \\in \\mathcal{Y}} \\frac{y_j}{\\sum_{j'=1}^{m} y_{j'}} P(y \\mid x) . \\qquad (3)$$\n\nIt can be easily checked that the resulting $\\eta'_j(x)$ form a multi-class distribution as the probabilities\nsum up to 1. It is obvious that the heuristic changes the marginal probabilities of labels, unless\nthe initial distribution is multi-class. Therefore this method cannot lead to consistent classifiers in\nterms of estimating $\\eta_j(x)$. As we show below, it is also not consistent for precision@k in general.\nProposition 1. A classifier $h$ such that $h_j(x) = \\eta'_j(x)$ for all $j \\in \\{1, \\ldots, m\\}$ has in general a\nnon-zero regret in terms of precision@k.\n\nProof. We prove the proposition by giving a simple counterexample.
Consider the following conditional\ndistribution for some $x$:\n\n$$P(y = (1, 0, 0) \\mid x) = 0.1 , \\quad P(y = (1, 1, 0) \\mid x) = 0.5 , \\quad P(y = (0, 0, 1) \\mid x) = 0.4 .$$\n\nThe optimal top-1 prediction for this example is obviously label 1, since the marginal probabilities are\n$\\eta_1(x) = 0.6$, $\\eta_2(x) = 0.5$, $\\eta_3(x) = 0.4$. However, the pick-one-label heuristic will transform the\noriginal distribution to the following one: $\\eta'_1(x) = 0.35$, $\\eta'_2(x) = 0.25$, $\\eta'_3(x) = 0.4$. The predicted\ntop label will then be label 3, giving a regret of 0.2 for precision@1.\n\nThe proposition shows that the heuristic is in general inconsistent for precision@k. Interestingly, the\nsituation changes when the labels are conditionally independent, i.e., $P(y \\mid x) = \\prod_{j=1}^{m} P(y_j \\mid x)$.\nProposition 2. Given conditionally independent labels, a classifier $h$ such that $h_j(x) = \\eta'_j(x)$ for\nall $j \\in \\{1, \\ldots, m\\}$ has zero regret in terms of the precision@k loss.\nProof. We show here only a sketch of the proof. The full proof is given in Appendix B. To prove\nthe proposition, it is enough to show that in the case of conditionally independent labels the pick-one-label\nheuristic does not change the order of marginal probabilities. Let $y_i$ and $y_j$ be such that\n$P(y_i = 1 \\mid x) \\geq P(y_j = 1 \\mid x)$. Then in the summation over all $y$ in (3), we are interested\nin four different subsets of $\\mathcal{Y}$, $S^{u,w}_{i,j} = \\{y \\in \\mathcal{Y} : y_i = u \\wedge y_j = w\\}$, where $u, w \\in \\{0, 1\\}$.\nRemark that during mapping none of $y \\in S^{0,0}_{i,j}$ plays any role, and for each $y \\in S^{1,1}_{i,j}$, the value\nof $y_t / (\\sum_{t'=1}^{m} y_{t'}) \\times P(y \\mid x)$, for $t \\in \\{i, j\\}$, is the same for both $y_i$ and $y_j$. Now, let $y' \\in S^{1,0}_{i,j}$\nand $y'' \\in S^{0,1}_{i,j}$ be the same on all elements except the $i$-th and the $j$-th one. Then, because\nof the label independence and the assumption that $P(y_i = 1 \\mid x) \\geq P(y_j = 1 \\mid x)$, we have\n$P(y' \\mid x) \\geq P(y'' \\mid x)$. Therefore, after mapping we obtain $\\eta'_i(x) \\geq \\eta'_j(x)$.
Thus, for independent\nlabels, the pick-one-label heuristic is consistent for precision@k.\n\n6 Probabilistic label trees\n\nThe section above has revealed that HSM cannot be properly adapted to multi-label problems by\nthe pick-one-label heuristic. There is, however, a different way to generalize HSM to obtain no-regret\nestimates of marginal probabilities $\\eta_j(x)$. The probabilistic label trees (PLTs) (Jasinska\net al., 2016) can be derived in the following way. Let us encode $y_j = 1$ by a slightly extended\ncode $z = (1, z_1, \\ldots, z_l)$ in comparison to HSM. The new code gets 1 at the zero position, which\ncorresponds to the question of whether there exists at least one label assigned to the example. As before,\neach node is uniquely identified by a partial code $z^i$ which says that there is at least one positive\nlabel in a subtree rooted in that node. It can be easily shown by the chain rule of probability that the\nmarginal probabilities can be expressed in the following way:\n\n$$\\eta_j(x) = P(z \\mid x) = \\prod_{i=0}^{l} P(z_i \\mid z^{i-1}, x) . \\qquad (4)$$\n\nThe difference to HSM is the probability $P(z_0 = 1 \\mid x)$ in the chain and a different normalization,\ni.e.:\n\n$$\\sum_{c} P(z_i = c \\mid z^{i-1}, x) \\geq 1 . \\qquad (5)$$\n\nOnly for $z_0$ we have $P(z_0 = 1 \\mid x) + P(z_0 = 0 \\mid x) = 1$. Because of (5), the binary models that\nestimate $P(z_i = c \\mid z^{i-1}, x)$ (against $P(z_i \\neq c \\mid z^{i-1}, x)$) are situated in all nodes of the tree (i.e.,\nalso in the leaves). The models can be trained independently as before for HSM. Only during\nprediction, one can re-calibrate the estimates when (5) is not satisfied, for example, by normalizing\nthem to sum up to 1. It can be easily noticed that for a multi-class distribution, the resulting model\nof PLTs boils down to HSM, since $P(z_0 = 1 \\mid x)$ is always equal to 1, and in addition, normalization\n(5) will take the form of (2). In Appendix D we additionally present the pseudocode of training and\npredicting with PLTs.\nNext, we show that the PLT model obeys strong theoretical guarantees.
Let us first recall the result\nfrom (Jasinska et al., 2016) that relates the absolute difference between the true and the estimated\nmarginal probability of label $j$, $|\\eta_j(x) - \\hat{\\eta}_j(x)|$, to the surrogate loss $\\ell$ used to train node classifiers\n$f_{z^i}$. It is assumed here that $\\ell$ is a strongly proper composite loss (e.g., logistic, exponential, or squared\nloss) characterized by a constant $\\lambda$ (e.g., $\\lambda = 4$ for logistic loss).$^2$\nTheorem 1. For any distribution $P$ and internal node classifiers $f_{z^i}$, the following holds:\n\n$$|\\eta_j(x) - \\hat{\\eta}_j(x)| \\leq \\sum_{i=0}^{l} P(z^{i-1} \\mid x) \\sqrt{\\frac{2}{\\lambda}} \\sqrt{\\mathrm{reg}_\\ell(f_{z^i} \\mid z^{i-1}, x)} ,$$\n\nwhere $\\mathrm{reg}_\\ell(f_{z^i} \\mid z^{i-1}, x)$ is a binary classification regret for a strongly proper composite loss $\\ell$ and\n$\\lambda$ is a constant specific for loss $\\ell$.\nDue to filtering of the distribution imposed by the PLT, the regret $\\mathrm{reg}_\\ell(f_{z^i} \\mid z^{i-1}, x)$ of a classifier\n$f_{z^i}$ exists only for $x$ such that $P(z^{i-1} \\mid x) > 0$, therefore we condition the regret not only on $x$, but\nalso on $z^{i-1}$. The above result shows that the absolute error of estimating the marginal probability of\nlabel $j$ can be upper bounded by the regret of the node classifiers on the corresponding path from\nthe root to a leaf. The proof of Theorem 1 is given in Appendix A. Moreover, for zero-regret (i.e.,\noptimal) node classifiers we obtain an optimal multi-label classifier in terms of estimation of marginal\nprobabilities $\\eta_j(x)$. This result can be further extended for precision@k.\nTheorem 2.
For any distribution $P$ and classifier $h$ delivering estimates $\\hat{\\eta}_j(x)$ of the marginal\nprobabilities of labels, the following holds:\n\n$$\\mathrm{reg}_{p@k}(h \\mid x) = \\frac{1}{k} \\sum_{i \\in \\mathcal{Y}_k} \\eta_i(x) - \\frac{1}{k} \\sum_{j \\in \\hat{\\mathcal{Y}}_k} \\eta_j(x) \\leq 2 \\max_{l} |\\eta_l(x) - \\hat{\\eta}_l(x)| .$$\n\n$^2$For a more detailed introduction to strongly proper composite losses, we refer the reader to (Agarwal, 2014).\n\nThe proof is based on adding and subtracting the following terms $\\frac{1}{k} \\sum_{i \\in \\mathcal{Y}_k} \\hat{\\eta}_i(x)$ and $\\frac{1}{k} \\sum_{j \\in \\hat{\\mathcal{Y}}_k} \\hat{\\eta}_j(x)$\nto the regret (a detailed proof is given in Appendix A). By combining both theorems we get an\nupper bound of the precision@k regret expressed in terms of the regret of the node classifiers. Again,\nfor the zero-regret node classifiers, we get an optimal solution in terms of precision@k.\n\n7 Implementation details of PLTs\n\nGiven the tree structure, the node classifiers of PLTs can be trained as logistic regression either in\nonline (Jasinska et al., 2016) or batch mode (Prabhu et al., 2018). Both training modes have their pros\nand cons, but the online implementation makes it possible to learn a more complex representation\nof input instances. The above cited implementations are both based on sparse representation, given\neither in the form of a bag-of-words or its TF-IDF variant. We opt here for training a PLT in the online\nmode along with the dense representation. We build our implementation upon FASTTEXT and refer\nto it as XT which stands for EXTREMETEXT.$^3$ In this way, we succeeded in obtaining a very powerful\nand compressed model. The small dense models are important for fast online prediction as they\ndo not need many resources. The sparse models, in turn, can be slow and expensive in terms\nof memory usage as they need to decompress the node models to work fast.
Note also that, in\ngeneral, PLTs can be used as an output layer of any neural network architecture (also the one used\nin XML-CNN (Liu et al., 2017)) to speed up training and prediction time.\nIn contrast to the original implementation of FASTTEXT, we use L2 regularization for all parameters\nof the model. To obtain a representation of input instances we do not compute simple averages of\nthe feature vectors, but use weights proportional to the TF-IDF scores of features. Competitive\nresults can be obtained with feature and instance vectors of size 500. If a node classification task\ncontains only positive instances, we use a constant classifier predicting 1 without any training. The\ntraining of PLT in either mode, online or batch, can be easily parallelized as each node classifier can\nbe trained in isolation from the other classifiers. In our current implementation, however, we follow\nthe parallelization on the level of training and test instances as in the original FASTTEXT.\nOur implementation, because of the additional use of the L2 regularization, has more hyperparameters\nthan the original FASTTEXT. We have found, however, that our model is remarkably robust to\nhyperparameter selection, since it achieves close to optimal performance for a large set of\nhyperparameters that is in the vicinity of the optimal one. Moreover, the optimal hyperparameters are close\nto each other across all datasets. We report more information about the hyperparameter selection in\nAppendix E.4.\nThe tree structure of a PLT is a crucial modeling decision. The vanishing regret for probability\nestimates and precision@k holds regardless of the tree structure (see Theorems 1 and 2); however, this\ntheory requires the regret of the node classifiers also to vanish. In practice, we can only estimate the\nconditional probabilities in the nodes, therefore the tree structure does indeed matter as it affects the\ndifficulty of the node learning problems.
The original PLT paper (Jasinska et al., 2016) uses simple\ncomplete trees with labels assigned to leaves according to their frequencies. Another option, routinely\nused in HSM (Joulin et al., 2016), is the Huffman tree built over the label frequencies. Such a tree takes\ninto account the computational complexity by putting the most frequent labels close to the root. This\napproach has been further extended to optimize GPU operations in (Grave et al., 2017). Unfortunately,\nit ignores the statistical properties of the tree structure. Furthermore, for the multi-label case the Huffman\ntree is no longer optimal even in terms of computational cost, as we show in Appendix C. There\nexist, however, other methods that focus on building a tree with high overall accuracy (Tagami, 2017;\nPrabhu et al., 2018). In our work, we follow the latter approach, which performs a simple top-down\nhierarchical clustering. Each label in this approach is represented by a profile vector being an average\nof the training vectors tagged by this label. Then the profile vectors are clustered using balanced\nk-means which divides the labels into two or more clusters with approximately the same size. This\nprocedure is then repeated recursively until the clusters are smaller than a given value (for example,\n100). The nodes of the resulting tree are then of different arities. The internal nodes up to the leaves\u2019\nparent nodes have k children, but the leaves\u2019 parent nodes are usually of higher arity. Thanks to this\nclustering, similar labels are close to each other in the tree. Moreover, the tree is balanced, so its\ndepth is logarithmic in terms of the number of labels.\n\n$^3$Implementation of XT is available at https://github.com/mwydmuch/extremeText.\n\n8 Empirical results\n\nWe carried out three sets of experiments. In the first, we compare exhaustively the performance of\nPLTs and HSM on synthetic and benchmark data.
Due to lack of space, the results are deferred to\nAppendix E.1 and E.2. The results on synthetic data confirm our theoretical findings: the models\nare the same in the case of multi-class data, the performance of HSM and PLTs is on par using\nmulti-label data with independent labels, and PLTs significantly outperform HSM on multi-label\ndata with conditionally dependent labels. The results on the benchmark data clearly indicate the\nbetter performance of PLTs over HSM.\nIn the second experiment, we compare XT, the variant of PLTs discussed in the previous section, to\nthe state-of-the-art algorithms on five benchmark datasets taken from the XMLC repository,$^4$ and their text\nequivalents, by courtesy of Liu et al. (2017). We compare the models in terms of precision@{1, 3, 5},\nmodel size, training and test time. The competitors for our XT are the original FASTTEXT, its variant\nLEARNED TREE, a PLT-like batch learning algorithm PARABEL (we use the variant that uses\na single tree instead of an ensemble), an XMLC-designed convolutional deep network XML-CNN,\na decision tree ensemble FASTXML, and two 1-vs-All approaches tailored to XMLC problems,\nPPDSPARSE and DISMEC. The hyperparameters of the models have been tuned using grid search.\nThe range of the hyperparameters is reported in Appendix E.4.\nThe results presented in Table 1 demonstrate that XT outperforms the HSM approaches with the\npick-one-label heuristic, namely FASTTEXT and LEARNED TREE, by a large margin. This empirically confirms\nthe superiority of PLTs as the proper generalization of HSM to the multi-label setting. In all the above\nmethods we use vectors of length 500 and we tune the other hyperparameters appropriately for a fair\ncomparison.\nMoreover, XT scales well to extreme datasets, achieving performance close to the state-of-the-art,\nbeing at the same time 10000x and 100x faster than DISMEC and PPDSPARSE, respectively, during\nprediction.
XT always responds in under 2ms, which makes it a competitive alternative for an online\nsetting. XT is also close to PARABEL in terms of performance. However, the reported times and\nmodel sizes of PARABEL are given for batch prediction. Its prediction times seem to be lower,\nbut PARABEL needs to decompress the model during prediction, which makes it less suitable for online\nprediction. It is only efficient when the batches are sufficiently large. Finally, we would like to\nunderline that XT outperforms XML-CNN, the more complex neural network, in terms of predictive\nperformance with computational costs that are an order of magnitude smaller. Moreover, XML-CNN\nrequires pretrained embedding vectors, whereas XT can be used with random initialization.\nIn the third experiment we perform an ablation analysis in which we compare different components\nof the XT algorithm. We analyze the influence of the Huffman tree vs. top-down clustering, the\nsimple averaging of feature vectors vs. the TF-IDF-based weighting, and no regularization vs.\nL2 regularization. Figure 1 clearly shows that the components need to be combined to\nobtain the best results. The best combination uses top-down clustering, TF-IDF-based weighting,\nand L2 regularization, while top-down clustering alone gets worse results than Huffman trees with\nTF-IDF-based weighting and L2 regularization. In Appendix E.3 we give more detailed results of the\nablation analysis performed on a larger spectrum of benchmark datasets.\n\n9 Conclusions\n\nIn this paper we have proven that probabilistic label trees (PLTs) are a no-regret generalization of HSM\nto the multi-label setting. 
Our main theoretical contribution is the precision@k regret bound for PLTs.\nMoreover, we have shown that the pick-one-label heuristic commonly used with HSM in multi-label\nproblems leads to inconsistent results in terms of marginal probability estimation and precision@k.\nOur implementation of PLTs, referred to as XT and built upon FASTTEXT, gets state-of-the-art results,\nbeing significantly better than the original FASTTEXT, LEARNED TREE, and XML-CNN. The XT\nresults are also close to the best known ones, which are obtained by expensive 1-vs-All approaches such\nas PPDSPARSE and DISMEC, and outperform the other tree-based methods on many benchmarks.\nOur online variant has the advantage of very often producing much smaller models that can be\nefficiently used in fast online prediction.\n\n4 Additional statistics of these datasets are also included in Appendix F. Address of the XMLC repository:\nhttp://manikvarma.org/downloads/XC/XMLRepository.html\n\n
Table 1: Precision@k scores with k = {1, 3, 5} and statistics of FASTXML, PPDSPARSE, DISMEC,\nPARABEL (with 1 tree), FASTTEXT (FT), LEARNED TREE (LT), EXTREMETEXT (XT) and XML-CNN\nmethods. Notation: N \u2013 number of samples, T \u2013 CPU time, m \u2013 number of labels, d \u2013 number of features, * \u2013\nresult of offline prediction, \u22c6 \u2013 calculated on GPU, \u2020 \u2013 not reported by authors, \u2021 \u2013 cannot be calculated due to\nlack of a text version of a dataset.\n\n
Metrics      FASTXML    PPDSPARSE  DISMEC     FT         LT         XT         PARABEL    XML-CNN\n\n
Wiki-30K (Ntrain = 14146, Ntest = 6616, d = 101938, m = 30938)\n
P@1          82.03      73.80      85.20      80.78      80.85      85.23      83.77      82.78\n
P@3          67.47      60.90      74.60      50.46      50.59      73.18      71.96      66.34\n
P@5          57.76      50.40      65.90      36.79      37.68      63.39      62.44      56.23\n
Ttrain       16m        \u2020          \u2020          10m        12m        18m        5m         88m\u22c6\n
Ttest/Ntest  3.00ms     \u2020          \u2020          1.88ms     10.09ms    0.83ms     1.63ms*    1.39ms\u22c6\n
model size   354M       \u2020          \u2020          513M       513M       259M       109M*      \u22c6\n\n
Delicious-200K (Ntrain = 196606, Ntest = 100095, d = 782585, m = 205443)\n
P@1          42.81      45.05      44.71      42.22      42.71      47.85      43.32      \u2021\n
P@3          38.76      38.34      38.08      37.90      36.27      42.08      38.49      \u2021\n
P@5          36.34      34.90      34.7       35.05      33.43      39.13      35.83      \u2021\n
Ttrain       458m       4781m      1080h      271m       563m       502m       105m       \u2021\n
Ttest/Ntest  4.86ms     275ms      5m         1.97ms     1.98ms     1.41ms     1.31ms*    \u2021\n
model size   15.4G      9.4G       18.0G      9.0G       9.0G       1.9G       1.8G*      \u2021\n\n
WikiLSHTC (Ntrain = 1778351, Ntest = 587084, d = 617899, m = 325056)\n
P@1          49.35      64.08      64.94      41.13      50.15      58.73      61.53      \u2021\n
P@3          32.69      41.26      42.71      24.09      31.95      39.24      40.07      \u2021\n
P@5          24.03      30.12      31.5       17.44      23.59      29.26      29.25      \u2021\n
Ttrain       724m       236m       750h       207        212m       550m       34m        \u2021\n
Ttest/Ntest  2.17ms     37.76ms    43m        1.25ms     4.76ms     0.81ms     0.92ms*    \u2021\n
model size   9.3G       5.2G       3.8G       6.5G       6.5G       3.3G       1.1G*      \u2021\n\n
Wiki-500K (Ntrain = 1813391, Ntest = 783743, d = 2381304, m = 501070)\n
P@1          54.10      70.16      70.20      32.73      37.18      64.48      66.12      59.85\n
P@3          29.45      50.57      50.60      19.02      21.62      45.84      47.02      39.28\n
P@5          21.21      39.66      39.70      14.46      16.01      35.46      36.45      29.81\n
Ttrain       3214m      1771m      7495h      496m       531m       1253m      168m       7032m\u22c6\n
Ttest/Ntest  8.03ms     113.70ms   155m       2.05ms     6.43ms     1.07ms     4.68ms*    21.06ms\u22c6\n
model size   63G        3.4G       14.7G      11G        11G        5.5G       2.0G*      3.7G\u22c6\n\n
Amazon-670K (Ntrain = 490449, Ntest = 153025, d = 135909, m = 670091)\n
P@1          34.24      45.32      45.37      25.47      27.67      39.90      41.59      35.39\n
P@3          29.30      40.37      40.40      21.47      20.96      35.36      37.18      33.74\n
P@5          26.12      36.92      36.96      18.61      17.72      32.04      33.85      32.64\n
Ttrain       422m       102m       373h       162m       182m       241m       8m         3134m\u22c6\n
Ttest/Ntest  3.39ms     66.09ms    23m        7.84ms     5.13ms     1.72ms     0.68ms*    16.18ms\u22c6\n
model size   10G        6.0G       3.8G       3.2G       3.2G       1.5G       0.7G*      1.5G\u22c6\n\n
[Figure 1: bar chart; y-axis: precision@1 (0 to 60); series: WikiLSHTC, Wiki-500K, Amazon-670K; bars: Huffman tree, Huffman +L2, Huffman +TF-IDF, Huffman +TF-IDF+L2, top-down clustering, top-down +L2, top-down +TF-IDF, top-down +TF-IDF+L2]\n\nFigure 1: The ablation analysis of different variants of XT on WIKILSHTC, WIKI-500K, and\nAMAZON-670K.\n\n
Acknowledgements\n\nThe work of Kalina Jasinska was supported by the Polish National Science Center under grant no.\n2017/25/N/ST6/00747. The work of Krzysztof Dembczy\u0144ski was supported by the Polish Ministry\nof Science and Higher Education under grant no. 09/91/DSPB/0651. Computational experiments\nhave been performed in Poznan Supercomputing and Networking Center.\n\n
References\n\nAgarwal, Shivani. Surrogate regret bounds for bipartite ranking via strongly proper losses. Journal\nof Machine Learning Research, 15(1):1653\u20131674, 2014.\n\nAgrawal, Rahul, Gupta, Archit, Prabhu, Yashoteja, and Varma, Manik. Multi-label learning with\nmillions of labels: Recommending advertiser bid phrases for web pages. In 22nd International\nWorld Wide Web Conference, WWW \u201913, Rio de Janeiro, Brazil, May 13-17, 2013, pp. 13\u201324.\nInternational World Wide Web Conferences Steering Committee / ACM, 2013.\n\nBabbar, Rohit and Sch\u00f6lkopf, Bernhard. DiSMEC: Distributed Sparse Machines for Extreme Multi-\nlabel Classification. 
In Proceedings of the Tenth ACM International Conference on Web Search\nand Data Mining, WSDM 2017, Cambridge, United Kingdom, February 6-10, 2017, pp. 721\u2013729.\nACM, 2017.\n\nBengio, Samy, Weston, Jason, and Grangier, David. Label Embedding Trees for Large Multi-Class\nTasks. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on\nNeural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010,\nVancouver, British Columbia, Canada, pp. 163\u2013171. Curran Associates, Inc., 2010.\n\nBeygelzimer, Alina, Langford, John, Lifshits, Yury, Sorkin, Gregory B., and Strehl, Alexander L.\nConditional Probability Tree Estimation Analysis and Algorithms. In UAI 2009, Proceedings of\nthe Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June\n18-21, 2009, pp. 51\u201358. AUAI Press, 2009a.\n\nBeygelzimer, Alina, Langford, John, and Ravikumar, Pradeep. Error-Correcting Tournaments. In\nAlgorithmic Learning Theory, 20th International Conference, ALT 2009, Porto, Portugal, October\n3-5, 2009. Proceedings, volume 5809 of Lecture Notes in Computer Science, pp. 247\u2013262. Springer,\n2009b.\n\nBhatia, Kush, Jain, Himanshu, Kar, Purushottam, Varma, Manik, and Jain, Prateek. Sparse Local\nEmbeddings for Extreme Multi-label Classification. In Advances in Neural Information Processing\nSystems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12,\n2015, Montreal, Quebec, Canada, pp. 730\u2013738, 2015.\n\nChoromanska, Anna and Langford, John. Logarithmic Time Online Multiclass prediction. In\nAdvances in Neural Information Processing Systems 28: Annual Conference on Neural Information\nProcessing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 55\u201363, 2015.\n\nDekel, Ofer and Shamir, Ohad. 
Multiclass-Multilabel Classification with More Classes than Examples.\nIn Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics,\nAISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010, volume 9 of JMLR\nProceedings, pp. 137\u2013144. JMLR.org, 2010.\n\nDembczynski, Krzysztof, Cheng, Weiwei, and H\u00fcllermeier, Eyke. Bayes Optimal Multilabel\nClassification via Probabilistic Classifier Chains. In Proceedings of the 27th International Conference on\nMachine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 279\u2013286. Omnipress, 2010.\n\nDembczynski, Krzysztof, Waegeman, Willem, and H\u00fcllermeier, Eyke. An Analysis of Chaining in\nMulti-Label Classification. In ECAI 2012 - 20th European Conference on Artificial Intelligence.\nIncluding Prestigious Applications of Artificial Intelligence (PAIS-2012) System Demonstrations\nTrack, Montpellier, France, August 27-31, 2012, volume 242 of Frontiers in Artificial Intelligence\nand Applications, pp. 294\u2013299. IOS Press, 2012.\n\nDembczynski, Krzysztof, Kotlowski, Wojciech, Waegeman, Willem, Busa-Fekete, R\u00f3bert, and\nH\u00fcllermeier, Eyke. Consistency of Probabilistic Classifier Trees. In Machine Learning and\nKnowledge Discovery in Databases - European Conference, ECML PKDD 2016, Riva del Garda,\nItaly, September 19-23, 2016, Proceedings, Part II, volume 9852 of Lecture Notes in Computer\nScience, pp. 511\u2013526. Springer, 2016.\n\nDembczynski, Krzysztof, Kotlowski, Wojciech, Koyejo, Oluwasanmi, and Natarajan, Nagarajan.\nConsistency Analysis for Binary Classification Revisited. In Proceedings of the 34th International\nConference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017,\nvolume 70 of Proceedings of Machine Learning Research, pp. 961\u2013969. PMLR, 2017.\n\nDeng, Jia, Satheesh, Sanjeev, Berg, Alexander C., and Li, Fei-Fei. 
Fast and Balanced: Efficient\nLabel Tree Learning for Large Scale Object Recognition. In Advances in Neural Information\nProcessing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011.\nProceedings of a meeting held 12-14 December 2011, Granada, Spain, pp. 567\u2013575, 2011.\n\nFan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, Xiang-Rui, and Lin, Chih-Jen. LIBLINEAR:\nA Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871\u20131874,\n2008.\n\nFox, John. Applied regression analysis, linear models, and related methods. Sage, 1997.\n\nGrave, Edouard, Joulin, Armand, Ciss\u00e9, Moustapha, Grangier, David, and J\u00e9gou, Herv\u00e9. Efficient\nsoftmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine\nLearning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of\nMachine Learning Research, pp. 1302\u20131310, International Convention Centre, Sydney, Australia,\n2017. PMLR.\n\nJasinska, Kalina, Dembczynski, Krzysztof, Busa-Fekete, R\u00f3bert, Pfannschmidt, Karlson, Klerx, Timo,\nand H\u00fcllermeier, Eyke. Extreme F-measure Maximization using Sparse Probability Estimates. In\nProceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York\nCity, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp.\n1435\u20131444. JMLR.org, 2016.\n\nJernite, Yacine, Choromanska, Anna, and Sontag, David. Simultaneous Learning of Trees and\nRepresentations for Extreme Classification and Density Estimation. In Proceedings of the 34th\nInternational Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August\n2017, volume 70 of Proceedings of Machine Learning Research, pp. 1665\u20131674. PMLR, 2017.\n\nJoulin, Armand, Grave, Edouard, Bojanowski, Piotr, and Mikolov, Tomas. Bag of Tricks for Efficient\nText Classification. 
CoRR, abs/1607.01759, 2016.\n\nKumar, Abhishek, Vembu, Shankar, Menon, Aditya Krishna, and Elkan, Charles. Beam search\nalgorithms for multilabel learning. Machine Learning, 92:65\u201389, 2013.\n\nKurzynski, Marek. On the multistage Bayes classifier. Pattern Recognition, 21(4):355\u2013365, 1988.\n\nLangford, John, Strehl, Alex, and Li, Lihong. Vowpal Wabbit, 2007. http://hunch.net/~vw/.\n\nLi, Chun-Liang and Lin, Hsuan-Tien. Condensed Filter Tree for Cost-Sensitive Multi-Label\nClassification. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014,\nBeijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pp.\n423\u2013431. JMLR.org, 2014.\n\nLiu, Jingzhou, Chang, Wei-Cheng, Wu, Yuexin, and Yang, Yiming. Deep Learning for Extreme\nMulti-label Text Classification. In Proceedings of the 40th International ACM SIGIR Conference\non Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11,\n2017, pp. 115\u2013124. ACM, 2017.\n\nMikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Gregory S., and Dean, Jeffrey. Distributed\nRepresentations of Words and Phrases and their Compositionality. In Advances in Neural Information\nProcessing Systems 26: 27th Annual Conference on Neural Information Processing Systems\n2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pp.\n3111\u20133119, 2013.\n\nMineiro, Paul and Karampatziakis, Nikos. Fast Label Embeddings via Randomized Linear Algebra.\nIn Machine Learning and Knowledge Discovery in Databases - European Conference, ECML\nPKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part I, volume 9284 of Lecture\nNotes in Computer Science, pp. 37\u201351. Springer, 2015.\n\nMorin, Frederic and Bengio, Yoshua. Hierarchical Probabilistic Neural Network Language Model. 
In\nProceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, AISTATS\n2005, Bridgetown, Barbados, January 6-8, 2005. Society for Artificial Intelligence and Statistics,\n2005.\n\nNarasimhan, Harikrishna, Vaish, Rohit, and Agarwal, Shivani. On the Statistical Consistency of\nPlug-in Classifiers for Non-decomposable Performance Measures. In Advances in Neural Information\nProcessing Systems 27: Annual Conference on Neural Information Processing Systems 2014,\nDecember 8-13 2014, Montreal, Quebec, Canada, pp. 1493\u20131501, 2014.\n\nNiculescu-Mizil, Alexandru and Abbasnejad, Ehsan. Label Filters for Large Scale Multilabel\nClassification. In Proceedings of the 20th International Conference on Artificial Intelligence and\nStatistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54 of Proceedings\nof Machine Learning Research, pp. 1448\u20131457. PMLR, 2017.\n\nPrabhu, Yashoteja and Varma, Manik. FastXML: a fast, accurate and stable tree-classifier for extreme\nmulti-label learning. In The 20th ACM SIGKDD International Conference on Knowledge Discovery\nand Data Mining, KDD \u201914, New York, NY, USA - August 24 - 27, 2014, pp. 263\u2013272. ACM, 2014.\n\nPrabhu, Yashoteja, Kag, Anil, Harsola, Shrutendra, Agrawal, Rahul, and Varma, Manik. Parabel:\nPartitioned Label Trees for Extreme Classification with Application to Dynamic Search Advertising.\nIn Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon,\nFrance, April 23-27, 2018, pp. 993\u20131002. ACM, 2018.\n\nShrivastava, Anshumali and Li, Ping. Improved Asymmetric Locality Sensitive Hashing (ALSH)\nfor Maximum Inner Product Search (MIPS). In Proceedings of the Thirty-First Conference on\nUncertainty in Artificial Intelligence, UAI 2015, July 12-16, 2015, Amsterdam, The Netherlands,\npp. 812\u2013821. AUAI Press, 2015.\n\nTagami, Yukihiro. 
AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-label\nClassification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge\nDiscovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pp. 455\u2013464. ACM, 2017.\n\nVijayanarasimhan, Sudheendra, Shlens, Jonathon, Monga, Rajat, and Yagnik, Jay. Deep Networks\nWith Large Output Spaces. CoRR, abs/1412.7479, 2014.\n\nYe, Nan, Chai, Kian Ming Adam, Lee, Wee Sun, and Chieu, Hai Leong. Optimizing F-measure:\nA Tale of Two Approaches. In Proceedings of the 29th International Conference on Machine\nLearning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012. icml.cc / Omnipress,\n2012.\n\nYen, Ian En-Hsu, Huang, Xiangru, Dai, Wei, Ravikumar, Pradeep, Dhillon, Inderjit S., and Xing,\nEric P. PPDsparse: A Parallel Primal-Dual Sparse Method for Extreme Classification. In\nProceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data\nMining, Halifax, NS, Canada, August 13 - 17, 2017, pp. 545\u2013553. ACM, 2017.\n\nYu, Hsiang-Fu, Jain, Prateek, Kar, Purushottam, and Dhillon, Inderjit S. Large-scale Multi-label\nLearning with Missing Labels. In Proceedings of the 31st International Conference on Machine\nLearning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and\nConference Proceedings, pp. 593\u2013601. JMLR.org, 2014.\n", "award": [], "sourceid": 3135, "authors": [{"given_name": "Marek", "family_name": "Wydmuch", "institution": "Poznan University of Technology"}, {"given_name": "Kalina", "family_name": "Jasinska", "institution": "Allegro.pl"}, {"given_name": "Mikhail", "family_name": "Kuznetsov", "institution": "Yahoo! Research"}, {"given_name": "R\u00f3bert", "family_name": "Busa-Fekete", "institution": "Yahoo! Research"}, {"given_name": "Krzysztof", "family_name": "Dembczynski", "institution": "Poznan University of Technology"}]}