{"title": "Label Embedding Trees for Large Multi-Class Tasks", "book": "Advances in Neural Information Processing Systems", "page_first": 163, "page_last": 171, "abstract": "Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster.", "full_text": "Label Embedding Trees for Large Multi-Class Tasks\n\nSamy Bengio(1)\n\nJason Weston(1) David Grangier(2)\n\n(1) Google Research, New York, NY\n\n{bengio, jweston}@google.com\n(2)NEC Labs America, Princeton, NJ\n{dgrangier}@nec-labs.com\n\nAbstract\n\nMulti-class classi\ufb01cation becomes challenging at test time when the number of\nclasses is very large and testing against every possible class can become compu-\ntationally infeasible. This problem can be alleviated by imposing (or learning)\na structure over the set of classes. We propose an algorithm for learning a tree-\nstructure of classi\ufb01ers which, by optimizing the overall tree loss, provides superior\naccuracy to existing tree labeling methods. We also propose a method that learns\nto embed labels in a low dimensional space that is faster than non-embedding ap-\nproaches and has superior accuracy to existing embedding approaches. 
Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods, including One-vs-Rest, while being orders of magnitude faster.\n\n1 Introduction\n\nDatasets available for prediction tasks are growing over time, resulting in increasing scale in all their measurable dimensions: separate from the issue of the growing number of examples m and features d, they are also growing in the number of classes k. Current multi-class applications such as web advertising [6], textual document categorization [11] or image annotation [12] have tens or hundreds of thousands of classes, and these datasets are still growing. This evolution is challenging traditional approaches [1] whose test time grows at least linearly with k.\n\nAt training time, a practical constraint is that learning should be feasible, i.e. it should not take more than a few days, and it must work within the memory and disk space requirements of the available hardware. Most algorithms' training time, at best, increases linearly with m, d and k; algorithms that are quadratic or worse with respect to m or d are usually discarded by practitioners working on real large scale tasks. At testing time, depending on the application, very specific time constraints apply, usually measured in milliseconds, for example when a real-time response is required or a large number of records need to be processed. Moreover, memory usage restrictions may also apply. Classical approaches such as One-vs-Rest are at least O(kd) in both speed (of testing a single example) and memory. This is prohibitive for large scale problems [6, 12, 26].\n\nIn this work, we focus on algorithms whose classification speed is sublinear in k at test time and which have limited dependence on d, with best-case complexity O(de(log k + d)) where de ≪ d and de ≪ k. 
In experiments we observe no loss in accuracy compared to methods that are O(kd); further, memory consumption is reduced from O(kd) to O(kde). Our approach rests on two main ideas. Firstly, an algorithm for learning a label tree: each node makes a prediction of the subset of labels to be considered by its children, thus decreasing the number of candidate labels at a logarithmic rate until a prediction is reached. We provide a novel algorithm that learns both the sets of labels at each node and the predictors at the nodes so as to optimize the overall tree loss, and show that this approach is superior to existing tree-based approaches [7, 6], which typically lose accuracy compared to O(kd) approaches. Balanced label trees have O(d log k) complexity as the predictor at each node is still linear in d.\n\nAlgorithm 1 Label Tree Prediction Algorithm\n\nInput: test example x, parameters T.\nLet s = 0. - Start at the root node\nrepeat\n  Let s = argmax_{c:(s,c)∈E} fc(x). - Traverse to the most confident child.\nuntil |ℓs| = 1 - Until this uniquely defines a single label.\nReturn ℓs.\n\nOur second main idea is to learn an embedding of the labels into a space of dimension de that again still optimizes the overall tree loss. Hence, at test time we are required to: (1) map the test example into the label embedding space with cost O(dde) and then (2) predict using the label tree, resulting in our overall cost O(de(log k + d)). We also show that our label embedding approach outperforms other recently proposed label embedding approaches such as compressed sensing [17].\n\nThe rest of the paper is organized as follows. Label trees are discussed and label tree learning algorithms are proposed in Section 2. Label embeddings are presented in Section 3. Related prior work is presented in Section 4. An experimental study on three large tasks is given in Section 5 showing the good performance of our proposed techniques. 
Finally, Section 6 concludes.\n\n2 Label Trees\n\nA label tree is a tree T = (N, E, F, L) with n + 1 indexed nodes N = {0, . . . , n}, a set of edges E = {(p1, c1), . . . , (p|E|, c|E|)} which are ordered pairs of parent and child node indices, label predictors F = {f1, . . . , fn} and label sets L = {ℓ0, . . . , ℓn} associated to each node. The root node is labeled with index 0. The edges E are such that all other nodes have exactly one parent, but they can have an arbitrary number of children (in all cases |E| = n). The label sets indicate the set of labels to which a point should belong if it arrives at the given node, and progress from generic to specific along the tree, i.e. the root label set contains all classes, |ℓ0| = k, and each child label set is a subset of its parent label set, with ℓp = ∪_{(p,c)∈E} ℓc. We differentiate between disjoint label trees, where there are only k leaf nodes, one per class, and hence any two nodes i and j at the same depth cannot share any labels, ℓi ∩ ℓj = ∅, and joint label trees that can have more than k leaf nodes.\n\nClassifying an example with the label tree is achieved by applying Algorithm 1. Prediction begins at the root node (s = 0) and for each edge leading to a child (s, c) ∈ E one computes the score of the label predictor fc(x), which predicts whether the example x belongs to the set of labels ℓc. One takes the most confident prediction, traverses to that child node, and then repeats the process. Classification is complete when one arrives at a node that identifies only a single label, which is the predicted class.\n\nInstances of label trees have been used in the literature before with various methods for choosing the parameters (N, E, F, L). 
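The traversal in Algorithm 1 is straightforward to implement. The following is a minimal, hypothetical sketch (the dictionary layout for E, F and L and the function name are our own assumptions; the paper does not release code):

```python
# Hypothetical sketch of Algorithm 1 (label tree prediction).
# E[s] lists the children of node s, F[c] is the score function f_c,
# and L[c] is the label set ell_c of node c.
def predict(x, E, F, L):
    s = 0  # start at the root node
    while len(L[s]) > 1:
        # traverse to the most confident child
        s = max(E[s], key=lambda c: F[c](x))
    return next(iter(L[s]))  # the single remaining label
```

For a balanced tree this performs O(log k) score evaluations per example instead of the k evaluations required by One-vs-Rest.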
Due to the difficulty of learning, many methods make approximations such as a random choice of E and an optimization of F that does not take into account the overall loss of the entire system, leading to suboptimal performance (see [7] for a discussion). Our goal is to provide an algorithm that learns these parameters so as to optimize the overall empirical loss (called the tree loss) as accurately as possible for a given tree size (speed).\n\nWe can define the tree loss we wish to minimize as:\n\nR(ftree) = ∫ I(ftree(x) ≠ y) dP(x, y) = ∫ max_{i ∈ B(x)={b1(x),...,bD(x)(x)}} I(y ∉ ℓi) dP(x, y)    (1)\n\nwhere I is the indicator function and\n\nbj(x) = argmax_{c : (bj−1(x),c) ∈ E} fc(x)\n\nis the index of the winning (“best”) node at depth j, b0(x) = 0, and D(x) is the depth in the tree of the final prediction for x, i.e. the number of iterations plus one of the repeat block when running Algorithm 1. The tree loss measures an intermediate loss of 1 for each prediction at each depth j of the label tree where the true label is not in the label set ℓ_{bj(x)}. The final loss for a single example is the max over these losses, because if any one of these classifiers makes a mistake then, regardless of the other predictions, the wrong class will still be predicted. Hence, any algorithm wishing to optimize the overall tree loss should train all the nodes jointly with respect to this maximum.\n\nWe will now describe how we propose to learn the parameters T of our label tree. In the next subsection we show how to minimize the tree loss for a given fixed tree (N, E and L are fixed, F is to be learned). In the following subsection, we describe our algorithm for learning N, E and L.\n\n2.1 Learning with a Fixed Label Tree\n\nLet us suppose we are given a fixed label tree N, E, L chosen in advance. 
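For a fixed tree, the empirical version of the tree loss (1) can be computed by recording, at each depth of the traversal, whether the true label is still inside the winning label set. A small illustrative sketch, assuming dictionary containers for E, F and L (our own layout, not the paper's):

```python
# Illustrative sketch of the empirical tree loss (1).
# E[s]: children of node s, F[c]: score function f_c, L[c]: label set of c.
def tree_loss(examples, E, F, L):
    errors = 0
    for x, y in examples:
        s, mistake = 0, 0
        while len(L[s]) > 1:
            s = max(E[s], key=lambda c: F[c](x))  # winning node b_j(x)
            # intermediate loss: 1 if the true label left the label set
            mistake = max(mistake, int(y not in L[s]))
        errors += mistake  # max over depths, as in (1)
    return errors / len(examples)
```

Note that for a disjoint tree this max coincides with the ordinary 0/1 error of the final prediction, which makes the connection to classification error explicit.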
Our goal is simply to minimize the tree loss (1) over the variables F, given training data {(xi, yi)}i=1,...,m. We follow the standard approach of minimizing the empirical loss over the data, while regularizing our solution. We consider two possible algorithms for solving this problem.\n\nRelaxation 1: Independent convex problems. The simplest (and poorest) procedure is to consider the following relaxation of this problem:\n\nRemp(ftree) = (1/m) Σ_{i=1}^m max_{j ∈ B(xi)} I(yi ∉ ℓj) ≤ (1/m) Σ_{i=1}^m Σ_{j=1}^n I(sgn(fj(xi)) ≠ Cj(yi))\n\nwhere Cj(y) = 1 if y ∈ ℓj and −1 otherwise. The number of errors counted by the approximation cannot be less than the empirical tree loss Remp since, when the approximation incurs zero loss for a particular example, Remp is also zero for that example. However, the approximation can be much larger because of the sum.\n\nOne then further approximates this by replacing the indicator function with the hinge loss and choosing linear (or kernel) models of the form fi(x) = wi⊤ φ(x). We are then left with the following convex problem: minimize\n\nΣ_{j=1}^n ( γ ||wj||² + (1/m) Σ_{i=1}^m ξij )  s.t. ∀i, j:  Cj(yi) fj(xi) ≥ 1 − ξij,  ξij ≥ 0,\n\nwhere we have also added a classical 2-norm regularizer controlled by the hyperparameter γ. In fact, this can be split into n independent convex problems because the hyperplanes wj, j = 1, . . . , n, do not interact in the objective function. We consider this simple relaxation as a baseline approach.\n\nRelaxation 2: Tree Loss Optimization (Joint convex problem). We propose a tighter minimization of the tree loss with the following: minimize\n\n(1/m) Σ_{i=1}^m ξi^α,  with ξi ≥ 0, i = 1, . . . , m,\n\ns.t. 
fr(xi) ≥ fs(xi) − ξi,  ∀r, s : yi ∈ ℓr ∧ yi ∉ ℓs ∧ (∃p : (p, r) ∈ E ∧ (p, s) ∈ E)    (2)\n\nξi ≥ 0, i = 1, . . . , m.    (3)\n\nWhen α is close to zero, the shared slack variables simply count a single error if any of the predictions at any depth of the tree is incorrect, so this is very close to the true optimization of the tree loss. The constraints check, out of all the nodes that share the same parent, whether the one containing the true label in its label set is ranked highest. In practice we set α = 1 and arrive at a convex optimization problem. Nevertheless, unlike relaxation 1, the max is not approximated with a sum. Again, using the hinge loss and a 2-norm regularizer, we arrive at our final optimization problem: minimize\n\nγ Σ_{j=1}^n ||wj||² + (1/m) Σ_{i=1}^m ξi    (4)\n\nsubject to constraints (2) and (3).\n\n2.2 Learning Label Tree Structures\n\nThe previous section shows how to optimize the label predictors F while the nodes N, edges E and label sets L which specify the structure of the tree are fixed in advance. However, we want to be able to learn specific tree structures dependent on our prediction problem such that we minimize the overall tree loss. This section describes an algorithm for learning the parameters N, E and L, i.e. optimizing equation (1) with respect to these parameters.\n\nAlgorithm 2 Learning the Label Tree Structure\n\nTrain k One-vs-Rest classifiers f̄1, . . . , f̄k independently (no tree structure is used).\nCompute the confusion matrix C̄ij = |{(x, yi) ∈ V : argmax_r f̄r(x) = j}| on the validation set V.\nFor each internal node l of the tree, from root to leaf, partition its label set ℓl between its children's label sets Ll = {ℓc : c ∈ Nl}, where Nl = {c ∈ N : (l, c) ∈ E} and ∪_{c∈Nl} ℓc = ℓl, by maximizing\n\n  Rl(Ll) = Σ_{c ∈ Nl} Σ_{yp, yq ∈ ℓc} Apq,  where A = (1/2)(C̄ + C̄⊤) is the symmetrized confusion matrix,\n\nsubject to constraints preventing trivial solutions, e.g. putting all labels in one set (see [4]). This optimization problem (including the appropriate constraints) is a graph cut problem and can be solved with standard spectral clustering, i.e. we use A as the affinity matrix for step 1 of the algorithm given in [21], and then apply all of its other steps (2-6).\nLearn the parameters F of the tree by minimizing (4) subject to constraints (2) and (3).\n\nThe key to the generalization ability of a particular choice of tree structure is the learnability of the label sets ℓ. If some classes are often confused but are in different label sets, the functions F may not be easily learnable, and the overall tree loss will hence be poor. For example, for an image labeling task, a decision in the tree between two label sets, one containing tiger and jaguar labels versus one containing frog and toad labels, is presumably more learnable than (tiger, frog) vs. (jaguar, toad).\n\nIn the following, we consider a learning strategy for disjoint label trees (the methods in the previous section apply to both joint and disjoint trees). We begin by noticing that Remp can be rewritten as:\n\nRemp(ftree) = (1/m) Σ_{i=1}^m max_j ( I(yi ∈ ℓj) Σ_{ȳ ∉ ℓj} C(xi, ȳ) )\n\nwhere C(xi, ȳ) = I(ftree(xi) = ȳ) is the confusion of labeling example xi (with true label yi) with label ȳ instead. That is, the tree loss for a given example is 1 if there is a node j in the tree containing yi, but we predict a different node at the same depth, leading to a prediction not in the label set of j. Intuitively, the confusion of predicting node i instead of j comes about because of the class confusion between the labels y ∈ ℓi and the labels ȳ ∈ ℓj. 
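To make the partition step of Algorithm 2 concrete: the objective rewards placing mutually-confused labels in the same child label set. The sketch below is a dependency-free greedy stand-in for that step (the paper instead solves it with spectral clustering on A, per [20, 21]; the function name and balance heuristic here are our own assumptions):

```python
# Greedy stand-in for Algorithm 2's partition step (NOT the paper's
# spectral clustering): assign each label to the child label set whose
# members it is most confused with, under a balance constraint.
# A is the symmetrized confusion matrix as a nested list.
def split_label_set(labels, A, n_children=2):
    cap = -(-len(labels) // n_children)  # ceil: max labels per child
    children = [[] for _ in range(n_children)]
    # visit the most confusable labels first
    for y in sorted(labels, key=lambda u: -sum(A[u][z] for z in labels)):
        # pick the non-full child maximizing within-set confusion with y
        best = max((c for c in children if len(c) < cap),
                   key=lambda c: sum(A[y][z] for z in c))
        best.append(y)
    return children
```

Applied recursively from the root, this produces a roughly balanced disjoint label tree.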
Hence, to provide the smallest tree loss we\nwant to group together labels into the same label set that are likely to be confused at test time.\nUnfortunately we do not know the confusion matrix of a particular tree without training it \ufb01rst, but\nas a proxy we can use the class confusion matrix of a surrogate classi\ufb01er with the supposition that\nthe matrices will be highly correlated. This motivates the proposed Algorithm 2. The main idea is\nto recursively partition the labels into label sets between which there is little confusion (measuring\nconfusion using One-vs-Rest as a surrogate classi\ufb01er) solving at each step a graph cut problem\nwhere standard spectral clustering is applied [20, 21]. The objective function of spectral clustering\npenalizes unbalanced partitions, hence encouraging balanced trees. (To obtain logarithmic speed-\nups the tree has to be balanced; one could also enforce this constraint directly in the k-means step.)\nThe results in Section 5 show that our learnt trees outperform random structures and in fact match\nthe accuracy of not using a tree at all, while being orders of magnitude faster.\n\n3 Label Embeddings\n\nAn orthogonal angle of attack of the solution of large multi-class problems is to employ shared\nrepresentations for the labelings, which we term label embeddings. Introducing the function \u03c6(y) =\n(0, . . . , 0, 1, 0, . . . , 0) which is a k-dimensional vector with a 1 in the yth position and 0 otherwise,\nwe would like to \ufb01nd a linear embedding E(y) = V \u03c6(y) where V is a de \u00d7 k matrix assuming that\nlabels y \u2208 {1, . . . , k}. Without a tree structure, multi-class classi\ufb01cation is then achieved with:\n\nfembed(x) = argmaxi=1,...,k S (W x, V \u03c6(i))\n\n(5)\n\n4\n\n\fwhere W is a de \u00d7 d matrix of parameters and S(\u00b7,\u00b7) is a measure of similarity, e.g. an inner\nproduct or negative Euclidean distance. 
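Equation (5) can be read as: project the input once with W, then score every class embedding. A minimal dependency-free sketch (the list-of-rows layout for W and V and the function name are our assumptions):

```python
# Illustrative sketch of prediction rule (5): argmax_i S(Wx, V phi(i)).
# W is a d_e x d list of rows; column i of V embeds class i.
def embed_predict(x, W, V, S):
    z = [sum(w_row[j] * x[j] for j in range(len(x))) for w_row in W]  # W x
    k = len(V[0])
    scores = [S(z, [V[r][i] for r in range(len(V))]) for i in range(k)]
    return max(range(k), key=scores.__getitem__)
```

With S an inner product this matches the joint non-convex variant below; with negative Euclidean distance it matches the sequence-of-convex-problems variant.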
This method, unlike label trees, is unfortunately still linear with respect to k. However, it does have better behavior with respect to the feature dimension d, with O(de(d + k)) testing time, compared to methods such as One-vs-Rest, which is O(kd). If the embedding dimension de is much smaller than d this gives a significant saving.\n\nThere are several ways we could train such models. For example, the method of compressed sensing [17] has a similar form to (5), but the matrix V is not learnt but chosen randomly, and only W is learnt. In the next section we will show how we can train such models so that the matrix V captures the semantic similarity between classes, which can improve generalization performance over random choices of V in an analogous way to the improvement of label trees over random trees. Subsequently, we will show how to combine label embeddings with label trees to gain the advantages of both approaches.\n\n3.1 Learning Label Embeddings (Without a Tree)\n\nWe consider two possibilities for learning V and W.\n\nSequence of Convex Problems. Firstly, we consider learning the label embedding by solving a sequence of convex problems using the following method. First, train independent (convex) classifiers fi(x) for each class 1, . . . , k and compute the k × k confusion matrix C̄ over the data (xi, yi), i.e. the same as the first two steps of Algorithm 2. Then, find the label embedding vectors Vi that minimize:\n\nΣ_{i,j=1}^k Aij ||Vi − Vj||²,  where A = (1/2)(C̄ + C̄⊤) is the symmetrized confusion matrix,\n\nsubject to the constraint V⊤DV = I, where Dii = Σ_j Aij (to prevent trivial solutions); this is the same problem solved by Laplacian Eigenmaps [4]. We then obtain an embedding matrix V where similar classes i and j should have a small distance between their vectors Vi and Vj. All that remains is to learn the parameters W of our model. 
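The Laplacian-eigenmaps step above reduces to an eigenproblem on the symmetrized confusion matrix. A sketch of one standard way to solve it (assuming numpy is available; the symmetric normalization trick and the function name are our own, not the paper's):

```python
import numpy as np

# Sketch: embed class labels via the Laplacian-eigenmaps problem above.
# Rows of the returned V are per-class embedding vectors; classes that are
# often confused (large A_ij) receive nearby vectors.
def label_embedding_from_confusion(C, d_e):
    A = 0.5 * (C + C.T)                      # symmetrized confusion matrix
    d = A.sum(axis=1)                        # D_ii = sum_j A_ij
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Generalized problem (D - A) v = lam * D v, rewritten symmetrically as
    # (I - D^{-1/2} A D^{-1/2}) u = lam * u with v = D^{-1/2} u,
    # so the V^T D V = I constraint holds by construction.
    N = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    lam, U = np.linalg.eigh(np.eye(len(C)) - N)
    # drop the trivial constant eigenvector, keep the next d_e
    return d_inv_sqrt[:, None] * U[:, 1:d_e + 1]
```

On a toy confusion matrix with two mutually-confused pairs of classes, the members of each pair end up with nearly identical embedding vectors.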
To do this, we can then train a convex multi-class classifier utilizing the label embedding V: minimize\n\nγ ||W||_FRO + (1/m) Σ_{i=1}^m ξi,\n\nwhere ||·||_FRO is the Frobenius norm, subject to the constraints\n\n||W xi − V φ(yi)||² ≤ ||W xi − V φ(j)||² + ξi,  ∀j ≠ yi,  ξi ≥ 0, i = 1, . . . , m.    (6)\n\nNote that the constraint (6) is linear, as we can multiply out and subtract ||W xi||² from both sides. At test time we employ equation (5) with S(z, z′) = −||z − z′||.\n\nNon-Convex Joint Optimization. The second method is to learn W and V jointly, which requires non-convex optimization. In that case we wish to directly minimize:\n\nγ ||W||_FRO + (1/m) Σ_{i=1}^m ξi\n\nsubject to (W xi)⊤ V φ(yi) ≥ (W xi)⊤ V φ(j) − ξi,  ∀j ≠ yi,\n\nand ||Vi|| ≤ 1, ξi ≥ 0, i = 1, . . . , m. We optimize this using stochastic gradient descent (with randomly initialized weights) [8]. At test time we employ equation (5) with S(z, z′) = z⊤ z′.\n\n3.2 Learning Label Embedding Trees\n\nIn this work, we also propose to combine the use of embeddings and label trees to obtain the advantages of both approaches, which we call the label embedding tree. At test time, the resulting label embedding tree prediction is given in Algorithm 3. The label embedding tree has potentially O(de(d + log(k))) testing speed, depending on the structure of the tree (e.g. 
being balanced).\n\nAlgorithm 3 Label Embedding Tree Prediction Algorithm\n\nInput: test example x, parameters T.\nCompute z = W x. - Cache prediction on example\nLet s = 0. - Start at the root node\nrepeat\n  Let s = argmax_{c:(s,c)∈E} fc(x) = argmax_{c:(s,c)∈E} z⊤E(c). - Traverse to the most confident child.\nuntil |ℓs| = 1 - Until this uniquely defines a single label.\nReturn ℓs.\n\nTo learn a label embedding tree we propose the following minimization problem: minimize\n\nγ ||W||_FRO + (1/m) Σ_{i=1}^m ξi\n\nsubject to the constraints\n\n(W xi)⊤ V φ(r) ≥ (W xi)⊤ V φ(s) − ξi,  ∀r, s : yi ∈ ℓr ∧ yi ∉ ℓs ∧ (∃p : (p, r) ∈ E ∧ (p, s) ∈ E),\n\n||Vi|| ≤ 1, ξi ≥ 0, i = 1, . . . , m.\n\nThis is essentially a combination of the optimization problems defined in the previous two sections. Learning the tree structure for these models can still be achieved using Algorithm 2.\n\n4 Related Work\n\nMulti-class classification is a well studied problem. Most of the prior approaches build upon binary classification and have a classification cost which grows at least linearly with the number of classes k. Common multi-class strategies include one-versus-rest, one-versus-one, label ranking and Decision Directed Acyclic Graph (DDAG). One-versus-rest [25] trains k binary classifiers discriminating each class against the rest and predicts the class whose classifier is the most confident, which yields a linear testing cost O(k). One-versus-one [16] trains a binary classifier for each pair of classes and predicts the class getting the most pairwise preferences, which yields a quadratic testing cost O(k · (k − 1)/2). Label ranking [10] learns to assign a score to each class so that the correct class should get the highest score, which yields a linear testing cost O(k). 
DDAG [23] considers the same\nk \u00b7 (k \u2212 1)/2 classi\ufb01ers as one-versus-one but achieves a linear testing cost O(k). All these methods\nare reported to perform similarly in terms of accuracy [25, 23].\nOnly a few prior techniques achieve sub-linear testing cost. One way is to simply remove labels the\nclassi\ufb01er performs poorly on [11]. Error correcting code approaches [13] on the other hand represent\neach class with a binary code and learn a binary classi\ufb01er to predict each bit. This means that\nthe testing cost could potentially be O(log k). However, in practice, these approaches need larger\nredundant codes to reach competitive performance levels [19]. Decision trees, such as C4.5 [24], can\nalso yield a tree whose depth (and hence test cost) is logarithmic in k. However, testing complexity\nalso grows linearly with the number of training examples making these methods impractical for\nlarge datasets [22].\nFilter tree [7] and Conditional Probability Tree (CPT) [6] are logarithmic approaches that have been\nintroduced recently with motivations similar to ours, i.e. addressing large scale problems with a\nthousand classes or more. Filter tree considers a random binary tree in which each leaf is associated\nwith a class and each node is associated with a binary classi\ufb01er. A test example traverses the tree\nfrom the root. At each node, the node classi\ufb01er decides whether the example is directed to the\nright or to the left subtree, each of which are associated to half of the labels of the parent node.\nFinally, the label of the reached leaf is predicted. Conditional Probability Tree (CPT) relies on a\nsimilar paradigm but builds the tree during training. CPT considers an online setup in which the\nset of classes is discovered during training. Hence, CPT builds the tree greedily: when a new class\nis encountered, it is added by splitting an existing leaf. 
In our case, we consider that the set of\nclasses are available prior to training and propose to tessellate the class label sets such that the node\nclassi\ufb01ers are likely to achieve high generalization performance. This contribution is shown to have\na signi\ufb01cant advantage in practice, see Section 5.\n\n6\n\n\fFinally, we should mention that a related active area of research involves partitioning the feature\nspace rather than the label space, e.g. using hierarchical experts [18], hashing [27] and kd-trees [5].\nLabel embedding is another key aspect of our work when it comes to ef\ufb01ciently handling thousands\nof classes. Recently, [26] proposed to exploit class taxonomies via embeddings by learning to project\ninput vectors and classes into a common space such that the classes close in the taxonomy should\nhave similar representations while, at the same time, examples should be projected close to their\nclass representation. In our case, we do not rely on a pre-existing taxonomy: we also would like\nto assign similar representations to similar classes but solely relying on the training data. In that\nrespect, our work is closer to work in information retrieval [3], which proposes to embed documents\n\u2013 not classes \u2013 for the task of document ranking. Compressed sensing based approaches [17] do\npropose to embed class labels, but rely on a random projection for embedding the vector representing\nclass memberships, with the added advantages of handling problems for which multiple classes are\nactive for a given example. However, relying on a random projection does not allow for the class\nembedding to capture the relation between classes. In our experiments, this aspect is shown to be a\ndrawback, see Section 5. 
Finally, the authors of [2] do propose an embedding approach over class\nlabels, but it is not clear to us if their approach is scalable to our setting.\n\n5 Experimental Study\n\nWe consider three datasets: one publicly available image annotation dataset and two proprietary\ndatasets based on images and textual descriptions of products.\nImageNet Dataset ImageNet [12] is a new image dataset organized according to WordNet [14]\nwhere quality-controlled human-veri\ufb01ed images are tagged with labels. We consider the task of\nannotating images from a set of about 16 thousand labels. We split the data into 2.5M images for\ntraining, 0.8M for validation and 0.8M for testing, removing duplicates between training, validation\nand test sets by throwing away test examples which had too close a nearest neighbor training or\nvalidation example in feature space. Images in this database were represented by a large but sparse\nvector of color and texture features, known as visual terms, described in [15].\nProduct Datasets We had access to a large proprietary database of about 0.5M product descriptions.\nEach product is associated with a textual description, an image, and a label. There are \u224818 thousand\nunique labels. We consider two tasks: predicting the label given the textual description, and predict-\ning the label given the image. For the text task we extracted the most frequent set of 10 thousand\nwords (discounting stop words) to yield a textual dictionary, and represented each document by a\nvector of counts of these words in the document, normalized using tf-idf. For the image task, images\nwere represented by a dense vector of 1024 real values of texture and color features.\nTable 1 summarizes the various datasets. 
Next, we describe the approaches that we compared.\nFlat versus Tree Learning Approaches In Table 2 we compare label tree predictor training meth-\nods from Section 2.1: the baseline relaxation 1 (\u201cIndependent Optimization\u201d) versus our proposed\nrelaxation 2 (\u201cTree Loss Optimization\u201d), both of which learn the classi\ufb01ers for \ufb01xed trees; and we\ncompare our \u201cLearnt Label Tree\u201d structure learning algorithm from Section 2.2 to random struc-\ntures. In all cases we considered disjoint trees of depth 2 with 200 internal nodes. The results show\nthat learnt structure performs better than random structure and tree loss optimization is superior\nto independent optimization. We also compare to three other baselines: One-vs-Rest large margin\nclassi\ufb01ers trained using the passive aggressive algorithm [9], the Filter Tree [7] and the Conditional\nProbability Tree (CPT) [6]. For all algorithms, hyperparameters are chosen using the validation set.\nThe combination of Learnt Label Tree structure and Tree Loss Optimization for the label predictors\nis the only method that is comparable to or better than One-vs-Rest while being around 60\u00d7 faster\nto compute at test time.\nFor ImageNet one could wonder how well using WordNet (a graph of human annotated label sim-\nilarities) to build a tree would perform instead. We constructed a matrix C for Algorithm 2 where\nCij = 1 if there is an edge in the WordNet graph, and 0 otherwise, and used that to learn a label\ntree as before, obtaining 0.99% accuracy using \u201cIndependent Optimization\u201d. 
This is better than a random tree but not as good as using the confusion matrix, implying that the best tree to use is the one adapted to the supervised task of interest.\n\nTable 1: Summary Statistics of the Three Datasets Used in the Experiments.\n\nStatistics | ImageNet | Product Descriptions | Product Images\nTask | image annotation | product categorization | image annotation\nNumber of Training Documents | 2518604 | 417484 | 417484\nNumber of Test Documents | 839310 | 60278 | 60278\nNumber of Validation Documents | 837612 | 105572 | 105572\nNumber of Labels | 15952 | 18489 | 18489\nType of Documents | images | texts | images\nType of Features | visual terms | words | dense image features\nNumber of Features | 10000 | 10000 | 1024\nAverage Feature Sparsity | 97.5% | 99.6% | 0.0%\n\nTable 2: Flat versus Tree Learning Results. Test set accuracies for various tree and non-tree methods on three datasets. Speed-ups compared to One-vs-Rest are given in brackets.\n\nClassifier | Tree Type | Product Desc. | Product Images | ImageNet\nOne-vs-Rest | None (flat) | 37.0% [1×] | 12.6% [1×] | 2.27% [1×]\nFilter Tree | Filter Tree | 14.4% [1285×] | 0.73% [1320×] | 0.59% [1140×]\nConditional Prob. Tree (CPT) | CPT | 26.3% [45×] | 2.20% [115×] | 0.74% [41×]\nIndependent Optimization | Random Tree | 21.3% [59×] | 1.35% [61×] | 0.72% [60×]\nIndependent Optimization | Learnt Label Tree | 27.1% [59×] | 5.95% [61×] | 1.25% [60×]\nTree Loss Optimization | Learnt Label Tree | 39.6% [59×] | 10.6% [61×] | 2.37% [60×]\n\nTable 3: Label Embeddings and Label Embedding Tree Results.\n\nClassifier | Tree Type | ImageNet Accuracy | Speed | Memory | Product Images Accuracy | Speed | Memory\nOne-vs-Rest | None (flat) | 2.27% | 1× | 1.2 GB | 12.6% | 1× | 170 MB\nCompressed Sensing | None (flat) | 0.6% | 3× | 18 MB | 2.27% | 10× | 20 MB\nSeq. Convex Embedding | None (flat) | 2.23% | 3× | 18 MB | 3.9% | 10× | 20 MB\nNon-Convex Embedding | None (flat) | 2.40% | 3× | 18 MB | 14.1% | 10× | 20 MB\nLabel Embedding Tree | Label Tree | 2.54% | 85× | 18 MB | 13.3% | 142× | 20 MB\n\nEmbedding and Embedding Tree Approaches. In Table 3 we compare several label embedding methods: (i) the convex and non-convex methods from Section 3.1; (ii) compressed sensing; and (iii) the label embedding tree from Section 3.2. In all cases we fixed the embedding dimension de = 100. The results show that the random embeddings given by compressed sensing are inferior to learnt embeddings, and that Non-Convex Embedding is superior to Sequential Convex Embedding, presumably because the overall loss, which depends on both W and V, is jointly optimized. Non-Convex Embedding gives results as good as or superior to One-vs-Rest with a modest computational gain (3× or 10× speed-up). Note that we do not detail results on the product descriptions task because no speed-up is gained there from embedding, as the sparsity is already so high; however, the methods still gave good test accuracy (e.g. Non-Convex Embedding yields 38.2%, which should be compared to the methods in Table 2). Finally, combining embedding and label tree learning using the “Label Embedding Tree” of Section 3.2 yields our best method on ImageNet and Product Images, with a speed-up of 85× or 142× respectively and accuracy as good as or better than any other method tested. Moreover, memory usage of this method (and other embedding methods) is significantly less than that of One-vs-Rest.\n\n6 Conclusion\n\nWe have introduced an approach for fast multi-class classification by learning label embedding trees by (approximately) optimizing the overall tree loss. 
Our approach obtained orders-of-magnitude speed-ups compared to One-vs-Rest while yielding accuracy as good as or better, and it outperformed other tree-based and embedding approaches. Our method makes real-time inference feasible for very large multi-class tasks such as web advertising, document categorization and image annotation.

Acknowledgements

We thank Ameesh Makadia for very useful discussions.

References

[1] E. Allwein, R. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research (JMLR), 1:113–141, 2001.
[2] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 17–24. ACM, 2007.
[3] B. Bai, J. Weston, D. Grangier, R. Collobert, C. Cortes, and M. Mohri. Half transductive ranking. In Artificial Intelligence and Statistics (AISTATS), 2010.
[4] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 1:585–592, 2002.
[5] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[6] A. Beygelzimer, J. Langford, Y. Lifshits, G. Sorkin, and A. Strehl. Conditional probability tree estimation analysis and algorithm. In Conference on Uncertainty in Artificial Intelligence (UAI), 2009.
[7] A. Beygelzimer, J. Langford, and P. Ravikumar. Error-correcting tournaments. In International Conference on Algorithmic Learning Theory (ALT), pages 247–262, 2009.
[8] L. Bottou. Stochastic learning. In O. Bousquet and U. von Luxburg, editors, Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelligence, LNAI 3176, pages 146–168. Springer Verlag, Berlin, 2004.
[9] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research (JMLR), 7:551–585, 2006.
[10] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research (JMLR), 2:265–292, 2002.
[11] O. Dekel and O. Shamir. Multiclass-multilabel learning when the label set grows with the number of examples. In Artificial Intelligence and Statistics (AISTATS), 2010.
[12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
[13] T. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research (JAIR), 2:263–286, 1995.
[14] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[15] D. Grangier and S. Bengio. A discriminative kernel-based model to rank images from text queries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8):1371–1384, 2008.
[16] T. Hastie and R. Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(2):451–471, 1998.
[17] D. Hsu, S. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In Neural Information Processing Systems (NIPS), 2009.
[18] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.
[19] J. Langford and A. Beygelzimer. Sensitive error correcting output codes. In Conference on Learning Theory (COLT), pages 158–172, 2005.
[20] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
[21] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849–856, 2002.
[22] T. Oates and D. Jensen. The effects of training set size on decision tree complexity. In International Conference on Machine Learning (ICML), pages 254–262, 1997.
[23] J. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In NIPS, pages 547–553, 2000.
[24] J. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[25] R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research (JMLR), 5:101–141, 2004.
[26] K. Weinberger and O. Chapelle. Large margin taxonomy embedding for document categorization. In NIPS, pages 1737–1744, 2009.
[27] P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 311–321. Society for Industrial and Applied Mathematics, 1993.