{"title": "Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 567, "page_last": 575, "abstract": "We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al.", "full_text": "Fast and Balanced: Ef\ufb01cient Label Tree Learning for\n\nLarge Scale Object Recognition\n\nJia Deng1,2, Sanjeev Satheesh1, Alexander C. Berg3, Li Fei-Fei1\n\nComputer Science Department, Stanford University1\nComputer Science Department, Princeton University2\n\nComputer Science Department, Stony Brook University3\n\nAbstract\n\nWe present a novel approach to ef\ufb01ciently learn a label tree for large scale clas-\nsi\ufb01cation with many classes. The key contribution of the approach is a technique\nto simultaneously determine the structure of the tree and learn the classi\ufb01ers for\neach node in the tree. This approach also allows \ufb01ne grained control over the ef-\n\ufb01ciency vs accuracy trade-off in designing a label tree, leading to more balanced\ntrees. Experiments are performed on large scale image classi\ufb01cation with 10184\nclasses and 9 million images. 
We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al.\n\n1 Introduction\n\nClassification problems with many classes arise in many important domains and pose significant computational challenges. One prominent example is recognizing tens of thousands of visual object categories, one of the grand challenges of computer vision. The large number of classes renders the standard one-versus-all multiclass approach too costly, as the complexity grows linearly with the number of classes, for both training and testing, making it prohibitive for practical applications that require low latency or high throughput, e.g. those in robotics or in image retrieval.\n\nClassification with many classes has received increasing attention recently and most approaches appear to have converged to tree based models [2, 3, 9, 1]. In particular, Bengio et al. [1] propose a label tree model, which has been shown to achieve state of the art performance in testing. In a label tree, each node is associated with a subset of class labels and a linear classifier that determines which branch to follow. In performing the classification task, a test example travels from the root of the tree to a leaf node associated with a single class label. Therefore for a well balanced tree, the time required for evaluation is reduced from O(DK) to O(D log K), where K is the number of classes and D is the feature dimensionality. The technique can be combined with an embedding technique, so that the evaluation cost can be further reduced to O(\u02dcD log K + D\u02dcD), where \u02dcD \u226a D is the dimensionality of an embedded label space.\n\nDespite the success of label trees in addressing testing efficiency, the learning technique, critical to ensuring good testing accuracy and efficiency, has several limitations.
Learning the tree structure (determining how to split the classes into subsets) involves first training one-vs-all classifiers for all K classes to obtain a confusion matrix, and then using spectral clustering to split the classes into disjoint subsets. First, learning one-vs-all classifiers is costly for a large number of classes. Second, the partitioning of classes does not allow overlap, which can make classification unnecessarily difficult. Third, the tree structure may be unbalanced, which can result in sub-optimal test efficiency.\n\nIn this paper, we address these issues by observing that (1) determining the partition of classes and learning a classifier for each child can be performed jointly, and (2) allowing overlap of class labels among children leads to an efficient optimization that also enables precise control of the accuracy vs efficiency trade-off, which can in turn guarantee balanced trees. This leads to a novel label tree learning technique that is more efficient and effective. Specifically, we eliminate the one-vs-all training step while improving both efficiency and accuracy in testing.\n\n2 Related Work\n\nOur approach is directly motivated by the label tree embedding technique proposed by Bengio et al. in [1], which is among the few approaches that address sublinear testing cost for multi-class classification problems with a large number of classes and has been shown to outperform alternative approaches including Filter Tree [2] and the Conditional Probability Tree (CPT) [3]. Our contribution is a new technique to achieve more efficient and effective learning for label trees. For a comprehensive discussion on multi-class classification techniques, we refer the reader to [1].\n\nClassifying a large number of object classes has received increasing attention in computer vision as datasets with many classes such as ImageNet [7] become available.
One line of work is concerned with developing effective feature representations [13, 16, 15, 10] and achieving state of the art performance. Another direction explores methods for exploiting the structure between object classes. In particular, it has been observed that object classes can be organized in a tree-like structure both semantically and visually [9, 11, 6], making tree based approaches especially attractive. Our work follows this direction, focusing on effective learning methods for building tree models.\n\nOur framework of explicitly controlling accuracy or efficiency is connected to Weiss et al.\u2019s work [14] on building a cascade of graphical models with increasing complexity for structured prediction. Our work differs in that we reduce the label space instead of the model space.\n\n3 Label Tree and Label Tree Learning by Bengio et al.\n\nHere we briefly review the label tree learning technique proposed by Bengio et al. and then discuss the limitations we attempt to address.\n\nA label tree is a tree T = (V, E) with nodes V and edges E. Each node r \u2208 V is associated with a set of class labels \u03ba(r) \u2286 {1, . . . , K}. Let \u03c3(r) \u2282 V be its set of children. For each child c, there is a linear classifier wc \u2208 RD, and we require that its label set is a subset of its parent\u2019s, that is, \u03ba(c) \u2286 \u03ba(r), \u2200c \u2208 \u03c3(r).\n\nTo make a prediction given an input x \u2208 RD, we use Algorithm 1. We travel from the root until we reach a leaf node, at each node following the child that has the largest classifier score. There is a slight difference from the algorithm in [1] in that the leaf node is not required to have only one class label.
If there is more than one label, an arbitrary label from the set is predicted.\n\nAlgorithm 1 Predict the class of x given the root node r\n\ns \u2190 r.\nwhile \u03c3(s) \u2260 \u2205 do\n  s \u2190 arg max_{c \u2208 \u03c3(s)} wc^T x\nend while\nreturn an arbitrary k \u2208 \u03ba(s), or NULL if \u03ba(s) = \u2205.\n\nLearning the tree structure is a fundamentally hard problem because brute force search for the optimal combination of tree structure and classifier weights is intractable. Bengio et al. [1] instead propose to solve two subproblems: learning the tree structure and learning the classifier weights. To learn the tree structure, K one versus all classifiers are trained first to obtain a confusion matrix C \u2208 RK\u00d7K on a validation set. The class labels are then clustered into disjoint sets by spectral clustering with the confusion between classes as the affinity measure. This procedure is applied recursively to build a complete tree. Given the tree structure, all classifier weights are then learned jointly to optimize the misclassification loss of the tree.\n\nWe first analyze the cost of learning by showing that training, with m examples, K classes and D-dimensional features, costs O(mDK). Assume optimistically that the optimization algorithm converges after only one pass of the data and that we use first order methods that cost O(D) at each iteration. Then learning one versus all classifiers costs O(mDK). Spectral clustering only depends on K and does not depend on D or m, and therefore its cost is negligible. In learning the classifier weights on the tree, each training example is affected by only the classifiers on its path, i.e. O(Q log K) classifiers, where Q \u226a K is the number of children of each node. Hence the training cost is O(mDQ log K).
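Algorithm 1 can be sketched in a few lines of Python. The `Node` structure and all names here are illustrative, not taken from the paper's implementation; each node stores its label set, its children, and a weight matrix whose columns score the children.

```python
import numpy as np

class Node:
    """A label tree node: label set kappa(r), children sigma(r), and classifier weights."""
    def __init__(self, labels, children=None, W=None):
        self.labels = labels             # kappa(r): class labels at this node
        self.children = children or []   # sigma(r): child nodes
        self.W = W                       # D x Q matrix; column q scores child q

def predict(root, x):
    """Algorithm 1: follow the highest-scoring child until a leaf is reached."""
    s = root
    while s.children:                         # while sigma(s) is non-empty
        scores = s.W.T @ x                    # w_c^T x for every child c
        s = s.children[int(np.argmax(scores))]
    return next(iter(s.labels), None)         # arbitrary label, or None if empty

# Tiny 2-level example: the root splits {0, 1} into two singleton leaves.
leaf0, leaf1 = Node({0}), Node({1})
root = Node({0, 1}, [leaf0, leaf1], W=np.array([[1.0, -1.0]]))
print(predict(root, np.array([2.0])))   # follows child 0, predicts class 0
```

For a balanced tree this evaluates only Q dot products per level, giving the O(D log K) test cost discussed above.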
This analysis indicates that learning K one versus all classifiers dominates the cost. This is undesirable in large scale learning because with bounded time, accommodating a large number of classes entails using less expressive and lower dimensional features.\n\nMoreover, spectral clustering only produces disjoint subsets. It can be difficult to learn a classifier for disjoint subsets when examples of certain classes cannot be reliably classified to one subset. If such mistakes are made at higher levels of the tree, then it is impossible to recover later. Allowing overlap potentially yields more flexibility and avoids such errors. In addition, spectral clustering does not guarantee balanced clusters and thus cannot ensure a desired speedup. We seek a novel learning technique that overcomes these limitations.\n\n4 New Label Tree Learning\n\nTo address the limitations, we start by considering simple and less expensive alternatives for generating the splits. For example, we can sub-sample the examples for one-vs-all training, or generate the splits randomly, or use a human constructed semantic hierarchy (e.g. WordNet [8]). However, as shown in [1], improperly partitioning the classes can greatly reduce testing accuracy and efficiency. To preserve accuracy, it is important to split the classes such that they can be easily separated. To gain efficiency, it is important to have balanced splits.\n\nWe therefore propose a new technique that jointly learns the splits and classifier weights. By tightly coupling the two, this approach eliminates the need for one-vs-all training and brings the total learning cost down to O(mDQ log K). By allowing overlapping splits and explicitly modeling the accuracy and efficiency trade-off, this approach also improves testing accuracy and efficiency.\n\nOur approach processes one node of the tree at a time, starting with the root node.
It partitions the classes into a fixed number of child nodes and learns the classifier weights for each of the children. It then repeats recursively for each child.\n\nIn learning a tree model, accuracy and efficiency are inherently conflicting goals and some trade-off must be made. Therefore we pose the optimization problem as maximizing efficiency given a constraint on accuracy, i.e. requiring that the error rate cannot exceed a certain threshold. Alternatively one can also optimize accuracy given efficiency constraints. We will first describe the accuracy constrained optimization and then briefly discuss the efficiency constrained variant. In practice, one can choose between the two formulations depending on convenience.\n\nFor the rest of this section, we first express all the desiderata in one single optimization problem (Sec. 4.1), including defining the optimization variables (classifier weights and partitions), objectives (efficiency) and constraints (accuracy). Then in Sec. 4.2 & 4.3 we show how to solve the main optimization by alternating between learning the classifier weights and determining the partitions. We then summarize the complete algorithm (Sec. 4.4) and conclude with an alternative formulation using efficiency constraints (Sec. 4.5).\n\n4.1 Main optimization\n\nFormally, let the current node r represent class labels \u03ba(r) = {1, . . . , K} and let Q be the specified number of children. The goal is to determine: (1) a partition matrix P \u2208 {0, 1}Q\u00d7K that represents the assignment of classes to the children, i.e. Pqk = 1 if class label k appears in child q and Pqk = 0 otherwise; (2) the classifier weights w \u2208 RD\u00d7Q, where column wq holds the classifier weights for child q \u2208 \u03c3(r).\n\nWe measure accuracy by examining whether an example is classified to the correct child, i.e.
a child that includes its true class label. Let x \u2208 RD be a training example and y \u2208 {1, . . . , K} be its true label. Let \u02c6q = arg max_{q \u2208 \u03c3(r)} wq^T x be the child that x follows. Given w, P, x, y, the classification loss at the current node r is then\n\nL(w, x, y, P) = 1 \u2212 P(\u02c6q, y).   (1)\n\nNote that the final prediction of the example is made at a leaf node further down the tree, if the child to follow is not already a leaf node. Therefore L is a lower bound of the actual loss. It is thus important to achieve a smaller L because it could be a bottleneck of the final accuracy.\n\nWe measure efficiency by how fast the set of possible class labels shrinks. Efficiency is maximized when each child has a minimal number of class labels so that an unambiguous prediction can be made; otherwise we incur further cost for traveling down the tree. Given a test example, we define ambiguity as our efficiency measure, i.e. the size of the label set of the child that the example follows, relative to its parent\u2019s size. Specifically, given w and P, the ambiguity for an example x is\n\nA(w, x, P) = (1/K) \u2211_{k=1}^{K} P(\u02c6q, k).   (2)\n\nNote that A \u2208 [0, 1]. A perfectly balanced K-nary tree would result in an ambiguity of 1/K for all examples at each node.\n\nOne important note is that the classification loss (accuracy) and ambiguity (efficiency) measures as defined in Eqn. 1 and Eqn. 2 are local to the current node being considered in greedily building the tree. They serve as proxies to the global accuracy and efficiency of the entire tree. For the rest of this paper, we will omit the \u201clocal\u201d and \u201cglobal\u201d qualifications when it is clear from the context.\n\nLet \u03b5 > 0 be the maximum classification loss we are willing to tolerate. Given a training set (xi, yi), i = 1, . . .
, m, we seek to minimize the average ambiguity of all examples while keeping the classification loss below \u03b5, which leads to the following optimization problem:\n\nOP1. Optimizing efficiency with accuracy constraints.\n\nminimize_{w,P}  (1/m) \u2211_{i=1}^{m} A(w, xi, P)\nsubject to  (1/m) \u2211_{i=1}^{m} L(w, xi, yi, P) \u2264 \u03b5\n            P \u2208 {0, 1}Q\u00d7K.\n\nThere are no further constraints on P other than that its entries are integers 0 and 1. We do not require that the children cover all the classes in the parent. It is legal that one class in the parent can be assigned to none of the children, in which case we give up on the training examples from the class. In doing so, we pay a price on accuracy, i.e. those examples will have a misclassification loss of 1. Therefore a partition P with all zeros is unlikely to be a good solution. We also allow overlap of label sets between children. If we cannot classify the examples from a class perfectly into one of the children, we allow them to go to more than one child. We pay a price on efficiency since we make less progress in eliminating possible class labels. This is different from the disjoint label sets in [1]. Overlapping label sets give more flexibility and in fact lead to simpler optimization, as will become clear in Sec. 4.3.\n\nDirectly solving OP1 is intractable. However, with proper relaxation, we can alternate between optimizing over w and over P, where each is a convex program.\n\n4.2 Learning classifier weights w given partitions P\n\nObserve that fixing P and optimizing over w is similar to learning a multi-class classifier except for the overlapping classes. We relax the loss L by a convex loss \u02dcL similar to the hinge loss:\n\n\u02dcL(w, xi, yi, P) = max{0, 1 + max_{q \u2208 Ai, r \u2208 Bi} (wr^T xi \u2212 wq^T xi)}\n\nwhere Ai = {q | Pq,yi = 1} and Bi = {r | Pr,yi = 0}.
Here Ai is the set of children that contain class yi and Bi is the rest of the children. The responses of the classifiers in Ai are encouraged to be bigger than those in Bi, otherwise the loss \u02dcL increases. It is easily verifiable that \u02dcL upper bounds L. We then obtain the following convex optimization problem.\n\nOP2. Optimizing over w given P.\n\nminimize_w  \u03bb \u2211_{q=1}^{Q} \u2016wq\u2016_2^2 + (1/m) \u2211_{i=1}^{m} \u02dcL(w, xi, yi, P)\n\nNote that here the objective is no longer the ambiguity A. This is because the influence of w on A is typically very small. When the partition P is fixed, w can lower A by classifying examples into the child with the smallest label set. However, the way w classifies examples is mostly constrained by the accuracy cap \u03b5, especially for small \u03b5. Empirically we also found that in optimizing \u02dcL over w, A remains almost constant. Therefore for simplicity we assume that A is constant w.r.t. w, and the optimization becomes minimizing the classification loss to move w to the feasible region. We also added a regularization term \u03bb \u2211_{q=1}^{Q} \u2016wq\u2016_2^2.\n\n4.3 Determining partitions P given classifier weights w\n\nIf we fix w and optimize over P, rearranging terms gives the following integer program.\n\nOP3. Optimizing over P.\n\nminimize_P  A(P) = (1/(mK)) \u2211_{q,k} Pqk \u2211_{i=1}^{m} 1(\u02c6qi = q)\nsubject to  1 \u2212 (1/m) \u2211_{q,k} Pqk \u2211_{i=1}^{m} 1(\u02c6qi = q \u2227 yi = k) \u2264 \u03b5\n            Pqk \u2208 {0, 1}, \u2200q, k.\n\nInteger programming in general is NP-hard. However, this particular integer program can be solved by relaxing it to a linear program and then taking the ceiling of the solution.
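The relax-and-round step for OP3 can be sketched with `scipy.optimize.linprog`. The helper name, the toy counts `n_q`/`n_qk`, and the specific numbers below are all illustrative assumptions, not from the paper; they only exercise the single accuracy constraint plus box constraints described above.

```python
import numpy as np
from scipy.optimize import linprog

def solve_op3_lp(n_q, n_qk, m, eps):
    """Relax OP3's integrality constraint to P in [0, 1], solve the LP,
    then take the ceiling to recover a near-optimal 0/1 partition.
    n_q[q]: #examples routed to child q; n_qk[q, k]: of those, #with label k."""
    Q, K = n_qk.shape
    # Objective: ambiguity A(P) = (1/(mK)) * sum_{q,k} P_qk * n_q[q]
    c = np.repeat(n_q / (m * K), K)
    # Accuracy constraint: 1 - (1/m) * sum_{q,k} P_qk * n_qk[q,k] <= eps
    A_ub = -(n_qk / m).reshape(1, -1)
    b_ub = np.array([eps - 1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1))
    assert res.success
    # Ceiling the (at most one) fractional entry yields an integer solution.
    return np.ceil(res.x - 1e-9).reshape(Q, K).astype(int)

# Toy node: m=100 examples, K=4 classes, Q=2 children;
# child 0 receives 60 examples, child 1 receives 40.
n_q = np.array([60.0, 40.0])
n_qk = np.array([[30.0, 25.0, 5.0, 0.0],    # labels of examples at child 0
                 [0.0, 5.0, 25.0, 10.0]])   # labels of examples at child 1
P = solve_op3_lp(n_q, n_qk, m=100, eps=0.15)
print(P)  # row q: which class labels child q keeps
```

With a single linear constraint besides the box bounds, the LP behaves like a fractional knapsack, so at most one entry of the solution is fractional, matching Lemma 4.1 below with m = 1.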
We show that this solution is in fact near optimal by showing that the number of non-integer entries can be very few, due to the fact that the LP has few constraints other than that the variables lie in [0, 1], and most of the [0, 1] constraints will be active. Specifically we use Lemma 4.1 (proof in supplementary materials) to bound the rounded LP solution in Theorem 4.2.\n\nLemma 4.1. For the LP problem\n\nminimize_x  c^T x\nsubject to  Ax \u2264 b\n            0 \u2264 x \u2264 1,\n\nwhere A \u2208 Rm\u00d7n, m < n, if it is feasible, then there exists an optimal solution with at most m non-integer entries, and such a solution can be found in polynomial time.\n\nTheorem 4.2. Let A* be the optimal value of OP3. A solution P\u2032 can be computed within polynomial time such that A(P\u2032) \u2264 A* + 1/K.\n\nProof. We relax OP3 to an LP by replacing the constraint Pqk \u2208 {0, 1}, \u2200q, k with Pqk \u2208 [0, 1], \u2200q, k. Applying Lemma 4.1, we obtain an optimal solution P\u2033 of the LP with at most 1 non-integer entry. We take the ceiling of the fractional entry and obtain an integer solution P\u2032 to OP3. The value of the LP, a lower bound of A*, increases by at most 1/K, since (1/(mK)) \u2211_{i=1}^{m} 1(\u02c6qi = q) \u2264 1/K, \u2200q.\n\nNote that the ambiguity is a quantity in [0, 1] and K is the number of classes. Therefore for large numbers of classes the rounded solution is almost optimal.\n\n4.4 Summary of algorithm\n\nNow all ingredients are in place for an iterative algorithm to build the tree, except that we need to initialize the partition P or the weights w. We find that a random initialization of P works well in practice. Specifically, for each child, we randomly pick one class, without replacement, from the label set of the parent.
That is, for each row of P, randomly pick a column and set that entry to 1. This is analogous to picking the cluster seeds in the K-means algorithm.\n\nWe summarize the algorithm for building one level of tree nodes in Algorithm 2. The procedure is applied recursively from the root. Note that each training example only affects classifiers on one path of the tree, hence the training cost is O(mD log K) for a balanced tree.\n\nAlgorithm 2 Grow a single node r\n\nInput: Q, \u03b5 and training examples classified into node r by its ancestors.\nInitialize P. For each child, randomly pick one class label from the parent, without replacement.\nfor t = 1 \u2192 T do\n  Fix P, solve OP2 and update w.\n  Fix w, solve OP3 and update P.\nend for\n\n4.5 Efficiency constrained formulations\n\nAs mentioned earlier, we can also optimize accuracy given explicit efficiency constraints. Let \u03b4 be the maximum ambiguity we can tolerate. Let OP1\u2032, OP2\u2032, OP3\u2032 be the counterparts of OP1, OP2 and OP3. We obtain OP1\u2032 by replacing \u03b5 with \u03b4 and switching L(w, xi, yi, P) and A(w, xi, P) in OP1. OP2\u2032 is the same as OP2 because we also treat A as constant and minimize the classification loss L unconstrained. OP3\u2032 can also be formulated in a straightforward manner, and solved nearly optimally by rounding from the LP (Theorem 4.3).\n\nTheorem 4.3. Let L* be the optimal value of OP3\u2032. A solution P\u2032 can be computed within polynomial time such that L(P\u2032) \u2264 L* + max_k \u03c8k, where \u03c8k = (1/m) \u2211_{i=1}^{m} 1(yi = k) is the fraction of training examples from class k.\n\nProof. We relax OP3\u2032 to an LP. Applying Lemma 4.1, we obtain an optimal solution P\u2033 with at most 1 non-integer entry. We take the floor of P\u2033 and obtain a feasible solution P\u2032 to OP3\u2032.
The value of the LP, a lower bound of L*, increases by at most max_k \u03c8k, since (1/m) \u2211_{i=1}^{m} 1(\u02c6qi = q \u2227 yi = k) \u2264 (1/m) \u2211_{i=1}^{m} 1(yi = k) \u2264 max_k \u03c8k, \u2200k, q.\n\nFor a uniform distribution of examples among classes, max_k \u03c8k = 1/K and the rounded solution is near optimal for large K. If the distribution is highly skewed, for example with a heavy tail, then the rounding can give a poor approximation. One simple workaround is to split the big classes into artificial subclasses or treat the classes in the tail as one big class, to \u201cequalize\u201d the distribution. Then the same learning techniques can be applied. In this paper we focus on the near uniform case and leave further discussion of the skewed case as future work.\n\n5 Experiments\n\nWe use two datasets for evaluation: ILSVRC2010 [12] and ImageNet10K [6]. In ILSVRC2010, there are 1.2M images from 1k classes for training, 50k images for validation and 150k images for test. For each image in ILSVRC2010 we compute the LLC [13] feature with SIFT on a 10k codebook and use a two level spatial pyramid (1x1 and 2x2 grids) to obtain a 50k dimensional feature vector. In ImageNet10K, there are 9M images from 10184 classes. We use 50% for training, 25% for validation, and the remaining 25% for testing. For ImageNet10K, we compute LLC similarly except that we use no spatial pyramid, obtaining a 10k dimensional feature vector.\n\nWe use parallel stochastic gradient descent (SGD) [17] for training. SGD is especially suited for large scale learning [4] where the learning is bounded by time and the features can no longer fit into memory (the LLC features take 80G in sparse format). Parallelization makes it possible to use multiple CPUs to improve wall-clock time.\n\nWe compare our algorithm with the original label tree learning method by Bengio et al. [1].
For both algorithms, we fix two parameters, the number of children Q for each node, and the maximum depth H of the tree. The depth of each node is defined as the maximum distance to the root (the root has depth 0).\n\nTable 1: Global accuracy (Acc), training cost (Ctr), and test speedup (Ste) on ILSVRC2010 1K classes (T32,2, T10,3, T6,4) and on ImageNet10K (T101,2). Training and test costs are measured as the average number of vector operations performed per example. Test speedup is the one-vs-all test cost divided by the label tree test cost. Ours outperforms the Bengio et al. [1] approach by achieving comparable or better accuracy and efficiency with less training cost, even when the one-vs-all training cost is excluded from the training cost for Bengio et al. [1].\n\n       | T32,2            | T10,3            | T6,4             | T101,2\n       | Acc%  Ctr   Ste  | Acc%  Ctr  Ste   | Acc%  Ctr   Ste  | Acc%  Ctr   Ste\nOurs   | 11.9  259   10.3 | 8.92  104  18.2  | 5.62  50.2  31.3 | 3.4   685   32.4\n[1]    | 8.33  321   10.3 | 5.99  193  15.2  | 5.88  250   9.32 | 2.7   1191  32.4\n\nTable 2: Local classification loss (Eqn. 1) and ambiguity (Eqn. 2) measured at different depth levels for all trees on the ILSVRC2010 test set (1k classes). T6,4 of Bengio et al. is less balanced (large ambiguity). Our trees are more balanced as efficiency is explicitly enforced by capping the ambiguity throughout all levels.\n\nTree                               | T32,2       | T10,3             | T6,4\nDepth                              | 0     1     | 0     1     2     | 0     1     2     3\nClassification loss (%)  Ours      | 49.9  76.1  | 34.6  52.6  71.2  | 30.0  48.8  55.9  64.4\n                         Bengio [1]| 76.6  64.8  | 62.8  53.7  65.3  | 56.2  34.8  37.3  65.8\nAmbiguity (%)            Ours      | 6.49  1.55  | 18.9  18.4  2.96  | 24.7  24.1  23.5  7.15\n                         Bengio [1]| 6.49  1.87  | 19.0  25.9  2.95  | 24.7  59.6  56.5  2.02\n\nWe require every internal node to split into Q children, with two exceptions: nodes at depth H\u22121 (parents of leaves) and nodes with fewer than Q classes.
In both cases, we split the node fully, i.e. grow one child node per class. We use TQ,H to denote a tree built with parameters Q and H. We set Q and H such that, for a well balanced tree, the number of leaf nodes Q^H approximates the number of classes K.\n\nWe evaluate the global classification accuracy and computational cost in both training and test. The main costs of learning consist of two operations, evaluating the gradient and updating the weights, i.e. vector dot products and vector additions (possibly with scaling). We treat both operations as having the same cost\u00b9. To measure the cost, we count the number of vector operations performed per training example. For instance, running SGD one-versus-all (either independent or single machine SVMs [5]) for K classes costs 2K per example for going through the data once, as in each iteration all K classifiers are evaluated against the feature vector (dot product) and updated (addition).\n\nFor both algorithms, we build three trees T32,2, T10,3, T6,4 for the ILSVRC2010 1k classes and build one tree T101,2 for the ImageNet10K classes. For the Bengio et al. method, we first train one-versus-all classifiers with one pass of parallel SGD. This results in a cost of 2000 per example for ILSVRC2010 and 20368 for ImageNet10K. After forming the tree skeleton by spectral clustering using the confusion matrix from the validation set, we learn the weights by solving a joint optimization (see [1]) with two passes of parallel SGD. For our method, we do three iterations of Algorithm 2. In each iteration, we do one pass of parallel SGD to solve OP2\u2032, such that the computation is comparable to that of Bengio et al. (excluding the one-versus-all training). We then solve OP3\u2032 on the validation set to update the partition. To set the efficiency constraint, we measure the average (local) ambiguity of the root node of the tree generated by the Bengio et al. approach, on the validation set.
We use it as our ambiguity cap throughout our learning, in an attempt to produce a similarly structured tree.\n\nWe report the test results in Table 1. The results show that for all types of trees, our method achieves comparable or significantly better accuracy while achieving better speed-up with much less training cost, even after excluding the one-versus-all training in Bengio et al.\u2019s method. It is worth noting that for the Bengio et al. approach, T6,4 fails to further speed up testing compared to the other, shallower trees. The reason is that at depth 1 (one level down from the root), the splits become highly unbalanced and do not shrink the class sets fast enough before the height limit is reached. This is revealed in Table 2, where we measure the average local ambiguity (Eq. 2) and classification loss (Eq. 1) at each depth on the test set to shed more light on the structure of the trees. Observe that our trees have almost constant average ambiguity at each level, as enforced in learning. This shows an advantage of our algorithm, since we are able to explicitly enforce a balanced tree while in Bengio et al. [1] no such control is possible, although spectral clustering encourages balanced splits.\n\n\u00b9 This is inconsequential as a vector addition always pairs with a dot product for all training in this paper.\n\nFigure 1: Comparison of partition matrices (32 \u00d7 1000) of the root node of T32,2 for our approach (top) and the Bengio et al. approach (bottom). Each entry represents the membership of a class label (column) in a child (row). The columns are ordered by a depth first search of WordNet. Columns belonging to certain WordNet subtrees are marked by red boxes.\n\nFigure 2: Paths of the tree T6,4 taken by two test examples. The class labels shown are randomly subsampled to fit into the space.\n\nIn Fig. 1, we visualize the partition matrices of the root of T32,2, for both algorithms.
The columns are ordered by a depth first search of the WordNet tree so that neighboring columns are likely to be semantically similar classes. We observe that for both methods there is visible alignment with the WordNet ordering. We further illustrate the semantic alignment by showing the paths of our T6,4 traveled by two test examples. Also observe that our partition is notably \u201cnoisier\u201d, even though both partitions have the same average ambiguity. This is a result of overlapping partitions, which in fact improves accuracy (as shown in Table 2) because it avoids the mistakes made by forcing all examples of a class to commit to one child.\n\nAlso note that Bengio et al. showed in [1] that optimizing the classifiers on the tree jointly is significantly better than independently training the classifiers for each node, as it encodes the dependency of the classifiers along a tree path. This does not contradict our results. Although we have no explicit joint learning of classifiers over the entire tree, we train the classifiers of each node using examples already filtered by the classifiers of their ancestors, thus implicitly enforcing the dependency.\n\n6 Conclusion\n\nWe have presented a novel approach to efficiently learn a label tree for large scale classification with many classes, allowing a fine grained efficiency-accuracy tradeoff. Experimental results demonstrate more efficient trees with better accuracy and less training cost compared to previous work.\n\nAcknowledgment\n\nL. F-F is partially supported by an NSF CAREER grant (IIS-0845230), the DARPA CSSG grant, and a Google research award.\n\nReferences\n\n[1] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In Advances in Neural Information Processing Systems (NIPS), 2010.\n\n[2] A. Beygelzimer, J. Langford, and P. Ravikumar.
Multiclass classification with filter trees. Preprint, June 2007.\n\n[3] A. Beygelzimer, J. Langford, Y. Lifshits, G. B. Sorkin, and A. L. Strehl. Conditional probability tree estimation analysis and algorithms. Computing Research Repository, 2009.\n\n[4] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20, pages 161\u2013168, 2008.\n\n[5] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265\u2013292, 2002.\n\n[6] J. Deng, A. C. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.\n\n[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.\n\n[8] C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.\n\n[9] G. Griffin and P. Perona. Learning and using taxonomies for fast visual categorization. In CVPR, 2008.\n\n[10] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and T. Huang. Large-scale image classification: Fast feature extraction and SVM training. In CVPR, 2011.\n\n[11] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1958\u20131970, 2008.\n\n[12] http://www.image-net.org/challenges/LSVRC/2010/.\n\n[13] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.\n\n[14] D. Weiss, B. Sapp, and B. Taskar. Sidestepping intractable inference with structured ensemble cascades. In NIPS, 2010.\n\n[15] K. Yu and T. Zhang.
Improved local coordinate coding using local tangents. In ICML, 2010.\n\n[16] X. Zhou, K. Yu, T. Zhang, and T. Huang. Image classification using super-vector coding of local image descriptors. In Computer Vision\u2013ECCV 2010, pages 141\u2013154, 2010.\n\n[17] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595\u20132603, 2010.\n", "award": [], "sourceid": 391, "authors": [{"given_name": "Jia", "family_name": "Deng", "institution": null}, {"given_name": "Sanjeev", "family_name": "Satheesh", "institution": null}, {"given_name": "Alexander", "family_name": "Berg", "institution": null}, {"given_name": "Fei", "family_name": "Li", "institution": null}]}