{"title": "Logarithmic Time Online Multiclass prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 55, "page_last": 63, "abstract": "We study the problem of multiclass classification with an extremely large number of classes (k), with the goal of obtaining train and test time complexity logarithmic in the number of classes. We develop top-down tree construction approaches for constructing logarithmic depth trees. On the theoretical front, we formulate a new objective function, which is optimized at each node of the tree and creates dynamic partitions of the data which are both pure (in terms of class labels) and balanced. We demonstrate that under favorable conditions, we can construct logarithmic depth trees that have leaves with low label entropy. However, the objective function at the nodes is challenging to optimize computationally. We address the empirical problem with a new online decision tree construction procedure. Experiments demonstrate that this online algorithm quickly achieves improvement in test error compared to more common logarithmic training time approaches, which makes it a plausible method in computationally constrained large-k applications.", "full_text": "Logarithmic Time Online Multiclass prediction\n\nAnna Choromanska\nCourant Institute of Mathematical Sciences\nNew York, NY, USA\nachoroma@cims.nyu.edu\n\nJohn Langford\nMicrosoft Research\nNew York, NY, USA\njcl@microsoft.com\n\nAbstract\n\nWe study the problem of multiclass classification with an extremely large number of classes (k), with the goal of obtaining train and test time complexity logarithmic in the number of classes. We develop top-down tree construction approaches for constructing logarithmic depth trees. On the theoretical front, we formulate a new objective function, which is optimized at each node of the tree and creates dynamic partitions of the data which are both pure (in terms of class labels) and balanced. 
We demonstrate that under favorable conditions, we can construct logarithmic depth trees that have leaves with low label entropy. However, the objective function at the nodes is challenging to optimize computationally. We address the empirical problem with a new online decision tree construction procedure. Experiments demonstrate that this online algorithm quickly achieves improvement in test error compared to more common logarithmic training time approaches, which makes it a plausible method in computationally constrained large-k applications.\n\n1 Introduction\n\nThe central problem of this paper is computational complexity in a setting where the number of classes k for multiclass prediction is very large. Such problems occur in natural language (Which translation is best?), search (What result is best?), and detection (Who is that?) tasks. Almost all machine learning algorithms (with the exception of decision trees) have running times for multiclass classification which are O(k) with a canonical example being one-against-all classifiers [1].\nIn this setting, the most efficient possible accurate approach is given by information theory [2]. In essence, any multiclass classification algorithm must uniquely specify the bits of all labels that it predicts correctly on. Consequently, Kraft\u2019s inequality ([2] equation 5.6) implies that the expected computational complexity of predicting correctly is \u03a9(H(Y)) per example where H(Y) is the Shannon entropy of the label. For the worst case distribution on k classes, this implies \u03a9(log(k)) computation is required.\nHence, our goal is achieving O(log(k)) computational time per example (footnote 1) for both training and testing, while effectively using online learning algorithms to minimize passes over the data.\nThe goal of logarithmic (in k) complexity naturally motivates approaches that construct a logarithmic depth hierarchy over the labels, with one label per leaf. 
While this hierarchy is sometimes available through prior knowledge, in many scenarios it needs to be learned as well. This naturally leads to a partition problem which arises at each node in the hierarchy. The partition problem is finding a classifier c : X \u2192 {-1, 1} which divides examples into two subsets with a purer set of labels than the original set. Definitions of purity vary, but canonical examples are the number of labels remaining in each subset, or softer notions such as the average Shannon entropy of the class labels. Despite resulting in a classifier, this problem is fundamentally different from standard binary classification. To see this, note that replacing c(x) with -c(x) is very bad for binary classification, but has no impact on the quality of a partition (footnote 2). The partition problem is fundamentally non-convex for symmetric classes since the average (c(x) - c(x))/2 of c(x) and -c(x) is a poor partition (the always-0 function places all points on the same side).\n\nFootnote 1: Throughout the paper by logarithmic time we mean logarithmic time per example.\nFootnote 2: The problem bears parallels to clustering in this regard.\n\nThe choice of partition matters in problem dependent ways. For example, consider examples on a line with label i at position i and threshold classifiers. In this case, trying to partition class labels {1, 3} from class label 2 results in poor performance.\nThe partition problem is typically solved for decision tree learning via an enumerate-and-test approach amongst a small set of possible classifiers (see e.g. [3]). In the multiclass setting, it is desirable to achieve substantial error reduction for each node in the tree which motivates using a richer set of classifiers in the nodes to minimize the number of nodes, and thereby decrease the computational complexity. 
The main theoretical contribution of this work is to establish a boosting algorithm for learning trees with O(k) nodes and O(log k) depth, thereby addressing the goal of logarithmic time train and test complexity. Our main theoretical result, presented in Section 2.3, generalizes a binary boosting-by-decision-tree theorem [4] to multiclass boosting. As in all boosting results, performance is critically dependent on the quality of the weak learner, supporting intuition that we need sufficiently rich partitioners at nodes. The approach uses a new objective for decision tree learning, which we optimize at each node of the tree. The objective and its theoretical properties are presented in Section 2.\n\nFigure 1: A comparison of One-Against-All (OAA) and the Logarithmic Online Multiclass Tree (LOMtree) with One-Against-All constrained to use the same training time as the LOMtree by dataset truncation and LOMtree constrained to use the same representation complexity as One-Against-All. As the number of class labels grows, the problem becomes harder and the LOMtree becomes more dominant. (Plot \u201cLOMtree vs one-against-all\u201d: accuracy versus number of classes for OAA and LOMtree.)\n\nA complete system with multiple partitions could be constructed top down (as the boosting theorem) or bottom up (as Filter tree [5]). A bottom up partition process appears impossible with representational constraints as shown in Section 6 in the Supplementary material so we focus on top-down tree creation.\nWhenever there are representational constraints on partitions (such as linear classifiers), finding a strong partition function requires an efficient search over this set of classifiers. Efficient searches over large function classes are routinely performed via gradient descent techniques for supervised learning, so they seem like a natural candidate. In existing literature, examples for doing this exist when the problem is indeed binary, or when there is a prespecified hierarchy over the labels and we just need to find partitioners aligned with that hierarchy. Neither of these cases applies: we have multiple labels and want to dynamically create the choice of partition, rather than assuming that one was handed to us. Does there exist a purity criterion amenable to a gradient descent approach? The precise objective studied in theory fails this test due to its discrete nature, and even natural approximations are challenging to tractably optimize under computational constraints. As a result, we use the theoretical objective as a motivation and construct a new Logarithmic Online Multiclass Tree (LOMtree) algorithm for empirical evaluation.\nCreating a tree in an online fashion creates a new class of problems. What if some node is initially created but eventually proves useless because no examples go to it? At best this results in a wasteful solution, while in practice it starves other parts of the tree which need representational complexity. To deal with this, we design an efficient process for recycling orphan nodes into locations where they are needed, and prove that the number of times a node is recycled is at most logarithmic in the number of examples. The algorithm is described in Section 3 and analyzed in Section 3.1.\nAnd is it effective? 
Given the inherent non-convexity of the partition problem, this is unavoidably an empirical question which we answer on a range of datasets varying from 26 to 105K classes in Section 4. We find that under constrained training times, this approach is quite effective compared to all baselines while dominating other O(log k) train time approaches.\nWhat\u2019s new? To the best of our knowledge, the splitting criterion, the boosting statement, the LOMtree algorithm, the swapping guarantee, and the experimental results are all new here.\n\n1.1 Prior Work\n\nOnly a few authors address logarithmic time training. The Filter tree [5] addresses consistent (and robust) multiclass classification, showing that it is possible in the statistical limit. The Filter tree does not address the partition problem as we do here, which, as shown in our experimental section, is often helpful. The partition finding problem is addressed in the conditional probability tree [6], but that paper addresses conditional probability estimation. Conditional probability estimation can be converted into multiclass prediction [7], but doing so is not a logarithmic time operation.\nQuite a few authors have addressed logarithmic testing time while allowing training time to be O(k) or worse. While these approaches are intractable on our larger scale problems, we describe them here for context. The partition problem can be addressed by recursively applying spectral clustering on a confusion graph [8] (other clustering approaches include [9]). Empirically, this approach has been found to sometimes lead to badly imbalanced splits [10]. In the context of ranking, another approach uses k-means hierarchical clustering to recover the label sets for a given partition [11].\nThe more recent work [12] on the multiclass classification problem addresses it via sparse output coding by turning high-cardinality multiclass categorization into a bit-by-bit decoding problem. 
The authors decouple the learning processes of coding matrix and bit predictors and use probabilistic decoding to decode the optimal class label. The authors however specify a class similarity which is O(k^2) to compute (see Section 2.1.1 in [12]), and hence this approach is in a different complexity class than ours (this is also borne out experimentally). The variant of the popular error correcting output code scheme for solving multi-label prediction problems with large output spaces under the assumption of output sparsity was also considered in [13]. Their approach in general requires O(k) running time to decode since, in essence, the fit of each label to the predictions must be checked and there are O(k) labels. Another approach [14] proposes iterative least-squares-style algorithms for multi-class (and multi-label) prediction with a relatively large number of examples and data dimensions, and the work of [15] focuses in particular on cost-sensitive multiclass classification. Both approaches however have O(k) training time.\nDecision trees are naturally structured to allow logarithmic time prediction. Traditional decision trees often have difficulties with a large number of classes because their splitting criteria are not well-suited to the large class setting. However, newer approaches [16, 17] have addressed this effectively at significant scales in the context of multilabel classification (multilabel learning, with missing labels, is also addressed in [18]). More specifically, the first work [16] performs brute force optimization of a multilabel variant of the Gini index defined over the set of positive labels in the node and assumes label independence during random forest construction. Their method makes fast predictions but has high training costs [17]. The second work [17] optimizes a rank sensitive loss function (Discounted Cumulative Gain). 
Additionally, a well-known problem with hierarchical classification is that the performance significantly deteriorates lower in the hierarchy [19] which some authors solve by biasing the training distribution to reduce error propagation while simultaneously combining bottom-up and top-down approaches during training [20].\nThe reduction approach we use for optimizing partitions implicitly optimizes a differential objective. A non-reductive approach to this has been tried previously [21] on other objectives yielding good results in a different context.\n\n2 Framework and theoretical analysis\n\nIn this section we describe the essential elements of the approach, and outline the theoretical properties of the resulting framework. We begin with high-level ideas.\n\n2.1 Setting\n\nWe employ a hierarchical approach for learning a multiclass decision tree structure, training this structure in a top-down fashion. We assume that we receive examples x \u2208 X \u2286 R^d, with labels y \u2208 {1, 2, ..., k}. We also assume access to a hypothesis class H where each h \u2208 H is a binary classifier, h : X \u21a6 {-1, 1}. The overall objective is to learn a tree of depth O(log k), where each node in the tree consists of a classifier from H. The classifiers are trained in such a way that h_n(x) = 1 (h_n denotes the classifier in node n of the tree (footnote 3)) means that the example x is sent to the right subtree of node n, while h_n(x) = -1 sends x to the left subtree. When we reach a leaf, we predict according to the label with the highest frequency amongst the examples reaching that leaf.\n\nFootnote 3: Further in the paper we skip index n whenever it is clear from the context that we consider a fixed tree node.\n\nIn the interest of computational complexity, we want to encourage the number of examples going to the left and right to be fairly balanced. 
For good statistical accuracy, we want to send examples of class i almost exclusively to either the left or the right subtree, thereby refining the purity of the class distributions at subsequent levels in the tree. The purity of a tree node is therefore a measure of whether the examples of each class reaching the node are then mostly sent to one of its child nodes (pure split) or otherwise to both children (impure split). The formal definitions of balancedness and purity are introduced in Section 2.2. An objective expressing both criteria (footnote 4) and resulting theoretical properties are illustrated in the following sections. A key consideration in picking this objective is that we want to effectively optimize it over hypotheses h \u2208 H, while streaming over examples in an online fashion (footnote 5). This seems unsuitable with some of the more standard decision tree objectives such as Shannon or Gini entropy, which leads us to design a new objective. At the same time, we show in Section 2.3 that under suitable assumptions, optimizing the objective also leads to effective reduction of the average Shannon entropy over the entire tree.\n\n2.2 An objective and analysis of resulting partitions\n\nWe now define a criterion to measure the quality of a hypothesis h \u2208 H in creating partitions at a fixed node n in the tree. Let \u03c0_i denote the proportion of label i amongst the examples reaching this node. Let P(h(x) > 0) and P(h(x) > 0 | i) denote the fraction of examples reaching n for which h(x) > 0, marginally and conditional on class i respectively. Then we define the objective (footnote 6):\n\nJ(h) = 2 \u03a3_{i=1}^{k} \u03c0_i |P(h(x) > 0) - P(h(x) > 0 | i)|.   (1)\n\nWe aim to maximize the objective J(h) to obtain high quality partitions. Intuitively, the objective encourages the fraction of examples going to the right from class i to be substantially different from the background fraction for each class i. 
As a concrete simple scenario, if P(h(x) > 0) = 0.5 for some hypothesis h, then the objective prefers P(h(x) > 0 | i) to be as close to 0 or 1 as possible for each class i, leading to pure partitions. We now make these intuitions more formal.\nDefinition 1 (Purity). The hypothesis h \u2208 H induces a pure split if\n\n\u03b1 := \u03a3_{i=1}^{k} \u03c0_i min(P(h(x) > 0 | i), P(h(x) < 0 | i)) \u2264 \u03b4,\n\nwhere \u03b4 \u2208 [0, 0.5), and \u03b1 is called the purity factor.\nIn particular, a partition is called maximally pure if \u03b1 = 0, meaning that each class is sent exclusively to the left or the right. We now give a similar definition for the balancedness of a split.\nDefinition 2 (Balancedness). The hypothesis h \u2208 H induces a balanced split if\n\nc \u2264 P(h(x) > 0) \u2264 1 - c,\n\nwhere c \u2208 (0, 0.5], and \u03b2 := P(h(x) > 0) is called the balancing factor.\nA partition is called maximally balanced if \u03b2 = 0.5, meaning that an equal number of examples are sent to the left and right children of the partition. The balancing factor and the purity factor are related as shown in Lemma 1 (the proofs of Lemma 1 and the following lemma (Lemma 2) are deferred to the Supplementary material).\nLemma 1. For any hypothesis h, and any distribution over examples (x, y), the purity factor \u03b1 and the balancing factor \u03b2 satisfy \u03b1 \u2264 min{(2 - J(h))/(4\u03b2) - \u03b2, 0.5}.\nA partition is called maximally pure and balanced if it satisfies both \u03b1 = 0 and \u03b2 = 0.5. We see that J(h) = 1 for a hypothesis h inducing a maximally pure and balanced partition as captured in the next lemma. Of course we do not expect to have hypotheses producing maximally pure and balanced splits in practice.\nLemma 2. For any hypothesis h : X \u21a6 {-1, 1}, the objective J(h) satisfies J(h) \u2208 [0, 1]. Furthermore, if h induces a maximally pure and balanced partition then J(h) = 1.\n\nFootnote 4: We want an objective to achieve its optimum for simultaneously pure and balanced split. 
The standard entropy-based criteria, such as Shannon or Gini entropy, as well as the criterion we will propose, posed in Equation 1, satisfy this requirement (for the entropy-based criteria see [4], for our criterion see Lemma 2).\nFootnote 5: Our algorithm could also be implemented as batch or streaming, where in the latter case one can for example make one pass through the data per tree level; however, for massive datasets making multiple passes through the data is computationally costly, further justifying the need for an online approach.\nFootnote 6: The proposed objective function exhibits some similarities with the so-called Carnap\u2019s measure [22, 23] used in probability and inductive logic.\n\n2.3 Quality of the entire tree\n\nThe above section helps us understand the quality of an individual split produced by effectively maximizing J(h). We next reason about the quality of the entire tree as we add more and more nodes. We measure the quality of trees using the average entropy over all the leaves in the tree, and track the decrease of this entropy as a function of the number of nodes. Our analysis extends the theoretical analysis in [4], originally developed to show the boosting properties of the decision trees for binary classification problems, to the multiclass classification setting.\nGiven a tree T, we consider the entropy function G_t as the measure of the quality of tree:\n\nG_t = \u03a3_{l \u2208 L} w_l \u03a3_{i=1}^{k} \u03c0_{l,i} ln(1/\u03c0_{l,i}),\n\nwhere \u03c0_{l,i}\u2019s are the probabilities that a randomly chosen data point x drawn from P, where P is a fixed target distribution over X, has label i given that x reaches node l, L denotes the set of all tree leaves, t denotes the number of internal tree nodes, and w_l is the weight of leaf l defined as the probability a randomly chosen x drawn from P reaches leaf l (note that \u03a3_{l \u2208 L} w_l = 1).\nWe next state the main theoretical result of this paper (it is captured in Theorem 1). 
We adopt the weak learning framework. The weak hypothesis assumption, captured in Definition 3, posits that each node of the tree T has a hypothesis h in its hypothesis class H which guarantees simultaneously a \u201cweak\u201d purity and a \u201cweak\u201d balancedness of the split on any distribution P over X. Under this assumption, one can use the new decision tree approach to drive the error below any threshold.\nDefinition 3 (Weak Hypothesis Assumption). Let m denote any node of the tree T, and let \u03b2_m = P(h_m(x) > 0) and P_{m,i} = P(h_m(x) > 0 | i). Furthermore, let \u03b3 \u2208 R+ be such that for all m, \u03b3 \u2208 (0, min(\u03b2_m, 1 - \u03b2_m)]. We say that the weak hypothesis assumption is satisfied when for any distribution P over X at each node m of the tree T there exists a hypothesis h_m \u2208 H such that J(h_m)/2 = \u03a3_{i=1}^{k} \u03c0_{m,i} |P_{m,i} - \u03b2_m| \u2265 \u03b3.\nTheorem 1. Under the Weak Hypothesis Assumption, for any \u03b1 \u2208 [0, 1], to obtain G_t \u2264 \u03b1 it suffices to make t \u2265 (1/\u03b1)^{4(1 - \u03b3)^2 ln k / \u03b3^2} splits.\nWe defer the proof of Theorem 1 to the Supplementary material and provide its sketch now. The analysis studies a tree construction algorithm where we recursively find the leaf node with the highest weight, and choose to split it into two children. Let n be the heaviest leaf at time t. Consider splitting it to two children. The contribution of node n to the tree entropy changes after it splits. This change (entropy reduction) corresponds to a gap in the Jensen\u2019s inequality applied to the concave function, and thus can further be lower-bounded (we use the fact that Shannon entropy is strongly concave with respect to \u2113_1-norm (see e.g., Example 2.5 in Shalev-Shwartz [24])). The obtained lower-bound turns out to depend proportionally on J(h_n)^2. This implies that the larger the objective J(h_n) is at time t, the larger the entropy reduction ends up being, which further reinforces intuitions to maximize J. In general, it might not be possible to find any hypothesis with a large enough objective J(h_n) to guarantee sufficient progress at this point so we appeal to a weak learning assumption. This assumption can be used to further lower-bound the entropy reduction and prove Theorem 1.\n\n3 The LOMtree Algorithm\n\nThe objective function of Section 2 has another convenient form which yields a simple online algorithm for tree construction and training. Note that Equation 1 can be written (details are shown in Section 12 in the Supplementary material) as\n\nJ(h) = 2 E_i[|E_x[1(h(x) > 0)] - E_x[1(h(x) > 0) | i]|].\n\nMaximizing this objective is a discrete optimization problem that can be relaxed as follows\n\nJ(h) = 2 E_i[|E_x[h(x)] - E_x[h(x) | i]|],\n\nwhere E_x[h(x) | i] is the expected score of class i.\nWe next explain our empirical approach for maximizing the relaxed objective. The empirical estimates of the expectations can be easily stored and updated online in every tree node. The decision whether to send an example reaching a node to its left or right child node is based on the sign of the difference between the two expectations: E_x[h(x)] and E_x[h(x) | y], where y is a label of the data point, i.e. when E_x[h(x)] - E_x[h(x) | y] > 0 the data point is sent to the left, else it is sent to the right. This procedure is conveniently demonstrated on a toy example in Section 13 in the Supplement.\nDuring training, the algorithm assigns a unique label to each node of the tree which is currently a leaf. This is the label with the highest frequency amongst the examples reaching that leaf. 
While testing, a test example is pushed down the tree along the path from the root to the leaf, where in each non-leaf node of the path its regressor directs the example either to the left or right child node. The test example is then labeled with the label assigned to the leaf that this example descended to.\n\nAlgorithm 1 LOMtree algorithm (online tree training)\nInput: regression algorithm R, max number of tree non-leaf nodes T, swap resistance R_S\nSubroutine SetNode(v)\n  m_v = \u2205 (m_v(y) - sum of the scores for class y)\n  l_v = \u2205 (l_v(y) - number of points of class y reaching v)\n  n_v = \u2205 (n_v(y) - number of points of class y which are used to train regressor in v)\n  e_v = \u2205 (e_v(y) - expected score for class y)\n  E_v = 0 (expected total score)\n  C_v = 0 (the size of the smallest leaf (footnote 7) in the subtree with root v)\nSubroutine UpdateC(v)\n  While (v \u2260 r AND C_PARENT(v) \u2260 C_v)\n    v = PARENT(v); C_v = min(C_LEFT(v), C_RIGHT(v)) (footnote 8)\nSubroutine Swap(v)\n  Find a leaf s for which (C_s = C_r)\n  s_PA = PARENT(s); s_GPA = GRANDPA(s); s_SIB = SIBLING(s) (footnote 9)\n  If (s_PA = LEFT(s_GPA)) LEFT(s_GPA) = s_SIB Else RIGHT(s_GPA) = s_SIB\n  UpdateC(s_SIB); SetNode(s); LEFT(v) = s; SetNode(s_PA); RIGHT(v) = s_PA\nCreate root r = 0: SetNode(r); t = 1\nFor each example (x, y) do\n  Set j = r\n  Do\n    If (l_j(y) = \u2205)\n      m_j(y) = 0; l_j(y) = 0; n_j(y) = 0; e_j(y) = 0\n    l_j(y)++\n    If (j is a leaf)\n      If (l_j has at least 2 non-zero entries)\n        If (t < T OR C_j - max_i l_j(i) > R_S(C_r + 1))\n          If (t < T)\n            SetNode(LEFT(j)); SetNode(RIGHT(j)); t++\n          Else Swap(j)\n          C_LEFT(j) = \u230aC_j/2\u230b; C_RIGHT(j) = C_j - C_LEFT(j); UpdateC(LEFT(j))\n    If (j is not a leaf)\n      If (E_j > e_j(y)) c = -1 Else c = 1\n      Train h_j with example (x, c): R(x, c)\n      n_j(y)++; m_j(y) += h_j(x); e_j(y) = m_j(y)/n_j(y); E_j = (\u03a3_{i=1}^{k} m_j(i)) / (\u03a3_{i=1}^{k} n_j(i)) (footnote 10)\n      Set j to the child of j corresponding to h_j\n    Else\n      C_j++\n      break\n\nThe training algorithm is detailed in Algorithm 1 where each tree node contains a classifier (we use linear classifiers), i.e. 
h_j is the regressor stored in node j and h_j(x) is the value of the prediction of h_j on example x (footnote 11). The stopping criterion for expanding the tree is when the number of non-leaf nodes reaches a threshold T.\n\nFootnote 7: The smallest leaf is the one with the smallest total number of data points reaching it in the past.\nFootnote 8: PARENT(v), LEFT(v) and RIGHT(v) denote respectively the parent, and the left and right child of node v.\nFootnote 9: GRANDPA(v) and SIBLING(v) denote respectively the grandparent of node v and the sibling of node v, i.e. the node which has the same parent as v.\nFootnote 10: In the implementation both sums are stored as variables thus updating E_v takes O(1) computations.\nFootnote 11: We also refer to this prediction value as the \u2018score\u2019 in this section.\n\n3.1 Swapping\n\nFigure 2: Illustration of the swapping procedure. Left: before the swap, right: after the swap. (Tree diagrams over the nodes r, j, s_GPA, s_PA, s and s_SIB.)\n\nConsider a scenario where the current training example descends to leaf j. The leaf can split (create two children) if the examples that reached it in the past were coming from at least two different classes. However, if the number of non-leaf nodes of the tree reaches threshold T, no more nodes can be expanded and thus j cannot create children. Since the tree construction is done online, some nodes created at early stages of training may end up useless because no examples reach them later on. This prevents potentially useful splits such as at leaf j. This problem can be solved by recycling orphan nodes (subroutine Swap in Algorithm 1). The general idea behind node recycling is to allow nodes to split if a certain condition is met. 
In particular, node j splits if the following holds:\n\nC_j - max_{i \u2208 {1, 2, ..., k}} l_j(i) > R_S (C_r + 1),   (2)\n\nwhere r denotes the root of the entire tree, C_j is the size of the smallest leaf in the subtree with root j, where the smallest leaf is the one with the smallest total number of data points reaching it in the past, l_j is a k-dimensional vector of non-negative integers where the ith element is the count of the number of data points with label i reaching leaf j in the past, and finally R_S is a \u201cswap resistance\u201d. The subtraction of max_{i \u2208 {1, 2, ..., k}} l_j(i) in Equation 2 ensures that a pure node will not be recycled.\nIf the condition in Inequality 2 is satisfied, the swap of the nodes is performed where an orphan leaf s, which was reached by the smallest number of examples in the past, and its parent s_PA are detached from the tree and become children of node j whereas the old sibling s_SIB of an orphan node s becomes a direct child of the old grandparent s_GPA. The swapping procedure is shown in Figure 2. The condition captured in Inequality 2 allows us to prove that the number of times any given node is recycled is upper-bounded by the logarithm of the number of examples whenever the swap resistance is 4 or more (Lemma 3).\nLemma 3. Let the swap resistance R_S be greater than or equal to 4. Then for all sequences of examples, the number of times Algorithm 1 recycles any given node is upper-bounded by the logarithm (with base 2) of the sequence length.\n\n4 Experiments\n\nWe address several hypotheses experimentally.\n\n1. The LOMtree algorithm achieves true logarithmic time computation in practice.\n2. The LOMtree algorithm is competitive with or better than all other logarithmic train/test time algorithms for multiclass classification.\n3. The LOMtree algorithm has statistical performance close to more common O(k) approaches.\n\nTo address these hypotheses, we conducted experiments on a variety of benchmark multiclass datasets: Isolet, Sector, Aloi, ImageNet (ImNet) and ODP (footnote 13). The details of the datasets are provided in Table 1. The datasets were divided into training (90%) and testing (10%). Furthermore, 10% of the training dataset was used as a validation set.\n\nTable 1: Dataset sizes.\n             Isolet   Sector   Aloi     ImNet                 ODP\nsize         52.3MB   19MB     17.7MB   104GB (footnote 12)   3GB\n# features   617      54K      128      6144                  0.5M\n# examples   7797     9619     108K     14.2M                 1577418\n# classes    26       105      1000     ~22K                  ~105K\n\nThe baselines we compared LOMtree with are a balanced random tree of logarithmic depth (Rtree) and the Filter tree [5]. Where computationally feasible, we also compared with a one-against-all classifier (OAA) as a representative O(k) approach. All methods were implemented in the Vowpal Wabbit [25] learning system and have similar levels of optimization. The regressors in the tree nodes for LOMtree, Rtree, and Filter tree as well as the OAA regressors were trained by online gradient descent for which we explored step sizes chosen from the set {0.25, 0.5, 0.75, 1, 2, 4, 8}. We used linear regressors. For each method we investigated training with up to 20 passes through the data and we selected the best setting of the parameters (step size and number of passes) as the one minimizing the validation error. Additionally, for the LOMtree we investigated different settings of the stopping criterion for the tree expansion: T \u2208 {k - 1, 2k - 1, 4k - 1, 8k - 1, 16k - 1, 32k - 1, 64k - 1}, and swap resistance R_S \u2208 {4, 8, 16, 32, 64, 128, 256}.\n\nFootnote 12: compressed\nFootnote 13: The details of the source of each dataset are provided in the Supplementary material.\n\nIn Tables 2 and 3 we report respectively train time and per-example test time (the best performer is indicated in bold). 
Training time (and later reported test error) is not provided for OAA on ImageNet\nand ODP due to intractability^14; both are petabyte-scale computations^15.\n\nTable 2: Training time on selected problems.\n          Isolet   Sector   Aloi\nLOMtree   16.27s   12.77s   51.86s\nOAA       19.58s   18.37s   11m2.43s\n\nTable 3: Per-example test time on all problems.\n          Isolet   Sector   Aloi     ImNet    ODP\nLOMtree   0.14ms   0.13ms   0.06ms   0.52ms   0.26ms\nOAA       0.16ms   0.24ms   0.33ms   0.21s    1.05s\n\nThe first hypothesis is consistent with the experimental results. Time-wise, LOMtree significantly\noutperforms OAA because it builds trees of only close-to-logarithmic depth. The improvement in the\ntraining time increases with the number of classes in the classification problem. For instance, on Aloi\ntraining with LOMtree is 12.8 times faster than with OAA. The same can be said about the test time,\nwhere per-example testing on Aloi, ImageNet and ODP is respectively 5.5, 403.8 and 4038.5\ntimes faster than with OAA. The significant advantage of LOMtree over OAA is also captured in Figure 3.\nNext, in Table 4 (the best logarithmic time performer is indicated in bold) we report test error\nof logarithmic train/test time algorithms. We also show the binomial symmetrical 95% confidence\nintervals for our results. Clearly the second hypothesis is also consistent with the experimental\nresults. Since the Rtree imposes a random label partition, the resulting error it obtains is\ngenerally worse than the error obtained by the competitor methods including LOMtree\nwhich learns the label partitioning directly from the data. 
At the same time LOMtree beats Filter tree on every dataset, though for ImageNet\nand ODP (both have a high level of noise) the advantage of LOMtree is not as significant.\n\nFigure 3: Logarithm of the ratio of per-example test times of OAA and LOMtree on all problems.\n[Plot: log2(time ratio) versus log2(number of classes).]\n\nTable 4: Test error (%) and confidence interval on all problems.\n         LOMtree      Rtree        Filter tree  OAA\nIsolet   6.36\u00b11.71    16.92\u00b12.63   15.10\u00b12.51   3.56\u00b11.30\nSector   16.19\u00b12.33   15.77\u00b12.30   17.70\u00b12.41   9.17\u00b11.82\nAloi     16.50\u00b10.70   83.74\u00b10.70   80.50\u00b10.75   13.78\u00b10.65\nImNet    90.17\u00b10.05   96.99\u00b10.03   92.12\u00b10.04   NA\nODP      93.46\u00b10.12   93.85\u00b10.12   93.76\u00b10.12   NA\n\nThe third hypothesis is weakly consistent with the empirical results. The time advantage of LOMtree\ncomes with some loss of statistical accuracy with respect to OAA where OAA is tractable. We\nconclude that LOMtree significantly closes the gap between other logarithmic time methods and\nOAA, making it a plausible approach in computationally constrained large-k applications.\n\n5 Conclusion\n\nThe LOMtree algorithm reduces the multiclass problem to a set of binary problems organized in a\ntree structure where the partition in every tree node is done by optimizing a new partition criterion\nonline. The criterion guarantees pure and balanced splits leading to logarithmic training and testing\ntime for the tree classifier. We provide theoretical justification for our approach via a boosting\nstatement and empirically evaluate it on multiple multiclass datasets. 
Empirically, we find that this\nis the best available logarithmic time approach for multiclass classification problems.\n\n14 Note, however, that the mechanics of testing datasets are much easier: one can simply test with effectively\nuntrained parameters on a few examples to measure the test speed, thus the per-example test time for OAA on\nImageNet and ODP is provided.\n\n15 Also, to the best of our knowledge, there exist no state-of-the-art results of the OAA performance on these\ndatasets published in the literature.\n\nAcknowledgments\nWe would like to thank Alekh Agarwal, Dean Foster, Robert Schapire and Matus Telgarsky for\nvaluable discussions.\n\nReferences\n[1] R. Rifkin and A. Klautau. In defense of one-vs-all classification. J. Mach. Learn. Res., 5:101\u2013141, 2004.\n[2] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., 1991.\n[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. CRC Press LLC, Boca Raton, Florida, 1984.\n[4] M. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. Journal of Computer and Systems Sciences, 58(1):109\u2013128, 1999 (also In STOC, 1996).\n[5] A. Beygelzimer, J. Langford, and P. D. Ravikumar. Error-correcting tournaments. In ALT, 2009.\n[6] A. Beygelzimer, J. Langford, Y. Lifshits, G. B. Sorkin, and A. L. Strehl. Conditional probability tree estimation analysis and algorithms. In UAI, 2009.\n[7] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.\n[8] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In NIPS, 2010.\n[9] G. Madzarov, D. Gjorgjevikj, and I. Chorbev. A multi-class svm classifier utilizing binary decision tree. Informatica, 33(2):225\u2013233, 2009.\n[10] J. Deng, S. Satheesh, A. C. Berg, and L. Fei-Fei. 
Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS, 2011.\n[11] J. Weston, A. Makadia, and H. Yee. Label partitioning for sublinear ranking. In ICML, 2013.\n[12] B. Zhao and E. P. Xing. Sparse output coding for large-scale visual recognition. In CVPR, 2013.\n[13] D. Hsu, S. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In NIPS, 2009.\n[14] A. Agarwal, S. M. Kakade, N. Karampatziakis, L. Song, and G. Valiant. Least squares revisited: Scalable approaches for multi-class prediction. In ICML, 2014.\n[15] O. Beijbom, M. Saberian, D. Kriegman, and N. Vasconcelos. Guess-averse loss functions for cost-sensitive multiclass boosting. In ICML, 2014.\n[16] R. Agarwal, A. Gupta, Y. Prabhu, and M. Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW, 2013.\n[17] Y. Prabhu and M. Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In ACM SIGKDD, 2014.\n[18] H.-F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale multi-label learning with missing labels. In ICML, 2014.\n[19] T.-Y. Liu, Y. Yang, H. Wan, H.-J. Zeng, Z. Chen, and W.-Y. Ma. Support vector machines classification with a very large-scale taxonomy. In SIGKDD Explorations, 2005.\n[20] P. N. Bennett and N. Nguyen. Refined experts: improving classification in large taxonomies. In SIGIR, 2009.\n[21] A. Montillo, J. Tu, J. Shotton, J. Winn, J.E. Iglesias, D.N. Metaxas, and A. Criminisi. Entanglement and differentiable information gain maximization. Decision Forests for Computer Vision and Medical Image Analysis, 2013.\n[22] K. Tentori, V. Crupi, N. Bonini, and D. Osherson. Comparison of confirmation measures. Cognition, 103(1):107\u2013119, 2007.\n[23] R. Carnap. Logical Foundations of Probability. 2nd ed. Chicago: University of Chicago Press. Par. 87 (pp. 
468-478), 1962.\n[24] S. Shalev-Shwartz. Online learning and online convex optimization. Found. Trends Mach. Learn., 4(2):107\u2013194, 2012.\n[25] J. Langford, L. Li, and A. Strehl. http://hunch.net/~vw, 2007.\n[26] Y. Nesterov. Introductory lectures on convex optimization: a basic course. Applied optimization, Kluwer Academic Publ., 2004.\n[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.", "award": [], "sourceid": 34, "authors": [{"given_name": "Anna", "family_name": "Choromanska", "institution": "Courant Institute, NYU"}, {"given_name": "John", "family_name": "Langford", "institution": "Microsoft Research New York"}]}