{"title": "Decision Jungles: Compact and Rich Models for Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 234, "page_last": 242, "abstract": "Randomized decision trees and forests have a rich history in machine learning and have seen considerable success in application, perhaps particularly so for computer vision. However, they face a fundamental limitation: given enough data, the number of nodes in decision trees will grow exponentially with depth. For certain applications, for example on mobile or embedded processors, memory is a limited resource, and so the exponential growth of trees limits their depth, and thus their potential accuracy. This paper proposes decision jungles, revisiting the idea of ensembles of rooted decision directed acyclic graphs (DAGs), and shows these to be compact and powerful discriminative models for classification. Unlike conventional decision trees that only allow one path to every node, a DAG in a decision jungle allows multiple paths from the root to each leaf. We present and compare two new node merging algorithms that jointly optimize both the features and the structure of the DAGs efficiently. During training, node splitting and node merging are driven by the minimization of exactly the same objective function, here the weighted sum of entropies at the leaves. Results on varied datasets show that, compared to decision forests and several other baselines, decision jungles require dramatically less memory while considerably improving generalization.", "full_text": "Decision Jungles:\n\nCompact and Rich Models for Classi\ufb01cation\n\nJamie Shotton\n\nSebastian Nowozin\n\nToby Sharp\nJohn Winn\n\nMicrosoft Research\n\nPushmeet Kohli\nAntonio Criminisi\n\nAbstract\n\nRandomized decision trees and forests have a rich history in machine learning and\nhave seen considerable success in application, perhaps particularly so for com-\nputer vision. 
However, they face a fundamental limitation: given enough data,\nthe number of nodes in decision trees will grow exponentially with depth. For\ncertain applications, for example on mobile or embedded processors, memory is\na limited resource, and so the exponential growth of trees limits their depth, and\nthus their potential accuracy. This paper proposes decision jungles, revisiting the\nidea of ensembles of rooted decision directed acyclic graphs (DAGs), and shows\nthese to be compact and powerful discriminative models for classi\ufb01cation. Unlike\nconventional decision trees that only allow one path to every node, a DAG in a\ndecision jungle allows multiple paths from the root to each leaf. We present and\ncompare two new node merging algorithms that jointly optimize both the features\nand the structure of the DAGs ef\ufb01ciently. During training, node splitting and node\nmerging are driven by the minimization of exactly the same objective function,\nhere the weighted sum of entropies at the leaves. Results on varied datasets show\nthat, compared to decision forests and several other baselines, decision jungles\nrequire dramatically less memory while considerably improving generalization.\n\n1 Introduction\n\nDecision trees have a long history in machine learning and were one of the \ufb01rst models proposed\nfor inductive learning [14]. Their use for classi\ufb01cation and regression was popularized by the work\nof Breiman [6]. More recently, they have become popular in \ufb01elds such as computer vision and\ninformation retrieval, partly due to their ability to handle large amounts of data and make ef\ufb01cient\npredictions. This has led to successes in tasks such as human pose estimation in depth images [29].\nAlthough trees allow making predictions ef\ufb01ciently, learning the optimal decision tree is an NP-hard\nproblem [15]. In his seminal work, Quinlan proposed ef\ufb01cient approximate methods for learning\ndecision trees [27, 28]. 
Some researchers have argued that learning optimal decision trees could\nbe harmful as it may lead to over\ufb01tting [21]. Over\ufb01tting may be reduced by controlling the model\ncomplexity, e.g. via various stopping criteria such as limiting the tree depth, and post-hoc pruning.\nThese techniques for controlling model complexity impose implicit limits on the type of classi\ufb01-\ncation boundaries and feature partitions that can be induced by the decision tree. A number of\napproaches have been proposed in the literature to regularize tree models without limiting their\nmodelling power. The work in [7] introduced a non-greedy Bayesian sampling-based approach for\nconstructing decision trees. A prior over the space of trees and their parameters induces a posterior\ndistribution, which can be used, for example, to marginalize over all tree models. There are similari-\nties between the idea of randomly drawing multiple trees via a Bayesian procedure and construction\nof random tree ensembles (forests) using bagging, a method shown to be effective in many applica-\ntions [1, 5, 9]. Another approach to improve generalization is via large-margin tree classi\ufb01ers [4].\nWhile the above-mentioned methods can reduce over\ufb01tting, decision trees face a fundamental limi-\ntation: their exponential growth with depth. For large datasets where deep trees have been shown to\nbe more accurate than large ensembles (e.g. [29]), this exponential growth poses a problem for im-\nplementing tree models on memory-constrained hardware such as embedded or mobile processors.\nIn this paper, we investigate the use of randomized ensembles of rooted decision directed acyclic\ngraphs (DAGs) as a means to obtain compact and yet accurate classi\ufb01ers. We call these ensembles\n\u2018decision jungles\u2019, after the popular \u2018decision forests\u2019. We formulate the task of learning each DAG\nin a jungle as an energy minimization problem. 
Building on the information gain measure commonly\nused for training decision trees, we propose an objective that is de\ufb01ned jointly over the features of the\nsplit nodes and the structure of the DAG. We then propose two minimization methods for learning\nthe optimal DAG. Both methods alternate between optimizing the split functions at the nodes of the\nDAG and optimizing the placement of the branches emanating from the parent nodes. As detailed\nlater, they differ in the way they optimize the placement of branches.\nWe evaluate jungles on a number of challenging labelling problems. Our experiments below quantify\na substantially reduced memory footprint for decision jungles compared to standard decision forests\nand several baselines. Furthermore, the experiments also show an important side-bene\ufb01t of jungles:\nour optimization strategy is able to achieve considerably improved generalization for only a small\nextra cost in the number of features evaluated per test example.\nBackground and Prior Work. The use of rooted decision DAGs (\u2018DAGs\u2019 for short) has been\nexplored by a number of papers in the literature.\nIn [16, 26], DAGs were used to combine the\noutputs of C \u00d7 C binary 1-v-1 SVM classi\ufb01ers into a single C-class classi\ufb01er. More recently, in [3],\nDAGs were shown to be a generalization of cascaded boosting.\nIt has also been shown that DAGs lead to accurate predictions while having lower model complex-\nity, subtree replication, and training data fragmentation compared to decision trees. Most existing\nalgorithms for learning DAGs involve training a conventional tree that is later manipulated into a\nDAG. For instance [17] merges same-level nodes which are associated with the same split function.\nThey report performance similar to that of C4.5-trained trees, but with a much reduced number of\nnodes. 
Oliveira [23] used a local search method for constructing DAGs in which tree nodes are re-\nmoved or merged together based on similarity of the underlying sub-graphs and the corresponding\nmessage length reduction. A message-length criterion is also employed by the node merging al-\ngorithm in [24]. Chou [8] investigated k-means clustering for learning decision trees and DAGs\n(similar to \u2018ClusterSearch\u2019 below), though without jointly optimizing the features with the DAG struc-\nture. Most existing work on DAGs has focused on showing how the size and complexity of the\nlearned tree model can be reduced without substantially degrading its accuracy. However, their use\nfor increasing test accuracy has attracted comparatively little attention [10, 20, 23].\nIn this paper we show how jungles, ensembles of DAGs, optimized so as to reduce a well-de\ufb01ned\nobjective function, can produce results which are superior to those of analogous decision tree en-\nsembles, in terms of both model compactness and generalization. Our work is related to [25],\nwhere the authors achieve compact classi\ufb01cation DAGs via post-training removal of redundant sub-\ntrees in forests. In contrast, our probabilistic node merging is applied directly and ef\ufb01ciently during\ntraining, and both saves space and achieves greater generalization for multi-class classi\ufb01cation.\nContributions. 
In summary, our contributions are: (i) we highlight that traditional decision trees\ngrow exponentially in memory with depth, and propose decision jungles as a means to avoid this;\n(ii) we propose and compare two learning algorithms that, within each level, jointly optimize an\nobjective function over both the structure of the graph and the features; (iii) we show that not only\ndo the jungles dramatically reduce memory consumption, but can also improve generalization.\n\n2 Forests and Jungles\n\nBefore delving into the details of our method for learning decision jungles, we \ufb01rst brie\ufb02y discuss\nhow decision trees and forests are used for classi\ufb01cation problems and how they relate to jungles.\n\nFigure 1: Motivation and notation. (a) An example use of a rooted decision DAG for classifying\nimage patches as belonging to grass, cow or sheep classes. Using DAGs instead of trees reduces the\nnumber of nodes and can result in better generalization. For example, differently coloured patches\nof grass (yellow and green) are merged together into node 4, because of similar class statistics. This\nmay encourage generalization by representing the fact that grass may appear as a mix of yellow and\ngreen. (b) Notation for a DAG, its nodes, features and branches. See text for details.\n\nBinary decision trees. A binary decision tree is composed of a set of nodes each with an in-degree\nof 1, except the root node. The out-degree for every internal (split) node of the tree is 2 and for the\nleaf nodes is 0. Each split node contains a binary split function (\u2018feature\u2019) which decides whether an\ninput instance that reaches that node should progress through the left or right branch emanating from\nthe node. Prediction in binary decision trees involves every input starting at the root and moving\ndown as dictated by the split functions encountered at the split nodes. 
Prediction concludes when\nthe instance reaches a leaf node, each of which contains a unique prediction. For classi\ufb01cation trees,\nthis prediction is a normalized histogram over class labels.\nRooted binary decision DAGs. Rooted binary DAGs have a different architecture compared to\ndecision trees and were introduced by Platt et al. [26] as a way of combining binary classi\ufb01ers for\nmulti-class classi\ufb01cation tasks. More speci\ufb01cally, a rooted binary DAG has: (i) one root node, with\nin-degree 0; (ii) multiple split nodes, with in-degree \u2265 1 and out-degree 2; (iii) multiple leaf nodes,\nwith in-degree \u2265 1 and out-degree 0. Note that in contrast to [26], if we have a C-class classi\ufb01cation\nproblem, here we do not necessarily expect to have C DAG leaves. In fact, the leaf nodes are not\nnecessarily pure, and each leaf remains associated with an empirical class distribution.\nClassi\ufb01cation DAGs vs classi\ufb01cation trees. We explain the relationship between decision trees and\ndecision DAGs using the image classi\ufb01cation task illustrated in Fig. 1(a) as an example. We wish\nto classify image patches into the classes: cow, sheep or grass. A labelled set of patches is used to\ntrain a DAG. Since patches corresponding to different classes may have different average intensity,\nthe root node may decide to split them according to this feature. Similarly, the two child nodes may\ndecide to split the patches further based on their chromaticity. This results in grass patches with\ndifferent intensity and chromaticity (bright yellow and dark green) ending up in different subtrees.\nHowever, if we detect that two such nodes are associated with similar class distributions (peaked\naround grass in this case) and merge them, then we get a single node with training examples from\nboth grass types. This helps capture the degree of variability intrinsic to the training data, and reduce\nthe classi\ufb01er complexity. 
While this is clearly a toy example, we hope it gives some intuition as to\nwhy rooted DAGs are expected to achieve the improved generalization demonstrated in Section 4.\n\n3 Learning Decision Jungles\n\nWe train each rooted decision DAG in a jungle independently, though there is scope for merging\nacross DAGs as future work. Our method for training DAGs works by growing the DAG one level\nat a time.1 At each level, the algorithm jointly learns the features and branching structure of the\nnodes. This is done by minimizing an objective function de\ufb01ned over the predictions made by the\nchild nodes emanating from the nodes whose split features are being learned.\n\n1 Jointly training all levels of the tree simultaneously remains an expensive operation [15].\n\nConsider the set of nodes at two consecutive levels of the decision DAG (as shown in Fig. 1b). This\nset consists of the set of parent nodes $N_p$ and a set of child nodes $N_c$. We assume in this work a known\nvalue for $M = |N_c|$. $M$ is a parameter of our method and may vary per level. Let $\\theta_i$ denote the\nparameters of the split feature function $f$ for parent node $i \\in N_p$, and $S_i$ denote the set of labelled\ntraining instances $(x, y)$ that reach node $i$. Given $\\theta_i$ and $S_i$, we can compute the set of instances\nfrom node $i$ that travel through its left and right branches as\n$S_i^L(\\theta_i) = \\{(x, y) \\in S_i \\mid f(\\theta_i, x) \\le 0\\}$ and $S_i^R(\\theta_i) = S_i \\setminus S_i^L(\\theta_i)$, respectively. We use $l_i \\in N_c$ to denote the current assignment of the left\noutward edge from parent node $i \\in N_p$ to a child node, and similarly $r_i \\in N_c$ for the right outward\nedge. Then, the set of instances that reach any child node $j \\in N_c$ is:\n\n$$S_j(\\{\\theta_i\\}, \\{l_i\\}, \\{r_i\\}) = \\Bigg[ \\bigcup_{i \\in N_p \\text{ s.t. } l_i = j} S_i^L(\\theta_i) \\Bigg] \\cup \\Bigg[ \\bigcup_{i \\in N_p \\text{ s.t. } r_i = j} S_i^R(\\theta_i) \\Bigg]. \\qquad (1)$$\n\nThe objective function $E$ associated with the current level of the DAG is a function of $\\{S_j\\}_{j \\in N_c}$.\nWe can now formulate the problem of learning the parameters of the decision DAG as a joint mini-\nmization of the objective over the split parameters $\\{\\theta_i\\}$ and the child assignments $\\{l_i\\}, \\{r_i\\}$. Thus,\nthe task of learning the current level of a DAG can be written as:\n\n$$\\min_{\\{\\theta_i\\}, \\{l_i\\}, \\{r_i\\}} E(\\{\\theta_i\\}, \\{l_i\\}, \\{r_i\\}). \\qquad (2)$$\n\nMaximizing the Information Gain. Although our method can be used for optimizing any objective\n$E$ that decomposes over nodes, including in theory a regression-based objective, for the sake of\nsimplicity we focus in this work on the information gain objective commonly used for classi\ufb01cation\nproblems. The information gain objective requires the minimization of the total weighted entropy\nof instances, de\ufb01ned as:\n\n$$E(\\{\\theta_i\\}, \\{l_i\\}, \\{r_i\\}) = \\sum_{j \\in N_c} |S_j| \\, H(S_j) \\qquad (3)$$\n\nwhere $S_j$ is de\ufb01ned in (1), and $H(S)$ is the Shannon entropy of the class labels $y$ in the training\ninstances $(x, y) \\in S$.\nNote that if the number of child nodes $M$ is equal to twice the number of parent nodes, i.e. $M = 2|N_p|$,\nthen the DAG becomes a tree and we can optimize the parameters of the different nodes\nindependently, as done in standard decision tree training, to achieve optimal results.\n\n3.1 Optimization\n\nThe minimization problem described in (2) is hard to solve exactly. We propose two local search\nbased algorithms for its solution: LSearch and ClusterSearch. As local optimizations, neither is\nlikely to reach a global minimum, but in practice both are effective at minimizing the objective. The\nexperiments below show that the simpler LSearch appears to be more effective.\nLSearch. The LSearch method starts from a feasible assignment of the parameters, and then alter-\nnates between two coordinate descent steps. 
In the \ufb01rst (split-optimization) step, it sequentially goes\nover every parent node $k$ in turn and tries to \ufb01nd the split function parameters $\\theta_k$ that minimize the\nobjective function, keeping the values of $\\{l_i\\}, \\{r_i\\}$ and the split parameters of all other nodes \ufb01xed:\n\nfor $k \\in N_p$:\n$$\\theta_k \\leftarrow \\arg\\min_{\\theta'_k} E(\\theta'_k \\cup \\{\\theta_i\\}_{i \\in N_p \\setminus \\{k\\}}, \\{l_i\\}, \\{r_i\\})$$\n\nThis minimization over $\\theta'_k$ is done by random sampling in a manner similar to decision forest train-\ning [9]. In the second (branch-optimization) step, the algorithm redirects the branches emanating\nfrom each parent node to different child nodes, so as to yield a lower objective:\n\nfor $k \\in N_p$:\n$$l_k \\leftarrow \\arg\\min_{l'_k \\in N_c} E(\\{\\theta_i\\}, l'_k \\cup \\{l_i\\}_{i \\in N_p \\setminus \\{k\\}}, \\{r_i\\})$$\n$$r_k \\leftarrow \\arg\\min_{r'_k \\in N_c} E(\\{\\theta_i\\}, \\{l_i\\}, r'_k \\cup \\{r_i\\}_{i \\in N_p \\setminus \\{k\\}})$$\n\nThe algorithm terminates when no changes are made, and is guaranteed to converge. We found that\na greedy initialization of LSearch (allocating splits to the most energetic parent nodes \ufb01rst) resulted\nin a lower objective after optimization than a random initialization. We also found that a stochastic\nversion of the above algorithm where only a single randomly chosen node was optimized at a time\nresulted in similar reductions in the objective for considerably less compute.\nClusterSearch. The ClusterSearch algorithm also alternates between optimizing the branching vari-\nables and the split parameters, but differs in that it optimizes the branching variables more globally.\nFirst, $2|N_p|$ temporary child nodes are built via conventional tree-based, training-objective mini-\nmization procedures. Second, the temporary nodes are clustered into $M = |N_c|$ groups to produce a\nDAG. 
Node clustering is done via the Bregman information objective optimization technique in [2].\n\n4 Experiments and results\n\nThis section compares testing accuracy and computational performance of our decision jungles with\nstate-of-the-art forests of binary decision trees and their variants on several classi\ufb01cation problems.\n\n4.1 Classi\ufb01cation Tasks and Datasets\n\nWe focus on semantic image segmentation (pixel-wise classi\ufb01cation) tasks, where decision forests\nhave proven very successful [9, 19, 29]. We evaluate our jungle model on the following datasets:\n(A) Kinect body part classi\ufb01cation [29] (31 classes). We train each tree or DAG in the ensemble on\na separate 1000 training images with 250 example pixels randomly sampled per image. Following\n[29], 3 trees or DAGs are used unless otherwise speci\ufb01ed. We test on (a common set of) 1000\nimages drawn randomly from the MSRC-5000 test set [29]. We use a DAG merging schedule of\n$|N_c^D| = \\min(M, 2^{\\min(5,D)} \\cdot 1.2^{\\max(0,D-5)})$, where $M$ is a \ufb01xed constant maximum width and $D$ is\nthe current level (depth) in the tree.\n(B) Facial features segmentation [18] (8 classes including background). We train each of 3 trees or\nDAGs in the ensemble on a separate 1000 training images using every pixel. We use a DAG merging\nschedule of $|N_c^D| = \\min(M, 2^D)$.\n(C) Stanford background dataset [12] (8 classes). We train on all 715 labelled images, seeding\nour feature generator differently for each of 3 trees or DAGs in the ensemble. Again, we use a DAG\nmerging schedule of $|N_c^D| = \\min(M, 2^D)$.\n(D) UCI data sets [22]. We use 28 classi\ufb01cation data sets from the UCI corpus as prepared on the\nlibsvm data set repository.2 For each data set all instances from the training, validation, and test set,\nif available, are combined into a large set of instances. We repeat the following procedure \ufb01ve times:\nrandomly permute the instances, and divide them 50/50 into training and testing set. Train on the\ntraining set, evaluate the multiclass accuracy on the test set. We use 8 trees or DAGs per ensemble.\nFurther details regarding parameter choices can be found in the supplementary material.\nFor all segmentation tasks we use the Jaccard index (intersection over union) as adopted in PASCAL\nVOC [11]. Note that this measure is stricter than e.g. the per class average metric reported in [29].\nOn the UCI dataset we report the standard classi\ufb01cation accuracy numbers. In order to keep training\ntime low, the training sets are somewhat reduced compared to the original sources, especially for\n(A). However, identical trends were observed in limited experiments with more training data.\n\n2 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/\n\n4.2 Baseline Algorithms\n\nWe compare our decision jungles with several tree-based alternatives, listed below.\nStandard Forests of Trees. We have implemented standard classi\ufb01cation forests, as described in [9]\nand building upon their publicly available implementation.\nBaseline 1: Fixed-Width Trees (A). As a \ufb01rst variant on forests, we train binary decision trees\nwith an enforced maximum width M at each level, and thus a reduced memory footprint. This is\nuseful to tease out whether the improved generalization of jungles is due more to the reduced model\ncomplexity or to the node merging. Training a tree with \ufb01xed width is achieved by ranking the leaf\nnodes i at each level by decreasing value of E(Si) and then greedily splitting only the M/2 nodes\nwith highest value of the objective. The leaves that are not split are discarded.\nBaseline 2: Fixed-Width Trees (B). A related, second tree-based variant is obtained by greedily\noptimizing the best split candidate for all leaf nodes, then ranking the leaves by reduction in the\nobjective, and greedily taking only the M/2 splits that most reduce the objective.3 The leaf nodes\nthat are not split are discarded from further consideration.\nBaseline 3: Priority Scheduled Trees. As a \ufb01nal variant, we consider priority-driven tree train-\ning. Current leaf nodes are ranked by the reduction in the objective that would be achieved by\nsplitting them. At each iteration, the top M nodes are split, optimal splits computed and the new\nchildren added into the priority queue. This baseline is identical to baseline 2 above, except that\nnodes that are not split at a particular iteration are part of the ranking at subsequent iterations. This\ncan be seen as a form of tree pruning [13], and in the limit, will result in standard binary decision\ntrees. As shown later, the trees at intermediate iterations can give surprisingly good generalization.\n\nFigure 2: Accuracy comparisons. Each graph compares Jaccard scores for jungles vs. standard\ndecision forests and three other baselines. (a, b, c) Segmentation accuracy as a function of the total\nnumber of nodes in the ensemble (i.e. memory usage) for three different datasets. (d, e, f) Segmenta-\ntion accuracy as a function of the maximum number of test comparisons per pixel (maximum depth\n\u00d7 size of ensemble), for the same datasets. Jungles achieve the same accuracy with fewer nodes.\nJungles also improve the overall generalization of the resulting classi\ufb01er.\n\n4.3 Comparative Experiments\n\nPrediction Accuracy vs. Model Size. One of our two main hypotheses is that jungles can reduce the\namount of memory used compared to forests. To investigate this we compared jungles to the baseline\nforests on three different datasets. The results are shown in Fig. 2 (top row). Note that the jungles\nof merged DAGs achieve the same accuracy as the baselines with substantially fewer total nodes.\nFor example, on the Kinect dataset, to achieve an accuracy of 0.2, the jungle requires around 3000\nnodes whereas the standard forest requires around 22000 nodes. 
We use the total number of nodes as\na proxy for memory usage; the two are strongly linked, and the proxy works well in practice. For\nexample, the forest of 3 trees occupied 80MB on the Kinect dataset vs. 9MB for a jungle of 3 DAGs.\nOn the Faces dataset the forest of 3 trees occupied 7.17MB vs. 1.72MB for 3 DAGs.\nA second hypothesis is that merging provides a good way to regularize the training and thus increases\ngeneralization. Firstly, observe how all tree-based baselines saturate and in some cases start to\nover\ufb01t as the trees become larger. This is a common effect with deep trees and small ensembles.\nOur merged DAGs appear to be able to avoid this over\ufb01tting (at least in as far as we have trained\nthem here), and further, actually have increased the generalization quite considerably.\n\n3 In other words, baseline 1 optimizes the most energetic nodes, whereas baseline 2 optimizes all nodes and\ntakes only the splits that most reduce the objective.\n\n[Figure 2 plots (Kinect, Faces, and Stanford Background datasets): test segmentation accuracy against total number of nodes and against max. no. feature evaluations / pixel, for Standard Trees, the tree baselines, and Merged DAGs.]\n\nFigure 3: (a, b) Effect of ensemble size on test accuracy. (a) plots accuracy against the total\nnumber of nodes in the ensemble, whereas (b) plots accuracy against the maximum number of com-\nputations required at test time. For a \ufb01xed ensemble size jungles of DAGs achieve consistently\nbetter generalization than conventional forests. (c) Effect of merging parameter M on test accu-\nracy. The model width M has a regularizing effect on our DAG model. For other results shown on\nthis dataset, we set M = 256. See text for details.\n\nInterestingly, the width-limited tree-based baselines perform substantially better than the standard\ntree training algorithm, and in particular the priority scheduling appears to work very well, though\nstill inferior to our DAG model. This suggests that both reducing the model size and node merging\nhave a substantial positive effect on generalization.\nPrediction Accuracy vs. Depth. We do not expect the reduction in memory given by merging to\ncome for free: there is likely to be a cost in terms of the number of nodes evaluated for any individual\ntest example. Fig. 2 (bottom row) shows this trade-off. The large gains in memory footprint and\naccuracy come at a relatively small cost in the number of feature evaluations at test time. Again,\nhowever, the improved generalization is also evident. The need to train deeper also has some effect\non training time. For example, training 3 trees for Kinect took 32mins vs. 50mins for 3 DAGs.\nEffect of Ensemble Size. Fig. 
3 (a, b) compares results for 1, 3, and 9 trees/DAGs in a forest/jungle.\nNote from (a) that in all cases, a jungle of DAGs uses substantially less memory than a standard\nforest for the same accuracy, and also that the merging consistently increases generalization. In\n(b) we can see again that this comes at a cost in terms of test time evaluations, but note that the\nupper-envelope of the curves belongs in several regions to DAGs rather than trees.\nLSearch vs. ClusterSearch Optimization. In experiments we observed the LSearch algorithm to\nperform better than the ClusterSearch optimization, both in terms of the objective achieved (reported\nin the table below for the face dataset) and also in test accuracy. The difference is slight, yet very\nconsistent. In our experiments the LSearch algorithm was used with 250 iterations.\n\nNumber of nodes          2047   5631   10239  20223  30207  40191\nLSearch objective        0.735  0.596  0.514  0.423  0.375  0.343\nClusterSearch objective  0.739  0.605  0.524  0.432  0.382  0.351\n\nEffect of Model Width. We performed an experiment investigating changes to M, the maximum\ntree width. Fig. 3 (c) shows the results. The merged DAGs consistently outperform the standard\ntrees both in terms of memory consumption and generalization, for all settings of M evaluated.\nSmaller values of M improve accuracy while keeping memory constant, but must be trained deeper.\nQualitative Image Segmentation Results. Fig. 4 shows some randomly chosen segmentation re-\nsults on both the Kinect and Faces data. On the Kinect data, forests of 9 trees are compared to\njungles of 9 DAGs. The jungles appear to give smoother segmentations than the standard forests,\nperhaps more so than the quantitative results would suggest. On the Faces data, small forests of 3\ntrees are compared to jungles of 3 DAGs, with each model containing only 48k nodes in total.\nResults on UCI Datasets. 
Figure 5 reports the test classi\ufb01cation accuracy as a function of model\nsize for two UCI data sets. The full results for all UCI data sets are reported in the supplementary\nmaterial. Overall using DAGs allows us to achieve higher accuracies at smaller model sizes, but in\nmost cases the generalization performance is not improved or only slightly improved. The largest\nimprovement for DAGs over trees is reported for the largest dataset (Poker).\n\n[Figure 3 plots: test segmentation accuracy on the Kinect dataset against total number of nodes and against max. no. feature evaluations / pixel, for 1, 3, and 9 Standard Trees or Merged DAGs; and on the Faces dataset against total number of nodes, for Standard Trees and Merged DAGs with M = 128, 256, 512.]\n\nFigure 4: Qualitative results. A few example results on the Kinect body parts and face segmentation\ntasks, comparing standard trees and merged DAGs with the same number of nodes.\n\nFigure 5: UCI classi\ufb01cation results for two data sets, MNIST-60k and Poker, eight trees or DAGs\nper ensemble. The MNIST result is typical in that the accuracy improvement of DAGs over trees\nis small but achieved at a smaller number of nodes (memory). The largest UCI data set (Poker, 1M\ninstances) pro\ufb01ts most from the use of randomized DAGs.\n\n5 Conclusion\n\nThis paper has presented decision jungles as ensembles of rooted decision DAGs. These DAGs are\ntrained, level-by-level, by jointly optimizing an objective function over both the choice of split func-\ntion and the structure of the DAG. Two local optimization strategies were evaluated, with an ef\ufb01cient\nmove-making algorithm producing the best results. 
Our evaluation on a number of diverse and challenging\nclassification tasks has shown jungles to improve both memory efficiency and generalization\nfor several tasks compared to conventional decision forests and their variants.\nWe believe that decision jungles can be extended to regression tasks. We also plan to investigate\nmultiply rooted trees and merging between DAGs within a jungle.\nAcknowledgements. The authors would like to thank Albert Montillo for initial investigation of\nrelated ideas.\n\n[Figure 4 image columns: Input Image, Ground Truth, Merged DAGs Segmentation, Standard Trees Segmentation.]\n[Figure 5 plots: multiclass accuracy vs. total number of nodes for 8 standard trees and 8 merged DAGs on datasets \"mnist-60k\" and \"poker\" (10 classes, 5 folds each).]\n\nReferences\n[1] Y. Amit and D. Geman. Randomized inquiries about shape; an application to handwritten digit recognition. Technical Report 401, Dept. of Statistics, University of Chicago, IL, Nov 1994.\n[2] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705\u20131749, Oct. 2005.\n[3] D. Benbouzid, R. Busa-Fekete, and B. K\u00e9gl. Fast classification using sparse decision DAGs. In Proc. Intl Conf. on Machine Learning (ICML), New York, NY, USA, 2012. ACM.\n[4] K. P. Bennett, N. Cristianini, J. Shawe-Taylor, and D. Wu. Enlarging the margins in perceptron decision trees. Machine Learning, 41(3):295\u2013313, 2000.\n[5] L. Breiman. Random forests. Machine Learning, 45(1), 2001.\n[6] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. Chapman and Hall/CRC, 1984.\n[7] H. Chipman, E. I. George, and R. E. McCulloch. 
Bayesian CART model search. Journal of the American Statistical Association, 93:935\u2013960, 1997.\n[8] P. Chou. Optimal partitioning for classification and regression trees. IEEE Trans. PAMI, 13(4), 1991.\n[9] A. Criminisi and J. Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Springer, 2013.\n[10] T. Elomaa and M. K\u00e4\u00e4ri\u00e4inen. On the practice of branching program boosting. In European Conf. on Machine Learning (ECML), 2001.\n[11] M. Everingham, L. van Gool, C. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. http://www.pascal-network.org/challenges/VOC/.\n[12] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In Proc. IEEE ICCV, 2009.\n[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.\n[14] E. B. Hunt, J. Marin, and P. T. Stone. Experiments in Induction. Academic Press, New York, 1966.\n[15] L. Hyafil and R. L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15\u201317, 1976.\n[16] B. Kijsirikul, N. Ussivakul, and S. Meknavin. Adaptive directed acyclic graphs for multiclass classification. In Pacific Rim Intl Conference on Artificial Intelligence (PRICAI), 2002.\n[17] R. Kohavi and C.-H. Li. Oblivious decision trees, graphs, and top-down pruning. In Intl Joint Conf. on Artificial Intelligence (IJCAI), 1995.\n[18] P. Kontschieder, P. Kohli, J. Shotton, and A. Criminisi. GeoF: Geodesic forests for learning coupled predictors. In Proc. IEEE CVPR, 2013.\n[19] V. Lepetit and P. Fua. Keypoint recognition using randomized trees. IEEE Trans. PAMI, 2006.\n[20] J. Mahoney and R. J. Mooney. Initializing ID5R with a domain theory: some negative results. Technical Report 91-154, Dept. of Computer Science, University of Texas, Austin, TX, 1991.\n[21] K. V. 
S. Murthy and S. L. Salzberg. On growing better decision trees from data. PhD thesis, Johns Hopkins University, 1995.\n[22] D. Newman, S. Hettich, C. Blake, and C. Merz. UCI repository of machine learning databases. Technical Report 28, University of California, Irvine, Department of Information and Computer Science, 1998.\n[23] A. L. Oliveira and A. Sangiovanni-Vincentelli. Using the minimum description length principle to infer reduced ordered decision graphs. Machine Learning, 12, 1995.\n[24] J. J. Oliver. Decision graphs \u2013 an extension of decision trees. Technical Report 92/173, Dept. of Computer Science, Monash University, Victoria, Australia, 1992.\n[25] A. H. Peterson and T. R. Martinez. Reducing decision trees ensemble size using parallel decision DAGs. Intl Journ. on Artificial Intelligence Tools, 18(4), 2009.\n[26] J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In Proc. NIPS, pages 547\u2013553, 2000.\n[27] J. R. Quinlan. Induction of decision trees. Machine Learning, 1986.\n[28] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.\n[29] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, and A. Blake. Efficient human pose estimation from single depth images. IEEE Trans. PAMI, 2013.\n", "award": [], "sourceid": 207, "authors": [{"given_name": "Jamie", "family_name": "Shotton", "institution": "Microsoft Research"}, {"given_name": "Toby", "family_name": "Sharp", "institution": "Microsoft Research"}, {"given_name": "Pushmeet", "family_name": "Kohli", "institution": "Microsoft Research"}, {"given_name": "Sebastian", "family_name": "Nowozin", "institution": "Microsoft Research"}, {"given_name": "John", "family_name": "Winn", "institution": "Microsoft Research"}, {"given_name": "Antonio", "family_name": "Criminisi", "institution": "Microsoft Research"}]}