{"title": "A Refined Margin Distribution Analysis for Forest Representation Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5530, "page_last": 5540, "abstract": "In this paper, we formulate the forest representation learning approach called \\textsc{CasDF} as an additive model which boosts the augmented feature instead of the prediction. We substantially improve the upper bound of the generalization gap from $\\mathcal{O}(\\sqrt{\\ln m/m})$ to $\\mathcal{O}(\\ln m/m)$, while the margin ratio of the margin standard deviation to the margin mean is sufficiently small. This tighter upper bound inspires us to optimize the ratio. Therefore, we design a margin distribution reweighting approach for deep forest to achieve a small margin ratio by boosting the augmented feature. Experiments confirm the correlation between the margin distribution and generalization performance. We remark that this study offers a novel understanding of \\textsc{CasDF} from the perspective of the margin theory and further guides the layer-by-layer forest representation learning.", "full_text": "A Re\ufb01ned Margin Distribution Analysis for Forest\n\nRepresentation Learning\n\nShen-Huan Lyu, Liang Yang, Zhi-Hua Zhou\n\nNational Key Laboratory for Novel Software Technology\n\nNanjing University, Nanjing, 210023, China\n{lvsh,yangl,zhouzh}@lamda.nju.edu.cn\n\nAbstract\n\nIn this paper, we formulate the forest representation learning approach named\ncasForest as an additive model, and show that the generalization error can be\nbounded by O(ln m/m), when the margin ratio related to the margin standard\ndeviation against the margin mean is suf\ufb01ciently small. This inspires us to optimize\nthe ratio. To this end, we design a margin distribution reweighting approach for the\ndeep forest model to attain a small margin ratio. Experiments con\ufb01rm the relation\nbetween the margin distribution and generalization performance. 
We remark that this study offers a novel understanding of casForest from the perspective of the margin theory and further guides the layer-by-layer forest representation learning.\n\n1 Introduction\n\nIn recent years, deep neural networks have achieved excellent performance in many application scenarios such as face recognition and automatic speech recognition [19]. However, it is well known that deep neural networks are difficult to interpret, which severely restricts their use in application scenarios where the interpretability of the model is crucial. Moreover, deep neural networks are data-hungry, and their performance degrades significantly when the training data are not large enough [12, 20]. In real-world tasks, due to the high cost of data collection and labeling, the amount of labeled training data may be insufficient for deep neural networks. In such situations, conventional learning methods such as support vector machines (SVMs) [7], random forests (RFs) [3], and gradient boosting decision trees (GBDTs) [15, 5] remain good choices.\nRealizing that the essence of deep learning lies in layer-by-layer processing, in-model feature transformation, and sufficient model complexity, Zhou & Feng [32, 33] recently proposed the deep forest model and the gcForest algorithm, which incorporate forest representation learning. It achieves excellent performance on a broad range of tasks, and even performs well on small- or medium-scale data. Later on, a more efficient variant was developed by Pang et al. [21]. Feng & Zhou [13] showed that forests are able to perform auto-encoding, which was considered a specialty of neural networks, and the tree-based multi-layer model can perform hierarchical distributed representation learning, which was likewise thought to be a special feature of neural networks [14]. 
Utkin & Ryabinin [25] proposed a Siamese deep forest as an alternative to the Siamese neural network for metric learning tasks.\nThe cascade forest (abbr. casForest) structure plays an important role in Deep Forest, and it is crucial for the layer-by-layer processing. This paper attempts to explain the benefits of casForest from the perspective of the margin theory.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1.1 Our Results\n\nIn Section 2, we formulate casForest (see the structure in Figure 1) as an additive model (the additive casForest model) to optimize the margin distribution:\n\n$$F(x) = \sum_{t=1}^{T} \alpha_t h_t(x), \quad (1)$$\n\nwhere $\alpha_t$ is a scalar determined by the margin distribution loss function $\ell_{md}$. The input of each random forests block function $\phi_t$ is the raw feature $x$ together with the $(t-1)$-th augmented feature $f_{t-1} = \sum_{l=1}^{t-1} \alpha_l h_l$:\n\n$$h_t(x) = \phi_t([x, f_{t-1}(x)]) = \phi_t\Big(\Big[x, \sum_{l=1}^{t-1} \alpha_l h_l(x)\Big]\Big), \quad (2)$$\n\nso that the t-layer casForest model $h_t \in \mathcal{H}_t$ is defined by such a recursive formula. 
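As an illustration of the recursion in (1)-(2), the forward pass of the additive model can be sketched as follows (a minimal sketch, not the authors' implementation; the `blocks` callables stand in for fitted forest blocks $\phi_t$):

```python
import numpy as np

def cascade_predict(blocks, alphas, X):
    """Evaluate the additive casForest model F(x) = sum_t alpha_t h_t(x).

    blocks[t] is a fitted forest block phi_t mapping a feature matrix to
    per-class scores; alphas[t] is its mixture coefficient.  Every block
    after the first sees the raw features concatenated with the running
    additive prediction f_{t-1} (the augmented feature).
    """
    f = None          # f_0 is empty
    F = 0.0
    for phi, alpha in zip(blocks, alphas):
        inp = X if f is None else np.hstack([X, f])
        h = phi(inp)              # h_t(x) = phi_t([x, f_{t-1}(x)])
        F = F + alpha * h
        f = F                     # f_t = sum_{l<=t} alpha_l h_l
    return F
```

Any per-class scoring function can play the role of a block here; the point is only that the augmented feature fed to layer t is the accumulated weighted sum of earlier layers.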
Unlike traditional boosting, where all the weak classifiers are chosen from the same hypothesis set $\mathcal{H}$, the hypothesis set of the t-layer casForest model contains that of the $(t-1)$-layer^1, i.e., $\mathcal{H}_{t-1} \subset \mathcal{H}_t, \forall t \ge 2$.\nIn Section 3, we provide a margin distribution upper bound for the generalization error of the additive model above:\n\n$$\Pr_D[yF(x) < 0] - \Pr_S[yF(x) < r] \le \lambda\sqrt{\frac{\ln\sum_{t=1}^{T}\alpha_t|\mathcal{H}_t|}{r^2}\cdot\frac{\ln m}{m}} + \frac{\ln\sum_{t=1}^{T}\alpha_t|\mathcal{H}_t|}{r^2}\cdot\frac{\ln m}{m}, \quad (3)$$\n\nwhere $m$ is the size of the training set, $r$ is a margin parameter, $\lambda = \sqrt{\mathrm{Var}[yF(x)]/\mathbb{E}_S^2[yF(x)]}$ is a ratio relating the margin standard deviation to the margin mean, and $yF(x)$ denotes the margin of the sample $x$.\nInspired by our theoretical result, we propose an effective algorithm named margin distribution Deep Forest (see mdDF in Algorithm 2) that encourages optimizing the margin ratio. Extensive experiments validate that mdDF can effectively improve the performance on classification tasks, especially on categorical and mixed modeling tasks.\n\n1.2 Related Work\n\nDeep Forest. Deep Forest [32, 33] is a non-neural-network deep learning model which builds upon decision trees and relies on neither the BP algorithm nor any gradient-based approach. The earliest deep forest algorithm, gcForest [33], is constructed from the multi-grained scanning operation and the casForest structure. The multi-grained scanning operation aims to deal with raw data with spatial or sequential relations. The casForest structure aims at the layer-by-layer processing with in-model feature transformation. It can be viewed as an ensemble approach that utilizes almost all categories of well-known strategies for diversity enhancement, e.g., input feature manipulation and output representation manipulation [30].\n\nMargin theory. The margin theory was used by Schapire et al. 
[24] to explain the resistance of AdaBoost to overfitting, but the theory was then attacked almost to death by Breiman's construction of the Arc algorithm [2]. Later on, it was found that this empirical attack on the margin theory of AdaBoost might be misleading [22], and many theoretical studies tried to gain more understanding, culminating in Gao & Zhou [16]. They finally proved that the margin distribution, which can be improved by increasing the margin mean while decreasing the margin variance, is crucial to the performance of AdaBoost. This inspired the birth of a series of new statistical learning algorithms named ODM [31, 28, 29].\n\n2 Cascade Forest\n\nAs shown in Figure 1, the casForest structure is composed of stacked entities named random forests blocks $\phi_t$. Each random forests block consists of several forest modules, e.g., commonly random forests (abbr. RF) [3] and completely-random forests (abbr. CRF) [32].\n\n^1 The hypothesis set of the random forests block in the t-th layer contains that in the $(t-1)$-th layer without updating the augmented features, i.e., $\alpha_t = 0$. In other words, the in-model transformation [33] is crucial for the recursive formulation.\n\nFigure 1: The standard cascade structure of the deep forest model [33] can be viewed as a layer-by-layer process. The feature augmentation achieves feature enrichment by concatenating the prediction vector with the input feature vector, a process named \"PRECONC\".\n\nSuppose $f_1$ denotes the function of the first-layer forests; then, given the input $x$ to the first layer, the input to the second layer will be $[x, f_1(x)]$, where $[a, b]$ denotes the concatenation of $a$ and $b$ to form a feature vector. Considering that $f_1(x)$ is the prediction from the first layer, we name this process \"PRECONC\" (PREdiction CONCatenation), which is crucial for the feature learning process in deep forest. 
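One PRECONC layer can be sketched as follows, assuming scikit-learn-style forests (an illustrative sketch only: `ExtraTreesClassifier` is used as a stand-in for completely-random forests, and out-of-fold predictions stand in for the k-fold procedure of Algorithm 1):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict

def preconc_layer(X, y, n_trees=100, k=5, seed=0):
    """One cascade layer: average the class vectors of a random forest and
    a completely-random forest (approximated here by ExtraTrees), then
    concatenate them with the input features (PRECONC)."""
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    crf = ExtraTreesClassifier(n_estimators=n_trees, random_state=seed)
    # Out-of-fold class probability vectors, so the augmented feature for a
    # training sample never comes from trees that saw that sample.
    p_rf = cross_val_predict(rf, X, y, cv=k, method="predict_proba")
    p_crf = cross_val_predict(crf, X, y, cv=k, method="predict_proba")
    f = (p_rf + p_crf) / 2          # prediction vector of this layer
    return np.hstack([X, f])        # [x, f(x)]: input to the next layer
```

The returned matrix has the original feature columns followed by one averaged class-probability column per class, which is exactly the concatenated input the next layer receives.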
PRECONC is different from the stacking operation [27, 1] in traditional ensemble learning, where the second-level learners act on the prediction space composed of different base learners and the information of the original input feature space is ignored. Stacking with more than two layers would seriously suffer from overfitting and cannot enable a deep model. In this paper we do not study the factors that enable deep forest to become a deep model; we only focus on the cascade structure.\nFirst, we formulate casForest as an additive model in this section. We consider training and test samples generated i.i.d. from a distribution $D$ over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^n$ is the input space and $\mathcal{Y} = \{1, 2, \ldots, s\}$ is the output space. We denote by $S$ a training set of $m$ samples drawn from $D^m$.\nThe casForest model can be formalized as follows. We use a quadruple form $(\phi, f, D, h)$, where\n\n- Forest block: $\phi = (\phi_1, \phi_2, \ldots, \phi_T)$, where $\phi_t$ denotes the function computed by the random forests block in the t-th layer, which is defined by (4);\n- casForest: $h = (h_1, h_2, \ldots, h_T)$, where $h_t$ denotes the t-layer casForest model defined by (5), and $h_t$ is drawn from the hypothesis set $\mathcal{H}_t$;\n- Augmented feature: $f = (f_1, f_2, \ldots, f_T)$, where $f_t$ denotes the output in the t-th layer, which is defined by (6);\n- Sample distribution: $D = (D_1, D_2, \ldots, D_T)$, where $D_t$ is the updated sample distribution in the t-th layer, and $D_1 = D$.\n\n$\phi_t$ is the function returned by the random forests block (Algorithm 1). The input of the algorithm is the raw training sample S = {(x1, y1), . . . 
, (xm, ym)}, the augmented feature from the previous layer $f_{t-1}(x_i)$, $i \in [m]$, and the reweighting distribution $D_t$:\n\n$$\phi_t = \begin{cases} A_{rfb}([x_i; y_i]_{i=1}^{m}, D_1) & t = 1, \\ A_{rfb}([x_i, f_{t-1}(x_i); y_i]_{i=1}^{m}, D_t) & t > 1. \end{cases} \quad (4)$$\n\nUsing these random forests block functions $\phi_t$, we can define the t-layer casForest model as\n\n$$h_t(x) = \begin{cases} \phi_t(x) & t = 1, \\ \phi_t([x, f_{t-1}(x)]) & t > 1, \end{cases} \quad (5)$$\n\nand $f_t : \mathcal{X} \to \mathcal{C}$ is defined as follows:\n\n$$f_t(x) = \begin{cases} \alpha_t h_t(x) & t = 1, \\ \alpha_t h_t(x) + f_{t-1}(x) & t > 1, \end{cases} \quad (6)$$\n\nwhere $\alpha_t$ and $D_t$ need to be optimized and updated.\n\nAlgorithm 1 Random forests block $A_{rfb}$ [33]\nInput: A training set $S$ drawn from $D_t$ and the augmented feature $f_{t-1}(x_i), \forall i \in [m]$.\nOutput: The function computed by the random forests block in the t-th layer: $\phi_t$.\n1: Divide $S$ into $k$ subsets $\{S_1, \ldots, S_k\}$ randomly.\n2: for $S_i$ in $\{S_1, S_2, \ldots, S_k\}$ do\n3: Use $S/S_i$ to train two random forests and two completely-random forests.\n4: Compute the prediction rate $p_t^i(j)$ for the $j$-th leaf node generated by $S/S_i$.\n5: $\phi_t([x, f_{t-1}(x)]) \leftarrow E_j[p_t^i(j)]$, for any training sample $(x, y) \in S_i$.\n6: end for\n7: $\phi_t([x, f_{t-1}(x)]) \leftarrow E_{i,j}[p_t^i(j)]$, for any test sample $(x, y) \sim D$.\n8: return The function computed by the random forests block in the t-th layer: $\phi_t$.\n\nHere, we find that the t-layer casForest model is defined by a recursive formula:\n\n$$h_t(x) = \phi_t([x, f_{t-1}(x)]) = \phi_t\Big(\Big[x, \sum_{l=1}^{t-1} \alpha_l h_l(x)\Big]\Big). \quad (7)$$\n\nUnlike AdaBoost, where all the weak classifiers are chosen from the same hypothesis set $\mathcal{H}$, the hypothesis set of the t-layer casForest model contains that of the $(t-1)$-layer, similar to the hypothesis sets of deep neural networks (DNNs) at different depths, i.e., $\mathcal{H}_{t-1} \subset \mathcal{H}_t, \forall t \ge 2$.\nThe PRECONC process is difficult to analyze. For simplicity, we do not consider the influence of the feature augmentation process here, though it is very crucial for deep forest. 
Instead, we only consider the hypotheses based on the original feature space, and thus the entire additive cascade model $\tilde{F} : \mathcal{X} \to \mathcal{Y}$ is defined as follows:\n\n$$\tilde{F}(x) = \tilde{\sigma}(F(x)) = \arg\max_{j \in \{1, 2, \ldots, s\}}\Big[\sum_{t=1}^{T} \alpha_t h_t^j(x)\Big], \quad (8)$$\n\nwhere $F(x)$ is the final prediction vector of the casForest model for classification and $\tilde{\sigma}$ denotes a map from the average prediction score vector to a label.\nWith this simplification, the casForest structure is related to Cortes et al. [8, 9] and Huang et al. [17]. However, in the next section we prove that the generalization error of casForest can be bounded by $\mathcal{O}(\ln m/m + \lambda\sqrt{\ln m/m})$ when the margin ratio relating the margin standard deviation to the margin mean is sufficiently small. This bound is tighter than the generalization bound $\mathcal{O}(\sqrt{\ln m/m})$ for Deep Boosting [8, 9, 17].\n\n3 Generalization Analysis\n\nIn this section, we analyze the generalization error to understand the sample complexity of the casForest model. For simplicity, we consider the binary classification^2 task. We define the strong classifier (the T-layer casForest model) as $F(x) = \sum_{t=1}^{T}\alpha_t h_t(x)$, i.e., casForest is formulated as an additive model. Now we define the margin for a sample $(x, y)$ as $yF(x) \in [-1, 1]$, which implies the confidence of the prediction. We assume that the hypothesis set $\mathcal{H}$ of base classifiers $\{h_1, h_2, \ldots, h_T\}$ can be decomposed as the union of $T$ families $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_T$ ordered by increasing complexity, where $\forall t \ge 1, \mathcal{H}_t \subset \mathcal{H}_{t+1}$ and $h_t \in \mathcal{H}_t$. Remarkably, the complexity term of our bound admits an explicit dependency in terms of the mixture coefficients defining the ensembles. Thus, the ensemble 
Thus, the ensemble\nt=1 Ht), which is the family of functions F (x) of the form\n\nfamily we consider is F = conv((cid:83)T\nF (x) =(cid:80)T\n\nt=1 \u03b1tht(x), where \u03b1 = (\u03b11, . . . , \u03b1T ) is in the simplex \u2206.\n\nFor a \ufb01xed g = (g1, . . . , gT ), any \u03b1 \u2208 \u2206 de\ufb01nes a distribution over {g1, . . . , gT}. Sampling\nfrom {g1, . . . , gT} according to \u03b1 and averaging leads to functions G = 1\ni=1 ntgt for some\n2In the binary classi\ufb01cation, we can rede\ufb01ne the output of the strong classi\ufb01er F (x) as a variable in [\u22121, 1],\ne.g. the difference between two prediction scores, where \u02dcF (x) = sign(F (x)) is the predicted label. The\nprevious bounds [8, 9, 17] are based on binary classi\ufb01cation, therefore, our result is comparable with them.\n\n(cid:80)T\n\nn\n\n4\n\n\fn\n\ngk,j\n\nT(cid:88)\n\nGF ,N =\n\nconsider the family of functions\n\nt=1 nt = n, and gt \u2208 Ht. For any N = (N1, . . . , NT ) with |N| = n, we\n\nn = (n1, . . . , nT ), with(cid:80)T\n\uf8f1\uf8f2\uf8f3 1\nNk(cid:88)\nand the union of all such families GF ,n =(cid:83)|N=n| GF ,N. For a \ufb01xed N, the size of GF ,N can be\nbounded as follows: ln|GF ,N| \u2264 n ln(cid:80)T\nLemma 1. ([16]) For F =(cid:80)T\n\n(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)\u2200(k, j) \u2208 [T ] \u00d7 [Nk], gk,j \u2208 Hk\n\nbound based on the margin mean and a Bernstein-type bound follows:\nt=1 \u03b1tht \u2208 F and G \u2208 GF ,n, we have\n\u2212n\u00012\n\nt=1 \u03b1t|Ht|. Our margin distribution theory is based on a\n\n(10)\nLemma 2. ([16]) For independent random variables X1, X2, . . . 
, $X_m$ ($m \ge 5$) with values in $[0, 1]$, and for $\delta \in (0, 1)$, with probability at least $1 - \delta$ we have\n\n$$\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}[X_i] - \frac{1}{m}\sum_{i=1}^{m} X_i \le \sqrt{\frac{2\hat{V}_m\ln(2/\delta)}{m}} + \frac{7\ln(2/\delta)}{3m}, \quad (11)$$\n\nwhere $\hat{V}_m = \sum_{i \ne j}(X_i - X_j)^2/2m(m-1)$.\n\nSince the gap between the margin of the strong classifier $yF(x)$ and that in the union family $\mathcal{G}_{\mathcal{F},N}$ is bounded by a function related to the margin mean $\mathbb{E}_S[yF(x)]$, we can further obtain a margin distribution theorem as follows:\n\nTheorem 1. Let $D$ be a distribution over $\mathcal{X} \times \mathcal{Y}$ and $S$ be a training set of $m$ samples drawn from $D$. With probability at least $1 - \delta$, for $r > 0$, the strong classifier $F(x)$ (the T-layer casForest model) satisfies\n\n$$\Pr_D[yF(x) < 0] \le \inf_{r \in (0,1]}\Big[\Pr_S[yF(x) < r] + \frac{1}{m^d} + \lambda\sqrt{\frac{3\mu}{m}} + \frac{7\mu}{3m} + \frac{\sqrt{3}\mu}{m^{3/2}}\Big],$$\n\nwhere $d = \frac{2}{1 - \mathbb{E}_S^2[yF(x)] + r/9} > 2$, $\mu = \ln m\,\ln(2\sum_{t=1}^{T}\alpha_t|\mathcal{H}_t|)/r^2 + \ln\frac{2}{\delta}$, and $\lambda = \sqrt{\mathrm{Var}[yF(x)]/\mathbb{E}_S^2[yF(x)]}$.\n\nProof. For $F = \sum_{t=1}^{T}\alpha_t h_t \in \mathcal{F}$ and $G \in \mathcal{G}_{\mathcal{F},n}$, we have $\mathbb{E}_{G \in \mathcal{G}_{\mathcal{F},n}}[G] = F$. For $\beta > 0$, Chernoff's bound gives\n\n$$\Pr_D[yF(x) < 0] = \Pr_{D,\mathcal{G}_{\mathcal{F},n}}[yF(x) < 0, yG(x) \ge \beta] + \Pr_{D,\mathcal{G}_{\mathcal{F},n}}[yF(x) < 0, yG(x) < \beta] \le \exp(-n\beta^2/2) + \Pr_{D,\mathcal{G}_{\mathcal{F},n}}[yG(x) < \beta]. \quad (12)$$\n\nRecall that $|\mathcal{G}_{\mathcal{F},N}| \le \prod_{t=1}^{T}|\mathcal{H}_t|^{N_t}$ for a fixed $N$. 
Therefore, for any $\delta_n > 0$, combining the union bound with Lemma 2 guarantees that, with probability at least $1 - \delta_n$ over the sample $S$, for any $G \in \mathcal{G}_{\mathcal{F},N}$ and $\beta > 0$,\n\n$$\Pr_D[yG(x) < \beta] \le \Pr_S[yG(x) < \beta] + \sqrt{\frac{2}{m}\hat{V}_m\ln\Big(\frac{2}{\delta}\prod_{t=1}^{T}|\mathcal{H}_t|^{N_t}\Big)} + \frac{7}{3m}\ln\Big(\frac{2}{\delta}\prod_{t=1}^{T}|\mathcal{H}_t|^{N_t}\Big) \quad (13)$$\n\n$$\le \Pr_S[yG(x) < \beta] + \sqrt{\frac{2n}{m}\hat{V}_m\sum_{t=1}^{T}\alpha_t\ln\Big(\frac{2|\mathcal{H}_t|}{\delta}\Big)} + \frac{7n}{3m}\sum_{t=1}^{T}\alpha_t\ln\Big(\frac{2|\mathcal{H}_t|}{\delta}\Big) \quad (14)$$\n\n
Since there are T at most T n possible T -tuples N with |N| = n,\nby the union bound, for any \u03b4 > 0, with probability at least 1 \u2212 \u03b4, for all G \u2208 GF ,n and \u03b2 > 0:\ni=1 \u03b1t|Ht|\n\u03b4/T n\n\n[yG(x) < \u03b2] \u2264 Pr\n\ni=1 \u03b1t|Ht|\n\u03b4/T n\n\n(cid:118)(cid:117)(cid:117)(cid:116) 2n\n\n2(cid:80)T\n\n[yG(x) < \u03b2] +\n\n(cid:32)\n\n2(cid:80)T\n\n\u02c6Vm ln\n\n+\n\n7n\n3m\n\nln\n\n(cid:32)\n\n(cid:33)\n\n(cid:33)\n\nPr\nD\n\nm\n\nS\n\nMeantime, we can rewrite \u02c6Vm\n\n\u02c6Vm =\n\n=\n\n=\n\n(cid:88)\n\n(I[yiG(xi) < \u03b2] \u2212 I[yjG(xj) < \u03b2])2\n\n2m(m \u2212 1)\n\ni(cid:54)=j\n2m2 PrS[yG(x) < \u03b2] PrS[yG(x) \u2265 \u03b2]\n\n2m(m \u2212 1)\n\nFor any \u03b81, \u03b82 > 0, we utilize Chernoff\u2019s bound to get:\n\nm\nm \u2212 1\n\n\u2217\n\u02c6V\nm\n\n[yG(x) \u2265 \u03b2]\n\n\u2217\n\u02c6V\nm = Pr\nS\n\n[yG(x) < \u03b2] Pr\nS\n\n\u2264 3 exp(\u2212n\u03b82\n\u2264 3 exp(\u2212n\u03b82\n\n1/2) + Pr\nS\n1/2) + Pr\nS\n\n[yF (x) < \u03b2 + \u03b81] Pr\nS\n[yF (x) < \u03b2 + \u03b81 |ES[yF (x)] \u2265 \u03b2 + \u03b81 + \u03b82 ]\n\n[yF (x) \u2265 \u03b2 \u2212 \u03b81]\n\n(17)\n\n(18)\n\n(19)\n\n(20)\n\n(21)\n\n(22)\n\n(23)\n\n(cid:19)\n\n(25)\n\n\u00b7 Pr\n\n[yF (x) \u2265 \u03b2 \u2212 \u03b81|ES[yF (x)] \u2265 \u03b2 + \u03b81 + \u03b82]\n\nS\n\n\u2264 3 exp(\u2212n\u03b82\n\n1/2) +\n\n\u2264 3 exp(\u2212n\u03b82\n\n1/2) +\n\nVar[yF (x)]\n\n\u03b82\n2\nVar[yF (x)]\n\nAccording to Chebyshev\u2019s Inequality\n\n(ES[yF (x)] \u2212 \u03b2 + \u03b81)2 (cid:39) 3 exp(\u2212n\u03b82\n\n1/2) +\n\nVar[yF (x)]\nE2\nS[yF (x)]\n\n(24)\n\nS[yF (x)] is the variance of the margins.\n\nwhere Var[yF (x)] = ES[(yF (x))2] \u2212 E2\nFrom Lemma 1, we obtain that\n[yG(x) < \u03b2] \u2264 Pr\n\nPr\nS\n\n(cid:18)\n\n[yF (x) < \u03b2 + \u03b81] + exp\n\nS\n\n\u2212n\u03b82\n\n1\n\n2 \u2212 2E2\n\nS[yF (x)] + 4\u03b81/3\n\nLet \u03b81 = r/6, \u03b2 = 5r/6 and n = ln m/r2, we combine (12)(15)(24)(25), the proof is 
completed.\n\nRemark 1. From Theorem 1, we know that the gap between the generalization error and the empirical margin loss is generally bounded by the term $\mathcal{O}(\lambda\sqrt{\ln m/m} + \ln m/m)$, which is controlled by the ratio $\lambda$ relating the margin standard deviation to the margin mean. This ratio implies that a larger margin mean and a smaller margin variance can properly reduce the generalization error of models, which is crucial to alleviating the overfitting problem. When the margin distribution is good enough (the margin mean is large and the margin variance is small), $\mathcal{O}(\ln m/m)$ will dominate the sample complexity. This bound is then tighter than the $\mathcal{O}(\sqrt{\ln m/m})$ rate demonstrated in previous theoretical works on Deep Boosting [8, 9, 17].\n\nRemark 2. As for the overfitting risk of the model (due to its large complexity), our bound inherits the result of Cortes et al. [8]: the cardinality of the hypothesis set $\mathcal{F} = \mathrm{conv}(\bigcup_{t=1}^{T}\mathcal{H}_t)$ is controlled by the mixture coefficients $\alpha_t$ in (1). The term $\sum_{t=1}^{T}\alpha_t|\mathcal{H}_t|$ in our bound implies that a hypothesis set of large complexity is not detrimental to generalization as long as the corresponding mixture weight is relatively small. In other words, the coefficients $\alpha_t$ need to minimize the expected margin distribution loss $\mathbb{E}_S[\ell_{md}(\sum_{l=1}^{t}\alpha_l\gamma_l(x))]$, which estimates the generalization error of the additive casForest model.\n\nAlgorithm 2 mdDF (margin distribution Deep Forest)\nInput: Training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ and the random forests block algorithm $A_{rfb}$.\nOutput: The final additive cascade model $\tilde{F}$.\n1: Initialize $\alpha_0 \leftarrow 1$, $f_0 \leftarrow \emptyset$.\n2: Initialize sample weights: $D_1(i) \leftarrow \frac{1}{m}, \forall i \in [m]$.\n3: for $t = 1, 2, \ldots, T$ do\n4: $\phi_t \leftarrow$ the random forests block returned by $A_{rfb}([x_i, f_{t-1}(x_i); y_i]_{i=1}^{m}, D_t)$.\n5: $h_t(x_i) \leftarrow \phi_t([x_i, f_{t-1}(x_i)]), \forall i \in [m]$.\n6: $\gamma_t(x_i) \leftarrow h_t^{y}(x_i) - \max_{j \ne y} h_t^{j}(x_i), \forall i \in [m]$.\n7: $\alpha_t \leftarrow \arg\min_{\alpha_t}\mathbb{E}_S[\ell_{md}(\sum_{l=1}^{t}\alpha_l\gamma_l(x))]$.\n8: $f_t(x_i) \leftarrow \alpha_t h_t(x_i) + f_{t-1}(x_i), \forall i \in [m]$.\n9: $D_{t+1}(i) \leftarrow \ell_{md}(\sum_{l=1}^{t}\alpha_l\gamma_l(x_i))/\sum_{i=1}^{m}\ell_{md}(\sum_{l=1}^{t}\alpha_l\gamma_l(x_i)), \forall i \in [m]$.\n10: end for\n11: return $\tilde{F} \leftarrow \arg\max_{j \in \{1,2,\ldots,s\}}[\sum_{t=1}^{T}\alpha_t h_t^{j}]$.\n\n4 Optimization\n\nThe generalization analysis shows the importance of optimizing the margin ratio $\lambda$ and the mixture coefficients $\alpha_t$. Since we formulate casForest as an additive model, we utilize a reweighting approach to minimize the expected margin distribution loss\n\n$$\mathbb{E}_S\Big[\ell_{md}\Big(\sum_{l=1}^{t}\alpha_l\gamma_l(x)\Big)\Big], \quad (26)$$\n\nwhere the margin distribution loss function $\ell_{md}$ is designed to utilize the first- and second-order statistics of margins, and $\gamma_l(x)$ denotes the margin in the $l$-th layer. The scalar $\alpha_t$ is determined by minimizing the expected loss for the $t$-layer model.\n\nThe mdDF algorithm (Algorithm 2). We denote the prediction score space by $\mathcal{C} = \mathbb{R}^s$, where $s$ is the number of classes. When a sample passes through the forest model, we get an average prediction vector in each layer, $h_t(\cdot) = [h_t^1(\cdot), h_t^2(\cdot), \ldots, h_t^s(\cdot)] \in \mathcal{C}$, i.e., the confidence of prediction. According to Crammer & Singer [10], we can define the margin $\gamma_t(\cdot)$ of a sample for multi-class classification as $\gamma_t(\cdot) := h_t^{y}(\cdot) - \max_{j \ne y} h_t^{j}(\cdot)$.\nThe initial sample weights are $[1/m, 1/m, \ldots, 1/m]$, and we update the $i$-th weight by\n\n$$D_{t+1}(i) = \frac{\ell_{md}(\sum_{l=1}^{t}\alpha_l\gamma_l(x_i))}{\sum_{i=1}^{m}\ell_{md}(\sum_{l=1}^{t}\alpha_l\gamma_l(x_i))}, \quad (27)$$\n\nwhere the margin distribution loss function $\ell_{md}(\cdot)$ is defined by Zhang & Zhou [28] to optimize the first- and second-order statistics of margins as follows:\n\n$$\ell_{md}(z) = \begin{cases}\frac{(z-\gamma)^2}{\gamma^2} & z \le \gamma, \\ \frac{\mu(z-\gamma)^2}{(1-\gamma)^2} & z > \gamma,\end{cases} \quad (28)$$\n\nwhere the hyper-parameter $\gamma$ serves as the margin mean and $\mu$ trades off the two kinds of deviation (keeping the balance on both sides of the margin mean). Obviously, this margin distribution loss function enforces the band with lower loss to contain as many sample points as possible. In practice, we generally choose these two hyper-parameters from the finite sets $\gamma \in \{0.7, 0.75, 0.8, 0.85, 0.9, 0.95\}$ and $\mu \in \{0.01, 0.05, 0.1\}$. The algorithm utilizing the margin distribution optimization is summarized in Algorithm 2.\n\n5 Experiments\n\nDatasets and configuration. We choose eight classification benchmark datasets of different scales. The datasets vary in size: from 1484 up to 78823 instances, from 8 up to 784 features, and from 2 up to 26 classes. In the literature, these datasets come pre-divided into training and testing sets; therefore, in our experiments we use them in their original format. The PROTEIN, SENSIT, and SATIMAGE datasets are obtained from the LIBSVM datasets [4]. Except for the MNIST dataset [18], the others come from the UCI Machine Learning Repository [11]. Based on the attribute characteristics of the datasets, we classify them into three categories: categorical, numerical, and mixed modeling tasks. 
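For concreteness, the margin distribution loss (28) and the reweighting rule (27) that mdDF applies layer by layer can be sketched in NumPy (an illustrative sketch, not the authors' code; the default hyper-parameter values are one choice from the grids given in Section 4):

```python
import numpy as np

def l_md(z, gamma=0.8, mu=0.05):
    """Margin distribution loss of Zhang & Zhou [28]: squared deviation from
    the target margin mean gamma, with the deviation above gamma discounted
    by mu to balance the two sides of the margin mean."""
    z = np.asarray(z, dtype=float)
    below = (z - gamma) ** 2 / gamma ** 2
    above = mu * (z - gamma) ** 2 / (1.0 - gamma) ** 2
    return np.where(z <= gamma, below, above)

def reweight(margins):
    """Update rule (27): sample weights proportional to the loss of the
    accumulated margin sum_l alpha_l * gamma_l(x_i)."""
    losses = l_md(margins)
    return losses / losses.sum()
```

Samples whose accumulated margin sits far below the target mean receive a large loss, and therefore a large weight in the next layer's reweighted distribution.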
We conjecture that some numerical modeling tasks, such as image or audio recognition, are very suitable for DNNs: some operations, such as convolution, fit well with numerical signal modeling. The deep forest model is not developed to replace DNNs for such tasks; instead, it offers an alternative when DNNs are not superior, e.g., deep forests are especially good at categorical/symbolic or mixed modeling tasks [33].
In mdDF, we take two random forests and two completely-random forests in each layer, and each forest contains 100 trees. The maximum depth of the trees in the random forests grows with the layer index, i.e., $d^{(t)}_{\max} \in \{2t + 2, 4t + 4, 8t + 8, 16t + 16\}$. To reduce the risk of overfitting, the representation learned by each forest is generated by k-fold cross-validation (k = 5 in our experiments). In Algorithm 1, each instance is used as training data (k − 1) times, resulting in (k − 1) class vectors, which are averaged to produce the final class vector passed to the next layer as augmented features.
We compare mdDF with four other commonly used algorithms on different datasets: multilayer perceptron (MLP), random forest (RF) [3], XGBoost [5], and gcForest [32]. Here, we set the same number of forests as mdDF in each layer of gcForest. For random forests, we set 400 × k trees; for XGBoost, we also take 400 × k trees. We set the other hyper-parameters to their default values. For the MLP configurations, we use ReLU as the activation function, cross-entropy as the loss function, AdaDelta for optimization, and no dropout for the hidden layers, according to the scale of the training data. The network structure hyper-parameters, however, could not be fixed across tasks.
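The mdDF cascade-layer configuration described above (two random forests plus two completely-random forests per layer, a depth schedule growing with the layer index, and k-fold cross-validated class vectors) can be sketched with scikit-learn. This is a simplified sketch under our own naming, not the paper's code: completely-random forests are approximated by ExtraTreesClassifier with max_features=1, and only the 4t + 4 depth schedule is shown.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def layer_augmented_features(X, y, t, depth_base=4, n_trees=100, k=5):
    """Augmented features produced by one cascade layer: two random forests
    and two completely-random forests (approximated by extremely randomized
    trees splitting on a single random feature), with the maximum tree depth
    growing linearly in the layer index t (here the 4t + 4 schedule).
    Each forest's class-probability vector is obtained out-of-fold via
    k-fold cross-validation to reduce the risk of overfitting, then the
    four vectors are concatenated."""
    max_depth = depth_base * t + depth_base
    forests = [
        RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth, random_state=0),
        RandomForestClassifier(n_estimators=n_trees, max_depth=max_depth, random_state=1),
        ExtraTreesClassifier(n_estimators=n_trees, max_depth=max_depth,
                             max_features=1, random_state=2),
        ExtraTreesClassifier(n_estimators=n_trees, max_depth=max_depth,
                             max_features=1, random_state=3),
    ]
    probas = [cross_val_predict(f, X, y, cv=k, method="predict_proba")
              for f in forests]
    return np.hstack(probas)  # shape: (m, 4 * n_classes)
```

In the full cascade these probability features would be concatenated with the original input vector (the PRECONC operation) before being fed to the next layer.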
Therefore, for MLP, we examine a variety of architectures on the validation set, pick the one with the best performance, then train the whole network again on the training set and report the test accuracy. The examined architectures are listed as follows: (1) input-1024-512-output; (2) input-16-8-8-output; (3) input-70-50-output; (4) input-50-30-output; (5) input-30-20-output.

Test accuracy on benchmark datasets. Table 1 shows that mdDF achieves better accuracy than the other methods on several datasets. Compared with the MLP method, the deep forest models almost always outperform on these datasets and obtain the top-2 test accuracy on the categorical or mixed modeling tasks. Obviously, gcForest and mdDF perform better than the shallow models, and mdDF, with reweighting and boosted representations, outperforms gcForest across these datasets. The empirical results show that the deep models provide an improvement in performance via in-model transformation, compared to the shallow models that only have invariant features.

Table 1: Left: Comparison results between mdDF and the other tree-based algorithms on test accuracy with different datasets. The best accuracy on each dataset (the mdDF column) is highlighted. • indicates the second best accuracy on each dataset. The average rank is listed at the bottom. Right: Comparison results between the standard mdDF structure and the other mdDF structures.

Dataset     Attribute    MLP       RF       XGBoost   gcForest   mdDF    | mdDFSF   mdDFST   mdDFNP
ADULT       Categorical  80.597    85.818   85.904    86.276 •   86.560  | 85.650   86.200   85.710
YEAST       Categorical  59.641    61.886   59.161    63.004 •   63.340  | 62.556   63.000   62.780
LETTER      Categorical  96.025    96.575   95.850    97.375 •   97.500  | 96.975   96.475   97.300
PROTEIN     Categorical  68.660    68.071   71.214 •  71.009     71.247  | 68.509   71.127   70.291
HAR         Mixed        94.231 •  92.569   93.112    94.224     94.600  | 94.060   93.926   94.290
SENSIT      Mixed        78.957    80.133   81.874    82.334 •   82.534  | 80.320   82.014   80.412
SATIMAGE    Numerical    91.125    91.200   90.450    91.700 •   91.750  | 90.800   91.600   91.300
MNIST       Numerical    98.621 •  96.831   97.730    98.252     98.734  | 98.240   98.254   98.101
Avg. Rank   -            3.650     4.000    3.750     2.375      1.000   | -        -        -

Figure 2: The relation between the margin ratio and learning ability in the different layers. (a) The accuracy (solid line) and the margin ratio (dotted line) of the mdDF algorithm at different layers on the HAR dataset. (b) The multi-layer feature visualization of the mdDF algorithm on the HAR training set. The ratios of the intra-class variance to the inter-class variance SA/SE are (3.88, 1.97).

Comparison with the other mdDF structures. In Table 1, we compare our mdDF structure with three other mdDF structures on different datasets: (1) mdDF using the same forests (4 random forests), named mdDFSF; (2) mdDF using stacking (only transmitting the prediction vectors to the next layer), named mdDFST; (3) mdDF without PRECONC (only transmitting the input feature vector to the next layer), named mdDFNP. In this way, we explore the importance of the internal structures of mdDF.
When we remove a specific structure while controlling the other variables, the performance of the mdDF algorithm becomes worse. The empirical results demonstrate the effectiveness of these specific structures.

Relation between the margin ratio and learning ability. Figure 2(a) plots the accuracy and margin ratio of mdDF on the HAR dataset. It demonstrates clearly that the performance is consistent with the margin ratio: when the margin ratio, i.e., the margin std/mean, is smaller, the performance is better. Figure 2(b) plots the t-SNE visualization of mdDF on the HAR dataset, together with a variance decomposition in the 2D space. The result shows that the intra-class compactness and inter-class separability improve as the layers become deeper. Such a correlation validates the theoretical result of our refined margin distribution analysis.

6 Conclusion

Recent studies propose several tree-based deep models that learn representations for a broad range of tasks and achieve good performance. By formulating casForest as an additive model, we partially explain its success from the perspective of the margin theory. The theoretical results inspire us to design a margin distribution reweighting approach that improves the generalization performance, and the empirical studies validate our theoretical results. We will explore how to understand the effectiveness of the PRECONC operation (which is crucial for feature enrichment) in future work.

Acknowledgments

This research was supported by the NSFC (61751306), the National Key R&D Program of China (2018YFB1004300), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.
The authors would like to thank the anonymous reviewers for constructive suggestions, as well as Wei Gao, Lijun Zhang, Shengjun Huang, Xizhu Wu, Lu Wang, Peng Zhao, Ming Pang and Kangle Zhao for helpful discussions.

References

[1] Breiman, L. Stacked regressions. Machine Learning, 24(1):49–64, 1996.

[2] Breiman, L. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1517, 1999.

[3] Breiman, L. Random forests. Machine Learning, 45(1):5–32, 2001.

[4] Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.

[5] Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.

[6] Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23(4):493–507, 1952.

[7] Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[8] Cortes, C., Mohri, M., and Syed, U. Deep boosting. In Proceedings of the 31st International Conference on Machine Learning, pp. 1179–1187, 2014.

[9] Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 874–883, 2017.

[10] Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

[11] Dheeru, D. and Karra Taniskidou, E.
UCI machine learning repository, 2017.

[12] Elsayed, G., Krishnan, D., Mobahi, H., Regan, K., and Bengio, S. Large margin deep networks for classification. In Advances in Neural Information Processing Systems 31, pp. 850–860, 2018.

[13] Feng, J. and Zhou, Z.-H. AutoEncoder by forest. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 2967–2973, 2018.

[14] Feng, J., Yu, Y., and Zhou, Z.-H. Multi-layered gradient boosting decision trees. In Advances in Neural Information Processing Systems 31, pp. 3555–3565, 2018.

[15] Friedman, J. H. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pp. 1189–1232, 2001.

[16] Gao, W. and Zhou, Z.-H. On the doubt about margin explanation of boosting. Artificial Intelligence, 203:1–18, 2013.

[17] Huang, F., Ash, J., Langford, J., and Schapire, R. Learning deep ResNet blocks sequentially using boosting theory. In Proceedings of the 35th International Conference on Machine Learning, pp. 2058–2067, 2018.

[18] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[19] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015.

[20] Lyu, S.-H., Wang, L., and Zhou, Z.-H. Optimal margin distribution network. CoRR, abs/1812.10761, 2018.

[21] Pang, M., Ting, K.-M., Zhao, P., and Zhou, Z.-H. Improving deep forest by confidence screening. In Proceedings of the 18th IEEE International Conference on Data Mining, pp. 1194–1199, 2018.

[22] Reyzin, L. and Schapire, R. E. How boosting the margin can also boost classifier complexity. In Proceedings of the 23rd International Conference on Machine Learning, pp. 753–760, 2006.

[23] Schapire, R. E. and Freund, Y. Boosting: Foundations and Algorithms. MIT Press, 2012.

[24] Schapire, R.
E., Freund, Y., Bartlett, P., and Lee, W. S. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686, 1998.

[25] Utkin, L. V. and Ryabinin, M. A. A Siamese deep forest. Knowledge-Based Systems, 139:13–22, 2018.

[26] van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

[27] Wolpert, D. H. Stacked generalization. Neural Networks, 5(2):241–259, 1992.

[28] Zhang, T. and Zhou, Z.-H. Multi-class optimal margin distribution machine. In Proceedings of the 34th International Conference on Machine Learning, pp. 4063–4071, 2017.

[29] Zhang, T. and Zhou, Z.-H. Optimal margin distribution machine. IEEE Transactions on Knowledge and Data Engineering, 2019. doi: 10.1109/TKDE.2019.2897662.

[30] Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms. CRC Press, 2012.

[31] Zhou, Z.-H. Large margin distribution learning. In Proceedings of the 6th IAPR International Workshop on Artificial Neural Networks in Pattern Recognition, pp. 1–11, 2014.

[32] Zhou, Z.-H. and Feng, J. Deep forest: Towards an alternative to deep neural networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 3553–3559, 2017.

[33] Zhou, Z.-H. and Feng, J. Deep forest. National Science Review, 6(1):74–86, 2019.