{"title": "CatBoost: unbiased boosting with categorical features", "book": "Advances in Neural Information Processing Systems", "page_first": 6638, "page_last": 6648, "abstract": "This paper presents the key algorithmic techniques behind CatBoost, a new gradient boosting toolkit. Their combination leads to CatBoost outperforming other publicly available boosting implementations in terms of quality on a variety of datasets. Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms. In this paper, we provide a detailed analysis of this problem and demonstrate that proposed algorithms solve it effectively, leading to excellent empirical results.", "full_text": "CatBoost: unbiased boosting with categorical features

Liudmila Prokhorenkova1,2, Gleb Gusev1,2, Aleksandr Vorobev1, Anna Veronika Dorogush1, Andrey Gulin1

1Yandex, Moscow, Russia
2Moscow Institute of Physics and Technology, Dolgoprudny, Russia

{ostroumova-la, gleb57, alvor88, annaveronika, gulin}@yandex-team.ru

Abstract

This paper presents the key algorithmic techniques behind CatBoost, a new gradient boosting toolkit. Their combination leads to CatBoost outperforming other publicly available boosting implementations in terms of quality on a variety of datasets. Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. 
Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms. In this paper, we provide a detailed analysis of this problem and demonstrate that the proposed algorithms solve it effectively, leading to excellent empirical results.

1 Introduction

Gradient boosting is a powerful machine-learning technique that achieves state-of-the-art results in a variety of practical tasks. For many years, it has remained the primary method for learning problems with heterogeneous features, noisy data, and complex dependencies: web search, recommendation systems, weather forecasting, and many others [5, 26, 29, 32]. Gradient boosting is essentially a process of constructing an ensemble predictor by performing gradient descent in a functional space. It is backed by solid theoretical results that explain how strong predictors can be built by iteratively combining weaker models (base predictors) in a greedy manner [17].

We show in this paper that all existing implementations of gradient boosting face the following statistical issue. A prediction model F obtained after several steps of boosting relies on the targets of all training examples. We demonstrate that this actually leads to a shift of the distribution of F(x_k) | x_k for a training example x_k from the distribution of F(x) | x for a test example x. This finally leads to a prediction shift of the learned model. We identify this problem as a special kind of target leakage in Section 4. Further, there is a similar issue in standard algorithms for preprocessing categorical features. One of the most effective ways [6, 25] to use them in gradient boosting is converting categories to their target statistics. A target statistic is a simple statistical model itself, and it can also cause target leakage and a prediction shift. 
We analyze this in Section 3.

In this paper, we propose the ordering principle to solve both problems. Relying on it, we derive ordered boosting, a modification of the standard gradient boosting algorithm that avoids target leakage (Section 4), and a new algorithm for processing categorical features (Section 3). Their combination is implemented as an open-source library1 called CatBoost (for "Categorical Boosting"), which outperforms the existing state-of-the-art implementations of gradient boosted decision trees — XGBoost [8] and LightGBM [16] — on a diverse set of popular machine learning tasks (see Section 6).

1https://github.com/catboost/catboost

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Background

Assume we observe a dataset of examples D = {(x_k, y_k)}_{k=1..n}, where x_k = (x^1_k, . . . , x^m_k) is a random vector of m features and y_k ∈ R is a target, which can be either a binary or a numerical response. Examples (x_k, y_k) are independent and identically distributed according to some unknown distribution P(·, ·). The goal of a learning task is to train a function F : R^m → R which minimizes the expected loss L(F) := E L(y, F(x)). Here L(·, ·) is a smooth loss function and (x, y) is a test example sampled from P independently of the training set D.

A gradient boosting procedure [12] iteratively builds a sequence of approximations F^t : R^m → R, t = 0, 1, . . ., in a greedy fashion. Namely, F^t is obtained from the previous approximation F^{t−1} in an additive manner: F^t = F^{t−1} + α h^t, where α is a step size and the function h^t : R^m → R (a base predictor) is chosen from a family of functions H in order to minimize the expected loss:

h^t = argmin_{h ∈ H} L(F^{t−1} + h) = argmin_{h ∈ H} E L(y, F^{t−1}(x) + h(x)).   (1)

The minimization problem is usually approached by the Newton method, using a second-order approximation of L(F^{t−1} + h^t) at F^{t−1}, or by taking a (negative) gradient step. Both methods are kinds of functional gradient descent [10, 24]. In particular, the gradient step h^t is chosen in such a way that h^t(x) approximates −g^t(x, y), where g^t(x, y) := ∂L(y, s)/∂s |_{s = F^{t−1}(x)}. Usually, the least-squares approximation is used:

h^t = argmin_{h ∈ H} E(−g^t(x, y) − h(x))^2.   (2)

CatBoost is an implementation of gradient boosting which uses binary decision trees as base predictors. A decision tree [4, 10, 27] is a model built by a recursive partition of the feature space R^m into several disjoint regions (tree nodes) according to the values of some splitting attributes a. Attributes are usually binary variables identifying that some feature x^k exceeds some threshold t, that is, a = 1_{x^k > t}, where x^k is either a numerical or a binary feature; in the latter case t = 0.5.2 Each final region (leaf of the tree) is assigned a value, which is an estimate of the response y in the region for a regression task or the predicted class label in the case of a classification problem.3 In this way, a decision tree h can be written as

h(x) = Σ_{j=1}^J b_j 1_{x ∈ R_j},   (3)

where R_j are the disjoint regions corresponding to the leaves of the tree.

3 Categorical features

3.1 Related work on categorical features

A categorical feature is one with a discrete set of values called categories that are not comparable to each other. One popular technique for dealing with categorical features in boosted trees is one-hot encoding [7, 25], i.e., for each category, adding a new binary feature indicating it. 
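As a concrete illustration, one-hot encoding can be sketched in a few lines (a toy helper of our own, not part of any library):

```python
def one_hot(values):
    """Replace a categorical column by one binary indicator column per category."""
    categories = sorted(set(values))
    index = {c: j for j, c in enumerate(categories)}
    # Each row becomes an indicator vector over the category vocabulary.
    rows = [[1 if index[v] == j else 0 for j in range(len(categories))]
            for v in values]
    return rows, categories
```

For example, `one_hot(["a", "b", "a"])` returns the rows [[1, 0], [0, 1], [1, 0]] over the categories ["a", "b"]; the number of new columns equals the number of distinct categories.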
However, in the case of high-cardinality features (like, e.g., a "user ID" feature), such a technique leads to an infeasibly large number of new features. To address this issue, one can group categories into a limited number of clusters and then apply one-hot encoding. A popular method is to group categories by target statistics (TS) that estimate the expected target value in each category. Micci-Barreca [25] proposed to consider TS as a new numerical feature instead. Importantly, among all possible partitions of categories into two sets, an optimal split on the training data in terms of logloss, Gini index, or MSE can be found among thresholds for the numerical TS feature [4, Section 4.2.2] [11, Section 9.2.4].

2Alternatively, non-binary splits can be used, e.g., a region can be split according to all values of a categorical feature. However, such splits, compared to binary ones, would lead either to shallow trees (unable to capture complex dependencies) or to very complex trees with an exponential number of terminal nodes (having weaker target statistics in each of them). According to [4], the tree complexity has a crucial effect on the accuracy of the model, and less complex trees are less prone to overfitting.

3In a regression task, splitting attributes and leaf values are usually chosen by the least-squares criterion. Note that, in gradient boosting, a tree is constructed to approximate the negative gradient (see Equation (2)), so it solves a regression problem.

In LightGBM [20], categorical features are converted to gradient statistics at each step of gradient boosting. Though providing important information for building a tree, this approach can dramatically increase (i) computation time, since it calculates statistics for each categorical value at each step, and (ii) memory consumption, to store which category belongs to which node for each split based on a categorical feature. 
To overcome this issue, LightGBM groups tail categories into one cluster [21] and thus loses part of the information. Besides, the authors claim that it is still better to convert categorical features with high cardinality to numerical features [19]. Note that TS features require calculating and storing only one number per category.

Thus, using TS as new numerical features seems to be the most efficient method of handling categorical features with minimum information loss. TS are widely used, e.g., in the click prediction task (click-through rates) [1, 15, 18, 22], where such categorical features as user, region, ad, and publisher play a crucial role. We further focus on ways to calculate TS and leave one-hot encoding and gradient statistics out of the scope of the current paper. At the same time, we believe that the ordering principle proposed in this paper is also effective for gradient statistics.

3.2 Target statistics

As discussed in Section 3.1, an effective and efficient way to deal with a categorical feature i is to substitute the category x^i_k of the k-th training example with one numeric feature equal to some target statistic (TS) x̂^i_k. Commonly, it estimates the expected target y conditioned on the category: x̂^i_k ≈ E(y | x^i = x^i_k).

Greedy TS  A straightforward approach is to estimate E(y | x^i = x^i_k) as the average value of y over the training examples with the same category x^i_k [25]. This estimate is noisy for low-frequency categories, and one usually smoothes it by some prior p:

x̂^i_k = (Σ_{j=1}^n 1_{x^i_j = x^i_k} · y_j + a p) / (Σ_{j=1}^n 1_{x^i_j = x^i_k} + a),   (4)

where a > 0 is a parameter. A common setting for p is the average target value in the dataset [25]. The problem of such a greedy approach is target leakage: the feature x̂^i_k is computed using y_k, the target of x_k. 
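To make the leakage concrete, Equation (4) can be sketched as follows (a minimal illustration of our own; the function name and defaults are not CatBoost's API):

```python
def greedy_ts(categories, targets, a=1.0):
    """Greedy target statistic (Eq. 4): smoothed mean target per category.
    Note: each example's own target y_k contributes to its own feature value,
    which is exactly the source of target leakage discussed in the text."""
    p = sum(targets) / len(targets)          # prior: average target in the dataset
    totals, counts = {}, {}
    for c, y in zip(categories, targets):
        totals[c] = totals.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    return [(totals[c] + a * p) / (counts[c] + a) for c in categories]
```

On `greedy_ts(["A", "A", "B"], [1, 0, 1])`, the prior is p = 2/3 and the two "A" examples both receive (1 + 2/3)/3 = 5/9, while "B" receives (1 + 2/3)/2 = 5/6.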
This leads to a conditional shift [30]: the distribution of x̂^i | y differs for training and test examples. The following extreme example illustrates how dramatically this may affect the generalization error of the learned model. Assume the i-th feature is categorical, all its values are unique, and for each category A, we have P(y = 1 | x^i = A) = 0.5 for a classification task. Then, on the training dataset, x̂^i_k = (y_k + a p)/(1 + a), so it is sufficient to make only one split with threshold t = (0.5 + a p)/(1 + a) to perfectly classify all training examples. However, for all test examples, the value of the greedy TS is p, and the obtained model predicts 0 for all of them if p < t and predicts 1 otherwise, thus having accuracy 0.5 in both cases. To this end, we formulate the following desired property for TS:

P1  E(x̂^i | y = v) = E(x̂^i_k | y_k = v), where (x_k, y_k) is the k-th training example.

In our example above, E(x̂^i_k | y_k) = (y_k + a p)/(1 + a) and E(x̂^i | y) = p are different.

There are several ways to avoid this conditional shift. Their general idea is to compute the TS for x_k on a subset of examples D_k ⊂ D \ {x_k} excluding x_k:

x̂^i_k = (Σ_{x_j ∈ D_k} 1_{x^i_j = x^i_k} · y_j + a p) / (Σ_{x_j ∈ D_k} 1_{x^i_j = x^i_k} + a).   (5)

Holdout TS  One way is to partition the training dataset into two parts D = D̂_0 ⊔ D̂_1 and use D_k = D̂_0 for calculating the TS according to (5) and D̂_1 for training (e.g., applied in [8] for the Criteo dataset). Though such a holdout TS satisfies P1, this approach significantly reduces the amount of data used both for training the model and calculating the TS. 
So, it violates the following desired property:

P2  Effective usage of all training data for calculating TS features and for learning a model.

Leave-one-out TS  At first glance, a leave-one-out technique might work well: take D_k = D \ {x_k} for training examples x_k and D_k = D for test ones [31]. Surprisingly, it does not prevent target leakage. Indeed, consider a constant categorical feature: x^i_k = A for all examples. Let n^+ be the number of examples with y = 1; then x̂^i_k = (n^+ − y_k + a p)/(n − 1 + a), and one can perfectly classify the training dataset by making a split with threshold t = (n^+ − 0.5 + a p)/(n − 1 + a).

Ordered TS  CatBoost uses a more effective strategy. It relies on the ordering principle, the central idea of this paper, and is inspired by online learning algorithms, which receive training examples sequentially in time [1, 15, 18, 22]. Clearly, the values of TS for each such example rely only on the observed history. To adapt this idea to the standard offline setting, we introduce an artificial "time", i.e., a random permutation σ of the training examples. Then, for each example, we use all the available "history" to compute its TS, i.e., we take D_k = {x_j : σ(j) < σ(k)} in Equation (5) for a training example and D_k = D for a test one. The obtained ordered TS satisfies the requirement P1 and allows us to use all training data for learning the model (P2). Note that, if we use only one random permutation, then preceding examples have TS with much higher variance than subsequent ones. To this end, CatBoost uses different permutations for different steps of gradient boosting; see details in Section 5.

4 Prediction shift and ordered boosting

4.1 Prediction shift

In this section, we reveal the problem of prediction shift in gradient boosting, which was neither recognized nor previously addressed. As in the case of TS, prediction shift is caused by a special kind of target leakage. 
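The ordered TS described above admits a compact single-pass sketch (an illustrative helper of our own, not CatBoost's actual implementation; the explicit `perm` argument stands in for the random permutation σ):

```python
def ordered_ts(categories, targets, perm, a=1.0, p=0.5):
    """Ordered target statistic: the TS of each example is computed only from
    examples that precede it in the permutation `perm` (the artificial 'time'),
    so an example's own target never leaks into its own feature."""
    ts = [0.0] * len(categories)
    totals, counts = {}, {}
    for k in perm:                       # visit examples in 'time' order
        c = categories[k]
        ts[k] = (totals.get(c, 0.0) + a * p) / (counts.get(c, 0) + a)
        totals[c] = totals.get(c, 0.0) + targets[k]   # only now reveal y_k
        counts[c] = counts.get(c, 0) + 1
    return ts
```

On a constant categorical feature with all targets equal to 1 and prior p = 0, the first visited example gets TS 0, the second 1/2, the third 2/3, and so on: early positions have noisier statistics, which is exactly why CatBoost averages over several permutations.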
Our solution is called ordered boosting and resembles the ordered TS method.

Let us go back to the gradient boosting procedure described in Section 2. In practice, the expectation in (2) is unknown and is usually approximated using the same dataset D:

h^t = argmin_{h ∈ H} (1/n) Σ_{k=1}^n (−g^t(x_k, y_k) − h(x_k))^2.   (6)

Now we describe and analyze the following chain of shifts:

1. the conditional distribution of the gradient g^t(x_k, y_k) | x_k (accounting for the randomness of D \ {x_k}) is shifted from the distribution g^t(x, y) | x on a test example;
2. in turn, the base predictor h^t defined by Equation (6) is biased from the solution of Equation (2);
3. this, finally, affects the generalization ability of the trained model F^t.

As in the case of TS, these problems are caused by the target leakage. Indeed, the gradients used at each step are estimated using the target values of the same data points the current model F^{t−1} was built on. However, the conditional distribution F^{t−1}(x_k) | x_k for a training example x_k is shifted, in general, from the distribution F^{t−1}(x) | x for a test example x. We call this a prediction shift.

Related work on prediction shift  The shift of the gradient conditional distribution g^t(x_k, y_k) | x_k was previously mentioned in papers on boosting [3, 13] but was not formally defined. Moreover, even the existence of a non-zero shift was not proved theoretically. Based on the out-of-bag estimation [2], Breiman proposed iterated bagging [3], which constructs a bagged weak learner at each iteration on the basis of "out-of-bag" residual estimates. However, as we formally show in Section E of the supplementary material, such residual estimates are still shifted. Besides, the bagging scheme increases learning time by a factor of the number of data buckets. 
Subsampling of the dataset at each iteration, proposed by Friedman [13], addresses the problem much more heuristically and also only alleviates it.

Analysis of prediction shift  We formally analyze the problem of prediction shift in the simple case of a regression task with the quadratic loss function L(y, ŷ) = (y − ŷ)^2.4 In this case, the negative gradient −g^t(x_k, y_k) in Equation (6) can be substituted by the residual function r^{t−1}(x_k, y_k) := y_k − F^{t−1}(x_k).5 Assume we have m = 2 features x^1, x^2 that are i.i.d. Bernoulli random variables with p = 1/2, and y = f*(x) = c_1 x^1 + c_2 x^2. Assume we make N = 2 steps of gradient boosting with decision stumps (trees of depth 1) and step size α = 1. We obtain a model F = F^2 = h^1 + h^2. W.l.o.g., we assume that h^1 is based on x^1 and h^2 is based on x^2, which is typical for |c_1| > |c_2| (here we set some asymmetry between x^1 and x^2).

4We restrict the rest of Section 4 to this case, but the approaches of Section 4.2 are applicable to other tasks.

5Here we removed the multiplier 2, which does not matter for the further analysis.

Theorem 1  1. If two independent samples D_1 and D_2 of size n are used to estimate h^1 and h^2, respectively, using Equation (6), then E_{D_1,D_2} F^2(x) = f*(x) + O(1/2^n) for any x ∈ {0, 1}^2.
2. If the same dataset D = D_1 = D_2 is used in Equation (6) for both h^1 and h^2, then E_D F^2(x) = f*(x) − (1/(n−1)) c_2 (x^2 − 1/2) + O(1/2^n).

This theorem means that the trained model is an unbiased estimate of the true dependence y = f*(x) when we use independent datasets at each gradient step.6 On the other hand, if we use the same dataset at each step, we suffer from a bias −(1/(n−1)) c_2 (x^2 − 1/2), which is inversely proportional to the data size n. Also, the value of the bias can depend on the relation f*: in our example, it is proportional to c_2. 
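A quick Monte-Carlo check of this toy setting (our own illustrative script; all helper names are hypothetical) reproduces both parts of the theorem: reusing one dataset shifts the prediction at x = (0, 0) by roughly c_2 / (2(n − 1)), while independent datasets do not:

```python
import random

def fit_stump(xs, rs):
    """Depth-1 regressor on a binary feature: mean residual per feature value."""
    return {v: sum(r for x, r in zip(xs, rs) if x == v) /
               max(1, sum(1 for x in xs if x == v)) for v in (0, 1)}

def sample(n, rng, c1=1.0, c2=1.0):
    """n examples with x1, x2 ~ Bernoulli(1/2) and y = c1*x1 + c2*x2.
    Resample until both values of each feature occur, so stumps are defined."""
    while True:
        d = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(n)]
        if {a for a, _ in d} == {0, 1} and {b for _, b in d} == {0, 1}:
            return [(a, b, c1 * a + c2 * b) for a, b in d]

def two_step_prediction(d1, d2, point):
    """h1 (split on x1) fitted on d1; h2 (split on x2) fitted on d2's residuals."""
    h1 = fit_stump([a for a, _, _ in d1], [y for _, _, y in d1])
    h2 = fit_stump([b for _, b, _ in d2], [y - h1[a] for a, _, y in d2])
    return h1[point[0]] + h2[point[1]]

rng, n, trials = random.Random(0), 10, 20000
same = sum(two_step_prediction(d, d, (0, 0))
           for d in (sample(n, rng) for _ in range(trials))) / trials
indep = sum(two_step_prediction(sample(n, rng), sample(n, rng), (0, 0))
            for _ in range(trials)) / trials
# f*(0, 0) = 0; 'same' is shifted by about 1 / (2 * (n - 1)) ~ 0.056, 'indep' is not.
```

With c_1 = c_2 = 1 and n = 10, the averaged same-dataset prediction sits near 0.056 above the true value f*(0, 0) = 0, matching part 2 of the theorem, while the independent-samples average stays near 0.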
We track the chain of shifts for the second part of Theorem 1 in a sketch of the proof below, while the full proof of Theorem 1 is available in the supplementary material (Section A).

Sketch of the proof.  Denote by ξ_{st}, s, t ∈ {0, 1}, the number of examples (x_k, y_k) ∈ D with x_k = (s, t). We have h^1(s, t) = c_1 s + c_2 ξ_{s1} / (ξ_{s0} + ξ_{s1}). Its expectation E(h^1(x)) on a test example x equals c_1 x^1 + c_2/2. At the same time, the expectation E(h^1(x_k)) on a training example x_k is different and equals (c_1 x^1 + c_2/2) − c_2 (2x^2 − 1)/n + O(2^{−n}). That is, we experience a prediction shift of h^1. As a consequence, the expected value of h^2(x) is E(h^2(x)) = c_2 (x^2 − 1/2)(1 − 1/(n−1)) + O(2^{−n}) on a test example x, and E(h^1(x) + h^2(x)) = f*(x) − (1/(n−1)) c_2 (x^2 − 1/2) + O(1/2^n). □

Finally, recall that the greedy TS x̂^i can be considered as a simple statistical model predicting the target y, and it suffers from a similar problem, a conditional shift of x̂^i_k | y_k, caused by the target leakage, i.e., by using y_k to compute x̂^i_k.

4.2 Ordered boosting

Here we propose a boosting algorithm which does not suffer from the prediction shift problem described in Section 4.1. Assuming access to an unlimited amount of training data, we can easily construct such an algorithm. At each step of boosting, we sample a new dataset D_t independently and obtain unshifted residuals by applying the current model to new training examples. In practice, however, labeled data is limited. Assume that we learn a model with I trees. To make the residual r^{I−1}(x_k, y_k) unshifted, we need F^{I−1} to be trained without the example x_k. Since we need unbiased residuals for all training examples, no examples could then be used for training F^{I−1}, which at first glance makes the training process impossible. 
However, it is possible to maintain a set of models differing by the examples used for their training. Then, for calculating the residual on an example, we use a model trained without it. In order to construct such a set of models, we can use the ordering principle previously applied to TS in Section 3.2. To illustrate the idea, assume that we take one random permutation σ of the training examples and maintain n different supporting models M_1, . . . , M_n such that the model M_i is learned using only the first i examples in the permutation. At each step, in order to obtain the residual for the j-th sample, we use the model M_{j−1} (see Figure 1). The resulting Algorithm 1 is called ordered boosting below. Unfortunately, this algorithm is not feasible in most practical tasks due to the need to train n different models, which increases the complexity and memory requirements by n times. In CatBoost, we implemented a modification of this algorithm on the basis of the gradient boosting algorithm with decision trees as base predictors (GBDT) described in Section 5.

Ordered boosting with categorical features  In Sections 3.2 and 4.2 we proposed to use random permutations σ_cat and σ_boost of the training examples for the TS calculation and for ordered boosting, respectively. Combining them in one algorithm, we should take σ_cat = σ_boost to avoid prediction shift. This guarantees that the target y_i is not used for training M_i (neither for the TS calculation, nor for the gradient estimation). See Section F of the supplementary material for theoretical guarantees. Empirical results confirming the importance of having σ_cat = σ_boost are presented in Section G of the supplementary material.

6Up to an exponentially small term, which occurs for a technical reason.

5 Practical implementation of ordered boosting

CatBoost has two boosting modes, Ordered and Plain. 
The latter mode is the standard GBDT algorithm with inbuilt ordered TS. The former mode presents an efficient modification of Algorithm 1. A formal description of the algorithm is included in Section B of the supplementary material. In this section, we overview the most important implementation details.

Algorithm 1: Ordered boosting
input: {(x_k, y_k)}_{k=1}^n, I
σ ← random permutation of [1, n];
M_i ← 0 for i = 1..n;
for t ← 1 to I do
    for i ← 1 to n do
        r_i ← y_i − M_{σ(i)−1}(x_i);
    for i ← 1 to n do
        ΔM ← LearnModel((x_j, r_j) : σ(j) ≤ i);
        M_i ← M_i + ΔM;
return M_n

Algorithm 2: Building a tree in CatBoost
input: M, {(x_i, y_i)}_{i=1}^n, α, L, {σ_i}_{i=1}^s, Mode
grad ← CalcGradient(L, M, y);
r ← random(1, s);
if Mode = Plain then G ← (grad_r(i) for i = 1..n);
if Mode = Ordered then G ← (grad_{r,σ_r(i)−1}(i) for i = 1..n);
T ← empty tree;
foreach step of the top-down procedure do
    foreach candidate split c do
        T_c ← add split c to T;
        if Mode = Plain then
            Δ(i) ← avg(grad_r(p) for p : leaf_r(p) = leaf_r(i)) for i = 1..n;
        if Mode = Ordered then
            Δ(i) ← avg(grad_{r,σ_r(i)−1}(p) for p : leaf_r(p) = leaf_r(i), σ_r(p) < σ_r(i)) for i = 1..n;
        loss(T_c) ← cos(Δ, G);
    T ← argmin_{T_c}(loss(T_c));
if Mode = Plain then
    M_{r′}(i) ← M_{r′}(i) − α avg(grad_{r′}(p) for p : leaf_{r′}(p) = leaf_{r′}(i)) for r′ = 1..s, i = 1..n;
if Mode = Ordered then
    M_{r′,j}(i) ← M_{r′,j}(i) − α avg(grad_{r′,j}(p) for p : leaf_{r′}(p) = leaf_{r′}(i), σ_{r′}(p) ≤ j) for r′ = 1..s, i = 1..n, j ≥ σ_{r′}(i) − 1;
return T, M

Figure 1: Ordered boosting principle; examples are ordered according to σ, and the residual of example 7 is computed as r^t(x_7, y_7) = y_7 − M_6^{t−1}(x_7), using a model trained only on the preceding examples.

At the start, CatBoost generates s + 1 independent 
random permutations of the training dataset. The permutations σ_1, . . . , σ_s are used for evaluation of the splits that define tree structures (i.e., the internal nodes), while σ_0 serves for choosing the leaf values b_j of the obtained trees (see Equation (3)). For examples with a short history in a given permutation, both TS and the predictions used by ordered boosting (M_{σ(i)−1}(x_i) in Algorithm 1) have a high variance. Therefore, using only one permutation may increase the variance of the final model predictions, while several permutations allow us to reduce this effect in a way we further describe. The advantage of several permutations is confirmed by our experiments in Section 6.

Building a tree  In CatBoost, base predictors are oblivious decision trees [9, 14], also called decision tables [23]. The term oblivious means that the same splitting criterion is used across an entire level of the tree. Such trees are balanced, less prone to overfitting, and allow speeding up execution at testing time significantly. The procedure of building a tree in CatBoost is described in Algorithm 2.

In the Ordered boosting mode, during the learning process, we maintain the supporting models M_{r,j}, where M_{r,j}(i) is the current prediction for the i-th example based on the first j examples in the permutation σ_r. At each iteration t of the algorithm, we sample a random permutation σ_r from {σ_1, . . . , σ_s} and construct a tree T_t on the basis of it. First, for categorical features, all TS are computed according to this permutation. 
Second, the permutation affects the tree learning procedure.

Table 1: Computational complexity per iteration t.
CalcGradient | O(s · n)
Build T | O(|C| · n)
Update M | O(s · n)
Calc ordered TS | O(N_{TS,t} · n)
Calc all b^t_j | O(n)

Namely, based on M_{r,j}(i), we compute the corresponding gradients grad_{r,j}(i) = ∂L(y_i, s)/∂s |_{s = M_{r,j}(i)}. Then, while constructing a tree, we approximate the gradient G in terms of the cosine similarity cos(·, ·), where, for each example i, we take the gradient grad_{r,σ_r(i)−1}(i) (it is based only on the previous examples in σ_r). At the candidate splits evaluation step, the leaf value Δ(i) for example i is obtained individually by averaging the gradients grad_{r,σ_r(i)−1} of the preceding examples p lying in the same leaf leaf_r(i) the example i belongs to. Note that leaf_r(i) depends on the chosen permutation σ_r, because σ_r can influence the values of ordered TS for example i. When the tree structure T_t (i.e., the sequence of splitting attributes) is built, we use it to boost all the models M_{r′,j}. Let us stress that one common tree structure T_t is used for all the models, but this tree is added to different M_{r′,j} with different sets of leaf values depending on r′ and j, as described in Algorithm 2. The Plain boosting mode works similarly to a standard GBDT procedure, but, if categorical features are present, it maintains s supporting models M_r corresponding to TS based on σ_1, . . . , σ_s.

Choosing leaf values  Given all the trees constructed, the leaf values of the final model F are calculated by the standard gradient boosting procedure equally for both modes. 
Training examples i are matched to leaves leaf_0(i), i.e., we use the permutation σ_0 to calculate TS here. When the final model F is applied to a new example at testing time, we use TS calculated on the whole training data according to Section 3.2.

Complexity  In our practical implementation, we use one important trick which significantly reduces the computational complexity of the algorithm. Namely, in the Ordered mode, instead of O(s n^2) values M_{r,j}(i), we store and update only the values M′_{r,j}(i) := M_{r,2^j}(i) for j = 1, . . . , ⌈log_2 n⌉ and all i with σ_r(i) ≤ 2^{j+1}, which reduces the number of maintained supporting predictions to O(s n). See Section B of the supplementary material for the pseudocode of this modification of Algorithm 2.

In Table 1, we present the computational complexity of the different components of both CatBoost modes per one iteration (see Section C.1 of the supplementary material for the proof). Here N_{TS,t} is the number of TS to be calculated at iteration t, and C is the set of candidate splits to be considered at the given iteration. It follows that our implementation of ordered boosting with decision trees has the same asymptotic complexity as the standard GBDT with ordered TS. In comparison with other types of TS (Section 3.2), ordered TS slow down the procedures CalcGradient, updating the supporting models M, and the computation of TS by a factor of s.

Feature combinations  Another important detail of CatBoost is using combinations of categorical features as additional categorical features which capture high-order dependencies, like the joint information of user ID and ad topic in the task of ad click prediction. The number of possible combinations grows exponentially with the number of categorical features in the dataset, and it is infeasible to process all of them. CatBoost constructs combinations in a greedy way. 
Namely, for each split of a tree, CatBoost combines (concatenates) all categorical features (and their combinations) already used for previous splits in the current tree with all categorical features in the dataset. Combinations are converted to TS on the fly.

Other important details  Finally, let us discuss two options of the CatBoost algorithm not covered above. The first one is subsampling of the dataset at each iteration of the boosting procedure, as proposed by Friedman [13]. We claimed earlier in Section 4.1 that this approach alone cannot fully avoid the problem of prediction shift. However, since it has proved effective, we implemented it in both modes of CatBoost as a Bayesian bootstrap procedure. Specifically, before training a tree according to Algorithm 2, we assign a weight w_i = a^t_i to each example i, where the a^t_i are generated according to the Bayesian bootstrap procedure (see [28, Section 2]). These weights are used as multipliers for the gradients grad_r(i) and grad_{r,j}(i) when we calculate Δ(i) and the components of the vector Δ − G to define loss(T_c).

The second option deals with the first several examples in a permutation. For examples i with small values σ_r(i), the variance of grad_{r,σ_r(i)−1}(i) can be high. Therefore, we discard Δ(i) from the beginning of the permutation when we calculate loss(T_c) in Algorithm 2. In particular, we eliminate the corresponding components of the vectors G and Δ when calculating the cosine similarity between them.

6 Experiments

Comparison with baselines  We compare our algorithm with the most popular open-source libraries — XGBoost and LightGBM — on several well-known machine learning tasks. The detailed description of the experimental setup together with dataset descriptions is available in the supplementary material (Section D). 
The source code of the experiment is available, and the results can be reproduced.7 For all learning algorithms, we preprocess categorical features using the ordered TS method described in Section 3.2. Parameter tuning and training were performed on 4/5 of the data, and testing was performed on the remaining 1/5.8 The results measured by logloss and zero-one loss are presented in Table 2 (the absolute values for the baselines can be found in Section G of the supplementary material). For CatBoost, we used the Ordered boosting mode in this experiment.9 One can see that CatBoost outperforms the other algorithms on all the considered datasets. We also measured the statistical significance of the improvements presented in Table 2: except for three datasets (Appetency, Churn, and Upselling), the improvements are statistically significant with p-value ≪ 0.01 as measured by the paired one-tailed t-test.
To demonstrate that our implementation of plain boosting is an appropriate baseline for our research, we show that a raw setting of CatBoost provides state-of-the-art quality. In particular, we take a setting of CatBoost which is close to classical GBDT [12] and compare it with the baseline boosting implementations in Section G of the supplementary material.
Experiments show that this raw setting differs from the baselines insignificantly.

Table 2: Comparison with baselines: logloss / zero-one loss (relative increase for baselines).

            CatBoost        LightGBM         XGBoost
Adult       0.270 / 0.127   +2.4% / +1.9%    +2.2%  / +1.0%
Amazon      0.139 / 0.044   +17%  / +21%     +17%   / +21%
Click       0.392 / 0.156   +1.2% / +1.2%    +1.2%  / +1.2%
Epsilon     0.265 / 0.109   +1.5% / +4.1%    +11%   / +12%
Appetency   0.072 / 0.018   +0.4% / +0.2%    +0.4%  / +0.7%
Churn       0.232 / 0.072   +0.1% / +0.6%    +0.5%  / +1.6%
Internet    0.209 / 0.094   +6.8% / +8.6%    +7.9%  / +8.0%
Upselling   0.166 / 0.049   +0.3% / +0.1%    +0.04% / +0.3%
Kick        0.286 / 0.095   +3.5% / +4.4%    +3.2%  / +4.1%

Table 3: Plain boosting mode: logloss, zero-one loss, and their change relative to Ordered boosting mode.

            Logloss          Zero-one loss
Adult       0.272 (+1.1%)    0.127 (-0.1%)
Amazon      0.139 (-0.6%)    0.044 (-1.5%)
Click       0.392 (-0.05%)   0.156 (+0.19%)
Epsilon     0.266 (+0.6%)    0.110 (+0.9%)
Appetency   0.072 (+0.5%)    0.018 (+1.5%)
Churn       0.232 (-0.06%)   0.072 (-0.17%)
Internet    0.217 (+3.9%)    0.099 (+5.4%)
Upselling   0.166 (+0.1%)    0.049 (+0.4%)
Kick        0.285 (-0.2%)    0.095 (-0.1%)

We also empirically analyzed the running times of the algorithms on the Epsilon dataset. The details of the comparison can be found in the supplementary material (Section C.2). To summarize, CatBoost Plain and LightGBM are the fastest, followed by Ordered mode, which is about 1.7 times slower.

Ordered and Plain modes  In this section, we compare the two essential boosting modes of CatBoost: Plain and Ordered. First, we compared their performance on all the considered datasets; the results are presented in Table 3. It can be clearly seen that Ordered mode is particularly useful on small datasets.
Indeed, the largest benefit from Ordered mode is observed on the Adult and Internet datasets, which are relatively small (less than 40K training examples); this supports our hypothesis that a higher bias negatively affects the performance. In fact, according to Theorem 1 and our reasoning in Section 4.1, the bias is expected to be larger for smaller datasets (however, it can also depend on other properties of the dataset, e.g., on the dependency between features and target). In order to further validate this hypothesis, we ran the following experiment: we train CatBoost in Ordered and Plain modes on randomly filtered datasets and compare the obtained losses, see Figure 2. As expected, for smaller datasets the relative performance of Plain mode becomes worse.

7 https://github.com/catboost/benchmarks/tree/master/quality_benchmarks
8 For Epsilon, we use default parameters instead of parameter tuning due to the large running time for all algorithms. We tune only the number of trees to avoid overfitting.
9 The numbers for CatBoost in Table 2 may slightly differ from the corresponding numbers in our GitHub repository, since we use another version of CatBoost with all the discussed features implemented.
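For reference, the ordered TS compared in the target-statistics analysis below can be sketched in a few lines. This is a minimal illustration, assuming the standard smoothed TS formula (sum of preceding targets + a·p) / (count + a) with prior p and smoothing parameter a; the function and parameter names are ours, not CatBoost's API:

```python
import numpy as np

def ordered_ts(categories, targets, prior, a=1.0, seed=0):
    """Ordered target statistic: each example's TS is computed only
    from the examples that precede it in a random permutation and
    share its category, smoothed towards `prior`.

    An example's own target never enters its own TS, which is what
    avoids the target leakage discussed in Section 3.2.
    """
    perm = np.random.default_rng(seed).permutation(len(categories))
    sums, counts = {}, {}
    ts = np.empty(len(categories))
    for i in perm:                            # walk the permutation ("history" order)
        c = categories[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        ts[i] = (s + a * prior) / (k + a)     # uses preceding examples only
        sums[c] = s + targets[i]              # now add this example to the history
        counts[c] = k + 1
    return ts

# Toy usage: a single category, so each TS depends only on the
# example's position in the permutation.
ts = ordered_ts(["x"] * 5, [1, 0, 1, 1, 0], prior=0.5)
```

The example placed first in the permutation sees an empty history and therefore receives exactly the prior; this is the high-variance "beginning of the permutation" effect that motivates both the prefix-discarding trick above and the use of several permutations (s > 1).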
To save space, here we present the results only for logloss; the figure for zero-one loss is similar.
We also compare Ordered and Plain modes in the above-mentioned raw setting of CatBoost in Section G of the supplementary material and conclude that the advantage of Ordered mode is not caused by interaction with specific CatBoost options.

Table 4: Comparison of target statistics: relative change in logloss / zero-one loss compared to ordered TS.

            Greedy          Holdout           Leave-one-out
Adult       +1.1% / +0.8%   +2.1% / +2.0%     +5.5% / +3.7%
Amazon      +40%  / +32%    +8.3% / +8.3%     +4.5% / +5.6%
Click       +13%  / +6.7%   +1.5% / +0.5%     +2.7% / +0.9%
Appetency   +24%  / +0.7%   +1.6% / -0.5%     +8.5% / +0.7%
Churn       +12%  / +2.1%   +0.9% / +1.3%     +1.6% / +1.8%
Internet    +33%  / +22%    +2.6% / +1.8%     +27%  / +19%
Upselling   +57%  / +50%    +1.6% / +0.9%     +3.9% / +2.9%
Kick        +22%  / +28%    +1.3% / +0.32%    +3.7% / +3.3%

Figure 2: Relative error of Plain boosting mode compared to Ordered boosting mode depending on the fraction of the dataset.

Analysis of target statistics  We compare the different TSs introduced in Section 3.2 as options of CatBoost in Ordered boosting mode, keeping all other algorithmic details the same; the results can be found in Table 4. Here, to save space, we present only the relative increase in the loss functions for each algorithm compared to CatBoost with ordered TS. Note that the ordered TS used in CatBoost significantly outperforms all other approaches. Also, among the baselines, the holdout TS is the best for most of the datasets, since it does not suffer from the conditional shift discussed in Section 3.2 (P1); still, it is worse than CatBoost due to less effective usage of the training data (P2). Leave-one-out is usually better than the greedy TS, but it can be much worse on some datasets, e.g., on Adult.
The reason is that the greedy TS suffers from low-frequency categories, while the leave-one-out TS also suffers from high-frequency ones, and on Adult all the features have high frequency.
Finally, let us note that in Table 4 we combine the Ordered mode of CatBoost with different TSs. To generalize these results, we also performed a similar experiment combining the different TSs with Plain mode, as used in standard gradient boosting. The obtained results and conclusions turned out to be very similar to the ones discussed above.

Feature combinations  The effect of the feature combinations discussed in Section 5 is demonstrated in Figure 1 of the supplementary material. On average, increasing the number c_max of features allowed to be combined from 1 to 2 provides an outstanding improvement of logloss by 1.86% (reaching 11.3%), increasing it from 1 to 3 yields 2.04%, and a further increase of c_max does not influence the performance significantly.

Number of permutations  The effect of the number s of permutations on the performance of CatBoost is presented in Figure 2 of the supplementary material. On average, increasing s slightly decreases logloss, e.g., by 0.19% for s = 3 and by 0.38% for s = 9 compared to s = 1.

7 Conclusion

In this paper, we identify and analyze the problem of prediction shift present in all existing implementations of gradient boosting. We propose a general solution, ordered boosting with ordered TS, which solves the problem. This idea is implemented in CatBoost, a new gradient boosting library.
Empirical results demonstrate that CatBoost outperforms leading GBDT packages and leads to new state-of-the-art results on common benchmarks.

Acknowledgments

We are very grateful to Mikhail Bilenko for important references and advice that led to the theoretical analysis of this paper, as well as for suggestions on the presentation. We also thank Pavel Serdyukov for many helpful discussions and valuable links, and Nikita Kazeev, Nikita Dmitriev, Stanislav Kirillov, and Victor Omelyanenko for help with the experiments.

References
[1] L. Bottou and Y. L. Cun. Large scale online learning. In Advances in Neural Information Processing Systems, pages 217–224, 2004.
[2] L. Breiman. Out-of-bag estimation, 1996.
[3] L. Breiman. Using iterated bagging to debias regressions. Machine Learning, 45(3):261–277, 2001.
[4] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. CRC Press, 1984.
[5] R. Caruana and A. Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, pages 161–168. ACM, 2006.
[6] B. Cestnik et al. Estimating probabilities: a crucial task in machine learning. In ECAI, volume 90, pages 147–149, 1990.
[7] O. Chapelle, E. Manavoglu, and R. Rosales. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4):61, 2015.
[8] T. Chen and C. Guestrin. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
[9] M. Ferov and M. Modrý. Enhancing LambdaMART using oblivious trees.
arXiv preprint arXiv:1609.05610, 2016.
[10] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
[11] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
[12] J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232, 2001.
[13] J. H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.
[14] A. Gulin, I. Kuralenok, and D. Pavlov. Winning the transfer learning track of Yahoo!'s learning to rank challenge with YetiRank. In Yahoo! Learning to Rank Challenge, pages 63–76, 2011.
[15] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pages 1–9. ACM, 2014.
[16] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3149–3157, 2017.
[17] M. Kearns and L. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM (JACM), 41(1):67–95, 1994.
[18] J. Langford, L. Li, and T. Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10(Mar):777–801, 2009.
[19] LightGBM. Categorical feature support. http://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support, 2017.
[20] LightGBM. Optimal split for categorical features.
http://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features, 2017.
[21] LightGBM. feature_histogram.cpp. https://github.com/Microsoft/LightGBM/blob/master/src/treelearner/feature_histogram.hpp, 2018.
[22] X. Ling, W. Deng, C. Gu, H. Zhou, C. Li, and F. Sun. Model ensemble for click prediction in Bing search ads. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 689–698. International World Wide Web Conferences Steering Committee, 2017.
[23] Y. Lou and M. Obukhov. BDT: gradient boosted decision tables for high accuracy and scoring efficiency. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1893–1901. ACM, 2017.
[24] L. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, pages 512–518, 2000.
[25] D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1):27–32, 2001.
[26] B. P. Roe, H.-J. Yang, J. Zhu, Y. Liu, I. Stancu, and G. McGregor. Boosted decision trees as an alternative to artificial neural networks for particle identification. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 543(2):577–584, 2005.
[27] L. Rokach and O. Maimon. Top-down induction of decision trees classifiers: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 35(4):476–487, 2005.
[28] D. B. Rubin. The Bayesian bootstrap. The Annals of Statistics, pages 130–134, 1981.
[29] Q. Wu, C. J. Burges, K. M. Svore, and J. Gao. Adapting boosting for information retrieval measures.
Information Retrieval, 13(3):254–270, 2010.
[30] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In International Conference on Machine Learning, pages 819–827, 2013.
[31] O. Zhang. Winning data science competitions. https://www.slideshare.net/ShangxuanZhang/winning-data-science-competitions-presented-by-owen-zhang, 2015.
[32] Y. Zhang and A. Haghani. A gradient boosting method to improve travel time prediction. Transportation Research Part C: Emerging Technologies, 58:308–324, 2015.