{"title": "Boosting Density Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 657, "page_last": 664, "abstract": null, "full_text": "Boosting Density Estimation\n\nSaharon Rosset\nDepartment of Statistics\nStanford University\nStanford, CA, 94305\nsaharon@stat.stanford.edu\n\nEran Segal\nComputer Science Department\nStanford University\nStanford, CA, 94305\neran@cs.stanford.edu\n\nAbstract\n\nSeveral authors have suggested viewing boosting as a gradient descent search for a good fit in function space. We apply gradient-based boosting methodology to the unsupervised learning problem of density estimation. We show convergence properties of the algorithm and prove that a strength of weak learnability property applies to this problem as well. We illustrate the potential of this approach through experiments with boosting Bayesian networks to learn density models.\n\n1 Introduction\n\nBoosting is a method for incrementally building linear combinations of \u201cweak\u201d models, to generate a \u201cstrong\u201d predictive model. Given data $\\{x_i\\}_{i=1}^n$, a basis (or dictionary) of weak learners $\\mathcal{H}$ and a loss function $L$, a boosting algorithm sequentially finds models $h_1, h_2, \\ldots \\in \\mathcal{H}$ and constants $\\alpha_1, \\alpha_2, \\ldots \\in \\mathbb{R}$ to minimize $\\sum_i L(x_i, \\sum_j \\alpha_j h_j(x_i))$. AdaBoost [6], the original boosting algorithm, was specifically devised for the task of classification, where $x_i = (z_i, y_i)$ with $y_i \\in \\{-1, 1\\}$ and $L(x_i, \\sum_j \\alpha_j h_j(x_i)) = L(y_i, \\sum_j \\alpha_j h_j(z_i))$. AdaBoost sequentially fits weak learners on re-weighted versions of the data, where the weights are determined according to the performance of the model so far, emphasizing the more \u201cchallenging\u201d examples. Its inventors attribute its success to the \u201cboosting\u201d effect which the linear combination of weak learners achieves, when compared to their individual performance. This effect manifests itself both in training data performance, where the boosted model can be shown to converge, under mild conditions, to ideal training classification, and in generalization error, where the success of boosting has been attributed to its \u201cseparating\u201d (or margin maximizing) properties [18].\n\nIt has been shown [8, 13] that AdaBoost can be described as a gradient descent algorithm, where the weights in each step of the algorithm correspond to the gradient of an exponential loss function at the \u201ccurrent\u201d fit. In a recent paper, [17] show that the margin maximizing properties of AdaBoost can be derived in this framework as well. This view of boosting as gradient descent has allowed several authors [7, 13, 21] to suggest \u201cgradient boosting machines\u201d which apply to a wider class of supervised learning problems and loss functions than the original AdaBoost. Their results have been very promising.\n\nIn this paper we apply gradient boosting methodology to the unsupervised learning problem of density estimation, using the negative log likelihood loss criterion $L(x, \\sum_j \\alpha_j h_j(x)) = -\\log(\\sum_j \\alpha_j h_j(x))$. The density estimation problem has been studied extensively in many contexts using various parametric and non-parametric approaches [2, 5]. A particular framework which has recently gained much popularity is that of Bayesian networks [11], whose main strength stems from their graphical representation, allowing for highly interpretable models. More recently, researchers have developed methods for learning Bayesian networks from data including learning in the context of incomplete data. We use Bayesian networks as our choice of weak learners, combining the models using the boosting methodology. We note that several researchers have considered learning weighted mixtures of networks [14], or ensembles of Bayesian networks combined by model averaging [9, 20].\n\nWe describe a generic density estimation boosting algorithm, following the approach of [13]. The main idea is to identify, at each boosting iteration, the basis function $h \\in \\mathcal{H}$ which gives the largest \u201clocal\u201d improvement in the loss at the current fit. Intuitively, $h$ assigns higher probability to instances that received low probability by the current model. A line search is then used to find an appropriate coefficient for the newly selected $h$ function, and it is added to the current model.\n\nWe provide a theoretical analysis of our density estimation boosting algorithm, showing an explicit condition, which if satisfied, guarantees that adding a weak learner to the model improves the training set loss.\u0020
We also prove a \u201cstrength of weak learnability\u201d theorem which gives lower bounds on overall training loss improvement as a function of the individual weak learners\u2019 performance on re-weighted versions of the training data.\n\nWe describe the instantiation of our generic boosting algorithm for the case of using Bayesian networks as our basis of weak learners $\\mathcal{H}$ and provide experimental results on two distinct data sets, showing that our algorithm achieves higher generalization on unseen data as compared to a single Bayesian network and one particular ensemble of Bayesian networks. We also show that our theoretical criterion for a weak learner to improve the overall model applies well in practice.\n\n2 A density estimation boosting algorithm\n\nAt each step $t$ in a boosting algorithm, the model built so far is: $F_{t-1}(x) = \\sum_{j=1}^{t-1} \\alpha_j h_j(x)$. If we now choose a weak learner $h \\in \\mathcal{H}$ and add it to our model with a small coefficient $\\epsilon$, then developing the training loss of the new model $F_t = F_{t-1} + \\epsilon h$ in a Taylor series around the loss at $F_{t-1}$ gives\n\n$$\\sum_i L(x_i, F_t(x_i)) = \\sum_i L(x_i, F_{t-1}(x_i)) + \\epsilon \\sum_i \\left. \\frac{\\partial L(x_i, F)}{\\partial F} \\right|_{F = F_{t-1}(x_i)} h(x_i) + O(\\epsilon^2)$$\n\nwhich in the case of negative log-likelihood loss can be written as\n\n$$\\sum_i -\\log(F_t(x_i)) = \\sum_i -\\log(F_{t-1}(x_i)) - \\epsilon \\sum_i \\frac{h(x_i)}{F_{t-1}(x_i)} + O(\\epsilon^2)$$\n\nSince $\\epsilon$ is small, we can ignore the second order term and choose the next boosting step $h_t$ to maximize $\\sum_i h(x_i)/F_{t-1}(x_i)$. We are thus finding the first order optimal weak learner, which gives the \u201csteepest descent\u201d in the loss at the current model predictions. However, we should note that once $\\epsilon$ becomes non-infinitesimal, no \u201coptimality\u201d property can be claimed for this selected $h_t$.\n\nThe main idea of gradient-based generic boosting algorithms, such as AnyBoost [13] and GradientBoost [7], is to utilize this first order approach to find, at each step, the weak learner which gives good improvement in the loss and then follow the \u201cdirection\u201d of this weak learner to augment the current model. The step size $\\alpha_t$ is determined in various ways in the different algorithms, the most popular choice being line-search, which we adopt here.\n\nWhen we consider applying this methodology to density estimation, where the basis $\\mathcal{H}$ is comprised of probability distributions and the overall model $F_t$ is a probability distribution as well, we cannot simply augment the model, since $F_{t-1} + \\alpha_t h_t$ will no longer be a probability distribution. Rather, we consider a step of the form $F_t = (1 - \\alpha_t) F_{t-1} + \\alpha_t h_t$, where $0 \\le \\alpha_t \\le 1$. It is easy to see that the first order theory of gradient boosting and the line search solution apply to this formulation as well.\n\nIf at some stage $t$, the current $F_{t-1}$ cannot be improved by adding any of the weak learners as above, the algorithm terminates, and we have reached a global minimum. This can only happen if the derivative of the loss at the current model with respect to the coefficient of each weak learner is non-negative:\n\n$$\\forall h \\in \\mathcal{H}: \\left. \\frac{\\partial \\sum_i -\\log((1 - \\alpha) F_{t-1}(x_i) + \\alpha h(x_i))}{\\partial \\alpha} \\right|_{\\alpha = 0} = n - \\sum_i \\frac{h(x_i)}{F_{t-1}(x_i)} \\ge 0$$\n\nThus, the algorithm terminates if no $h \\in \\mathcal{H}$ gives $\\sum_i h(x_i)/F_{t-1}(x_i) > n$ (see section 3 for proof and discussion).\n\nThe resulting generic gradient boosting algorithm for density estimation can be seen in Fig. 1. Implementation details for this algorithm include the choice of the family of weak learners $\\mathcal{H}$, and the method for searching for $h_t$ at each boosting iteration. We address these details in Section 4.\n\n1. Set $F_0(x)$ to uniform on the domain of $x$\n2. For $t = 1$ to $T$\n(a) Set $w_i = 1/F_{t-1}(x_i)$\n(b) Find $h_t \\in \\mathcal{H}$ to maximize $\\sum_i w_i h_t(x_i)$\n(c) If $\\sum_i w_i h_t(x_i) \\le n$ break.\n(d) Find $\\alpha_t = \\arg\\min_\\alpha \\sum_i -\\log((1 - \\alpha) F_{t-1}(x_i) + \\alpha h_t(x_i))$\n(e) Set $F_t = (1 - \\alpha_t) F_{t-1} + \\alpha_t h_t$\n3. Output the final model $F_T$\n\nFigure 1: Boosting density estimation algorithm\n\n3 Training data performance\n\nThe concept of \u201cstrength of weak learnability\u201d [6, 18] has been developed in the context of boosting classification models. Conceptually, this property can be described as follows: \u201cif for any weighting of the training data $\\{w_i\\}_{i=1}^n$, there is a weak learner $h \\in \\mathcal{H}$ which achieves weighted training error slightly better than random guessing on the re-weighted version of the data using these weights, then the combined boosted learner will have vanishing error on the training data\u201d.\n\nIn classification, this concept is realized elegantly. At each step in the algorithm, the weighted error of the previous model, using the new weights is exactly $0.5$. Thus, the new weak learner doing \u201cbetter than random\u201d on the re-weighted data means it can improve the previous weak learner\u2019s performance at the current fit, by achieving weighted classification error better than $0.5$. In fact it is easy to show that the weak learnability condition of at least one weak learner attaining classification error less than $0.5$ on the re-weighted data does not hold only if the current combined model is the optimal solution in the space of linear combinations of weak learners.\n\nWe now derive a similar formulation for our density estimation boosting algorithm.\u0020
We start with a quantitative description of the performance of the previous weak learner $h_{t-1}$ at the combined model $F_{t-1}$, given in the following lemma:\n\nLemma 1 Using the algorithm of section 2 we get: $\\sum_i h_{t-1}(x_i)/F_{t-1}(x_i) = n$, where $n$ is the number of training examples.\n\nProof: The line search (step 2(d) in the algorithm) implies:\n\n$$0 = \\left. \\frac{\\partial \\sum_i -\\log((1 - \\alpha) F_{t-2}(x_i) + \\alpha h_{t-1}(x_i))}{\\partial \\alpha} \\right|_{\\alpha = \\alpha_{t-1}} = \\sum_i \\frac{F_{t-2}(x_i) - h_{t-1}(x_i)}{F_{t-1}(x_i)} = \\frac{1}{1 - \\alpha_{t-1}} \\left( n - \\sum_i \\frac{h_{t-1}(x_i)}{F_{t-1}(x_i)} \\right)$$\n\nLemma 1 allows us to derive the following stopping criterion (or optimality condition) for the boosting algorithm, illustrating that in order to improve training set loss, the new weak learner only has to exceed the previous one\u2019s performance at the current fit.\n\nTheorem 1 If there does not exist a weak learner $h \\in \\mathcal{H}$ such that $\\sum_i h(x_i)/F_{t-1}(x_i) > n$, then $F_{t-1}$ is the global minimum in the domain of normalized linear combinations of $\\mathcal{H}$:\n\n$$F_{t-1} = \\arg\\min_{\\gamma_j \\ge 0, \\sum_j \\gamma_j = 1} \\sum_i -\\log\\left(\\sum_j \\gamma_j h_j(x_i)\\right)$$\n\nProof: This is a direct result of the optimality conditions for a convex function (in this case $-\\log$) in a compact domain.\n\nSo unless we have reached the global optimum in the simplex within $\\mathcal{H}$ (which will generally happen quickly only if $\\mathcal{H}$ is very small, i.e. the \u201cweak\u201d learners are very weak), we will have some weak learners doing better than \u201crandom\u201d and attaining $\\sum_i h(x_i)/F_{t-1}(x_i) > n$. If this is indeed the case, we can derive an explicit lower bound for training set loss improvement as a function of the new weak learner\u2019s performance at the current model:\n\nTheorem 2 Assume:\n\n1. The sequence of selected weak learners in the algorithm of section 2 has: $\\sum_i h_t(x_i)/F_{t-1}(x_i) = n + \\lambda_t$\n\n2. $\\forall t: \\min_i \\min(h_t(x_i), F_{t-1}(x_i)) \\ge \\epsilon_t$\n\nThen we get: $\\sum_i -\\log(F_t(x_i)) \\le \\sum_i -\\log(F_{t-1}(x_i)) - \\frac{\\lambda_t^2 \\epsilon_t^2}{2n}$\n\nProof: Write $g(\\alpha) = \\sum_i -\\log((1 - \\alpha) F_{t-1}(x_i) + \\alpha h_t(x_i))$. Then:\n\n$$g'(0) = n - \\sum_i \\frac{h_t(x_i)}{F_{t-1}(x_i)} = -\\lambda_t$$\n\n$$g''(\\alpha) = \\sum_i \\frac{(h_t(x_i) - F_{t-1}(x_i))^2}{((1 - \\alpha) F_{t-1}(x_i) + \\alpha h_t(x_i))^2} \\le \\frac{n}{\\epsilon_t^2}$$\n\nwhere the inequality uses assumption 2 and the fact that $h_t$ and $F_{t-1}$ take values in $[0, 1]$. Combining these two gives: $g(\\alpha) \\le g(0) - \\alpha \\lambda_t + \\frac{\\alpha^2 n}{2 \\epsilon_t^2}$, which implies: $g(\\alpha_t) \\le g(0) - \\frac{\\lambda_t^2 \\epsilon_t^2}{2n}$, since the line search does at least as well as the choice $\\alpha = \\lambda_t \\epsilon_t^2 / n$.\n\nThe second assumption of theorem 2 may not seem obvious but it is actually quite mild. With a bit more notation we could get rid of the need to lower bound $h_t$ completely. For $F_t$, we can see intuitively that a boosting algorithm will not let any observation have exceptionally low probability over time since that would cause this observation to have overwhelming weight in the next boosting iteration and hence the next selected $h_t$ is certain to give it high probability. Thus, after some iterations we can assume that we would actually have a threshold $\\epsilon$ independent of the iteration number and hence the loss would decrease at least as the sum of squares of the \u201cweak learnability\u201d quantities $\\lambda_t$.\n\n4 Boosting Bayesian Networks\n\nWe now focus our attention on a specific application of the boosting methodology for density estimation, using Bayesian networks as the weak learners. A Bayesian network is a graphical model for describing a joint distribution over a set of random variables. Recently, there has been much work on developing algorithms for learning Bayesian networks (both network structure and parameters) from data for the task of density estimation and hence they seem appropriate as our choice of weak learners. Another advantage of Bayesian networks in our context is the ability to tune the strength of the weak learners using parameters such as number of edges and strength of prior.\n\nAssume we have categorical data $\\{x_i\\}_{i=1}^n$ in a domain $\\mathcal{X}$ where each of the $n$ observations contains assignments to $k$ variables. We rewrite step 2(b) of the boosting algorithm as:\n\n2(b') Find $h \\in \\mathcal{H}$ to maximize $\\sum_{x \\in \\mathcal{X}} w(x) h(x)$, where $w(x) = \\sum_{i: x_i = x} w_i$\n\nIn this formulation, all possible values of $x$ have weights, some of which may be $0$.\n\nAs mentioned above, the two main implementation-specific details in the generic density estimation algorithm are the set $\\mathcal{H}$ of weak models and the method for searching for the \u201coptimal\u201d weak model $h_t$ at each boosting iteration. When boosting Bayesian networks, a natural way of limiting the \u201cstrength\u201d of weak learners in $\\mathcal{H}$ is to limit the complexity of the network structure in $\\mathcal{H}$. This can be done, for instance, by bounding the number of edges in each \u201cweak density estimator\u201d learned during the boosting iterations.\n\nThe problem of finding an \u201coptimal\u201d weak model at each boosting iteration (step 2(b) of the algorithm) is trickier. We first note that if we only impose an $L_1$ constraint on the norm of $h$ (specifically, the PDF constraint $\\sum_x h(x) = 1$), then step 2(b) has a trivial solution, concentrating all the probability at the value of $x$ with the highest \u201cweight\u201d: $h^*(x) = 1\\{x = \\arg\\max_{x'} w(x')\\}$. This phenomenon is not limited to the density estimation case and would appear in boosting for classification if the set of weak learners $\\mathcal{H}$ had fixed $L_1$ norm, rather than the fixed $L_\\infty$ norm, implicitly imposed by limiting $\\mathcal{H}$ to contain classifiers. This consequence of limiting $\\mathcal{H}$ to contain probability distributions is particularly problematic when boosting Bayesian networks, since $h^*$ can be represented with a fully disconnected network. Thus, limiting $\\mathcal{H}$ to \u201csimple\u201d structures by itself does not amend this problem.\n\nHowever, the boosting algorithm does not explicitly require $\\mathcal{H}$ to include only probability distributions. Let us consider instead a somewhat different family of candidate models, with an implicit $L_2$ size constraint, rather than $L_1$ as in the case of probability distributions (note that using an $L_\\infty$ constraint as in Adaboost is not possible, since the trivial optimal solution would be $h^* \\equiv 1$). For the unconstrained \u201cdistribution\u201d case (corresponding to a fully connected Bayesian network), this leads to re-writing step 2(b) of the boosting algorithm as:\n\n2(b)$_1$ Find $h$ to maximize $\\sum_{x \\in \\mathcal{X}} w(x) h(x)$, subject to $\\sum_{x \\in \\mathcal{X}} h(x)^2 = 1$\n\nBy considering the Lagrange multiplier version of this problem it is easy to see that the optimal solution is $h_{L_2}(x) = w(x) / \\sqrt{\\sum_{x'} w(x')^2}$ and is proportional to the optimal solution to the log-likelihood maximization problem:\n\n2(b)$_2$ Find $h$ to maximize $\\sum_{x \\in \\mathcal{X}} w(x) \\log(h(x))$, subject to $\\sum_{x \\in \\mathcal{X}} h(x) = 1$\n\ngiven by $h_{\\log}(x) = w(x) / \\sum_{x'} w(x')$. This fact points to an interesting correspondence between solutions to $L_2$-constrained linear optimization problems and $L_1$-constrained log optimization problems and leads us to believe that good solutions to step 2(b)$_1$ of the boosting algorithm can be approximated by solving step 2(b)$_2$ instead.\n\nThe formulation given in 2(b)$_2$ presents us with a problem that is natural for Bayesian network learning, that of maximizing the log-likelihood (or in this case the weighted log-likelihood $\\sum_x w(x) \\log(h(x))$) of the data given the structure.\n\nOur implementation of the boosting algorithm, therefore, does indeed limit $\\mathcal{H}$ to include probability distributions only, in this case those that can be represented by \u201csimple\u201d Bayesian networks. It solves a constrained version of step 2(b)$_2$ instead of the original version 2(b). Note that this use of \u201csurrogate\u201d optimization tasks is not alien to other boosting applications as well. For example, Adaboost calls for optimizing a re-weighted classification problem at each step; decision trees, the most popular boosting weak learners, search for \u201coptimal\u201d solutions using surrogate loss functions, such as the Gini index for CART [3] or information gain for C4.5 [16].\n\n[Figure 2 panels: (a), (b) plot Avg. Log-likelihood against Boosting Iterations for Boosting, Bayesian Network and, in (a) only, AutoClass; (c), (d) plot Training performance, Weak Learnability and Log(n) against Boosting Iterations.]\n\nFigure 2: (a) Comparison of boosting, single Bayesian network and AutoClass performance on the genomic expression dataset. The average log-likelihood for each test set instance is plotted. (b) Same as (a) for the census dataset. Results for AutoClass were omitted as they were not competitive in this domain (see text). (c) The weak learnability condition is plotted along with training data performance for the genomic expression dataset. The plot is in log-scale and also includes $\\log(n)$ as a reference, where $n$ is the number of training instances. (d) Same as (c) for the census dataset.\n\n5 Experimental Results\n\nWe evaluated the performance of our algorithms on two distinct datasets: a genomic expression dataset and a US census dataset.\u0020
In gene expression data, the level of mRNA transcript of every gene in the cell is measured simultaneously, using DNA microarray technology, allowing researchers to detect functionally related genes based on the correlation of their expression profiles across the various experiments. We combined three yeast expression data sets [10, 12, 19] for a total of 550 expression experiments. To test our methods on a set of correlated variables, we selected 56 genes associated with the oxidative phosphorylation pathway in the KEGG database [1]. We discretized the expression measurements of each gene into three levels (down, same, up) as in [15]. We obtained the 1990 US census data set from the UC Irvine data repository (http://kdd.ics.uci.edu/databases/census1990/USCensus1990.html). The data set includes 68 discretized attributes such as age, income, occupation, work status, etc. We randomly selected 5k entries from the 2.5M available entries in the entire data set.\n\nEach of the data sets was randomly partitioned into 5 equally sized sets and our boosting algorithm was learned from each of the 5 possible combinations of 4 partitions. The performance of each boosting model was evaluated by measuring the log-likelihood achieved on the data instances in the left out partition. We compared the performance achieved to that of a single Bayesian network learned using standard techniques (see [11] and references therein). To test whether our boosting approach gains its performance primarily by using an ensemble of Bayesian networks, we also compared the performance to that achieved by an ensemble of Bayesian networks learned using AutoClass [4], varying the number of classes from 2 to 100. We report results for the setting of AutoClass achieving the best performance. The results are reported as the average log-likelihood measured for each instance in the test data and summarized in Fig. 2(a,b).\u0020
We omit the results of AutoClass for the census data as it was not comparable to boosting and a single Bayesian network, achieving an average test instance log-likelihood of [illegible]. As can be seen, our boosting algorithm performs significantly better, rendering each instance in the test data roughly [illegible] and [illegible] times more likely than it is using other approaches in the genomic and census datasets, respectively.\n\nTo illustrate the theoretical concepts discussed in Section 3, we recorded the performance of our boosting algorithm on the training set for both data sets. As shown in Section 3, if $\\sum_i h(x_i)/F_{t-1}(x_i) > n$, then adding $h$ to the model is guaranteed to improve our training set performance. Theorem 2 relates the magnitude of this difference to the amount of improvement in training set performance. Fig. 2(c,d) plots the weak learnability quantity $\\sum_i h_t(x_i)/F_{t-1}(x_i)$, the training set log-likelihood and the threshold $n$ for both data sets on a log scale. As can be seen, the theory matches nicely, as the improvement is large when the weak learnability condition is large and stops entirely once it asymptotes to $n$.\n\nFinally, boosting theory tells us that the effect of boosting is more pronounced for \u201cweaker\u201d weak learners. To that end, we experimented (data not shown) with various strength parameters for the family of weak learners $\\mathcal{H}$ (number of allowed edges in each Bayesian network, strength of prior). As expected, the overall effect of boosting was much stronger for weaker learners.\n\n6 Discussion and future work\n\nIn this paper we extended the boosting methodology to the domain of density estimation and demonstrated its practical performance on real world datasets.\u0020
We believe that this direction shows promise and hope that our work will lead to other boosting implementations in density estimation as well as other function estimation domains.\n\nOur theoretical results include an exposition of the training data performance of the generic algorithm, proving analogous results to those in the case of boosting for classification. Of particular interest is theorem 1, implying that the idealized algorithm converges, asymptotically, to the global minimum. This result is interesting, as it implies that the greedy boosting algorithm converges to the exhaustive solution. However, this global minimum is usually not a good solution in terms of test-set performance as it will tend to overfit (especially if $\\mathcal{H}$ is not very small). Boosting can be described as generating a regularized path to this optimal solution [17], and thus we can assume that points along the path will usually have better generalization performance than the non-regularized optimum.\n\nIn Section 4 we described the theoretical and practical difficulties in solving the optimization step of the boosting iterations (step 2(b)). We suggested replacing it with a more easily solvable log-optimization problem, a replacement that can be partly justified by theoretical arguments. However, it will be interesting to formulate other cases where the original problem has non-trivial solutions, for instance, by not limiting $\\mathcal{H}$ to probability distributions only and using non-density estimation algorithms to generate the \u201cweak\u201d models $h_t$.\n\nThe popularity of Bayesian networks as density estimators stems from their intuitive interpretation as describing causal relations in data. However, when learning the network structure from data, a major issue is assigning confidence to the learned features. A potential use of boosting could be in improving interpretability and reducing instability in structure learning. If the weak models in $\\mathcal{H}$ are limited to a small number of edges, we can collect and interpret the \u201ctotal influence\u201d of edges in the combined model. This seems like a promising avenue for future research, which we intend to pursue.\n\nAcknowledgements We thank Jerry Friedman, Daphne Koller and Christian Shelton for useful discussions. E. Segal was supported by a Stanford Graduate Fellowship (SGF).\n\nReferences\n\n[1] KEGG: Kyoto encyclopedia of genes and genomes. In http://www.genome.ad.jp/kegg.\n[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, U.K., 1995.\n[3] L. Breiman, J.H. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.\n[4] P. Cheeseman and J. Stutz. Bayesian Classification (AutoClass): Theory and Results. AAAI Press, 1995.\n[5] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, 1973.\n[6] Y. Freund and R.E. Schapire. A decision theoretic generalization of on-line learning and an application to boosting. In the 2nd European Conference on Computational Learning Theory, 1995.\n[7] J.H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, Vol. 29 No. 5, 2001.\n[8] J.H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, Vol. 28 pp. 337-407, 2000.\n[9] N. Friedman and D. Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning Journal, 2002.\n[10] A.P. Gasch, P.T. Spellman, C.M. Kao, O. Carmel-Harel, M.B. Eisen, G. Storz, D. Botstein, and P.O. Brown.\u0020
Genomic expression program in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11:4241\u20134257, 2000.\n\n[11] D. Heckerman. A tutorial on learning with Bayesian networks. In M. I. Jordan, editor, Learning in Graphical Models. MIT Press, Cambridge, MA, 1998.\n\n[12] T. R. Hughes et al. Functional discovery via a compendium of expression profiles. Cell, 102(1):109\u201326, 2000.\n\n[13] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent in function space. In Proc. NIPS, number 12, pages 512\u2013518, 1999.\n\n[14] M. Meila and T. Jaakkola. Tractable Bayesian learning of tree belief networks. Technical Report CMU-RI-TR-00-15, Robotics Institute, Carnegie Mellon University, 2000.\n\n[15] D. Pe\u2019er, A. Regev, G. Elidan, and N. Friedman. Inferring subnetworks from perturbed expression profiles. In ISMB\u201901, 2001.\n\n[16] J.R. Quinlan. C4.5 - Programs for Machine Learning. Morgan-Kaufmann, 1993.\n\n[17] S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a margin maximizer. Submitted to NIPS 2002.\n\n[18] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, Vol. 26 No. 5, 1998.\n\n[19] P. T. Spellman et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9(12):3273\u201397, 1998.\n\n[20] B. Thiesson, C. Meek, and D. Heckerman. Learning mixtures of DAG models. Technical Report MSR-TR-98-12, Microsoft Research, 1997.\n\n[21] R.S. Zemel and T. Pitassi. A gradient-based boosting algorithm for regression problems. In Proc. NIPS, 2001.\n", "award": [], "sourceid": 2298, "authors": [{"given_name": "Saharon", "family_name": "Rosset", "institution": null}, {"given_name": "Eran", "family_name": "Segal", "institution": null}]}