{"title": "Boosting and Maximum Likelihood for Exponential Models", "book": "Advances in Neural Information Processing Systems", "page_first": 447, "page_last": 454, "abstract": null, "full_text": "Boosting and Maximum Likelihood for\n\nExponential Models\n\nGuy Lebanon\n\nSchool of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nJohn Lafferty\n\nSchool of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nlebanon@cs.cmu.edu\n\nlafferty@cs.cmu.edu\n\nAbstract\n\nWe derive an equivalence between AdaBoost and the dual of a convex\noptimization problem, showing that the only difference between mini-\nmizing the exponential loss used by AdaBoost and maximum likelihood\nfor exponential models is that the latter requires the model to be normal-\nized to form a conditional probability distribution over labels. In addi-\ntion to establishing a simple and easily understood connection between\nthe two methods, this framework enables us to derive new regularization\nprocedures for boosting that directly correspond to penalized maximum\nlikelihood. Experiments on UCI datasets support our theoretical analy-\nsis and give additional insight into the relationship between boosting and\nlogistic regression.\n\n1 Introduction\n\nSeveral recent papers in statistics and machine learning have been devoted to the relation-\nship between boosting and more standard statistical procedures such as logistic regression.\nIn spite of this activity, an easy-to-understand and clean connection between these differ-\nent techniques has not emerged. Friedman, Hastie and Tibshirani [7] note the similarity\nbetween boosting and stepwise logistic regression procedures, and suggest a least-squares\nalternative, but view the loss functions of the two problems as different, leaving the precise\nrelationship between boosting and maximum likelihood unresolved. 
Kivinen and Warmuth [8] note that boosting is a form of \u201centropy projection,\u201d and Lafferty [9] suggests the use of Bregman distances to approximate the exponential loss. Mason et al. [10] consider boosting algorithms as functional gradient descent, and Duffy and Helmbold [5] study various loss functions with respect to the PAC boosting property. More recently, Collins, Schapire and Singer [2] show how different Bregman distances precisely account for boosting and logistic regression, and use this framework to give the first convergence proof of AdaBoost. However, in this work the two methods are viewed as minimizing different loss functions. Moreover, the optimization problems are formulated in terms of a reference distribution consisting of the zero vector, rather than the empirical distribution of the data, making the interpretation of this use of Bregman distances problematic from a statistical point of view.\n\nIn this paper we present a very basic connection between boosting and maximum likelihood for exponential models through a simple convex optimization problem. In this setting, it is seen that the only difference between AdaBoost and maximum likelihood for exponential models, in particular logistic regression, is that the latter requires the model to be normalized to form a probability distribution. The two methods minimize the same extended Kullback-Leibler divergence objective function subject to the same feature constraints. Using information geometry, we show that projecting the exponential loss model onto the simplex of conditional probability distributions gives precisely the maximum likelihood exponential model with the specified sufficient statistics. In many cases of practical interest, the resulting models will be identical; in particular, as the number of features increases to fit the training data the two methods will give the same classifiers. 
We note that throughout the paper we view boosting as a procedure for minimizing the exponential loss, using either parallel or sequential update algorithms as in [2], rather than as a forward stepwise procedure as presented in [7] or [6].\n\nGiven the recent interest in these techniques, it is striking that this connection has gone unobserved until now. However, in general there may be many ways of writing the constraints for a convex optimization problem, and many different settings of the Lagrange multipliers (or Kuhn-Tucker vectors) that represent identical solutions. The key to the connection we present here lies in the use of a particular non-standard presentation of the constraints. When viewed in this way, there is no need for special-purpose Bregman distances to give a unified account of boosting and maximum likelihood, as we only make use of the standard Kullback-Leibler divergence. But our analysis gives more than a formal framework for understanding old algorithms; it also leads to new algorithms for regularizing AdaBoost, which is required when the training data is noisy. In particular, we derive a regularization procedure for AdaBoost that directly corresponds to penalized maximum likelihood using a Gaussian prior. Experiments on UCI data support our theoretical analysis, demonstrate the effectiveness of the new regularization method, and give further insight into the relationship between boosting and maximum likelihood exponential models.\n\n2 Notation\n\nLet X and Y be finite sets. We denote by P the set of nonnegative measures p on X × Y, and by Δ the set of conditional probability distributions, Δ = { p ∈ P : Σ_y p(x, y) = 1 for each x ∈ X }. For p ∈ P, we will overload the notation p(x, y) and p(y|x); the latter will be suggestive of a conditional probability distribution, but in general it need not be normalized. Let f_j : X × Y → R, j = 1, ..., m, be given functions, which we will refer to as features. These will correspond to the weak learners in boosting, and to the sufficient statistics in an exponential model. Suppose that we have data {(x_i, y_i)}, i = 1, ..., n, with empirical distribution p̃(x, y) and marginal p̃(x) = Σ_y p̃(x, y). We assume, without loss of generality, that p̃(x) > 0 for all x ∈ X. Throughout the paper, we assume (for notational convenience) that the training data has the following property.\n\nConsistent Data Assumption. For each x ∈ X there is a unique label, denoted ỹ(x), for which p̃(x, ỹ(x)) > 0.\n\nFor most data sets of interest, each x appears only once, so that the assumption trivially holds. However, if x appears more than once, we require that it is labeled consistently.\n\nGiven the features f_j, we define the exponential model q_λ(y|x), for λ ∈ R^m, by q_λ(y|x) = q_0(y|x) exp(Σ_j λ_j f_j(x, y)), where q_0 ∈ P is a fixed default model. The maximum likelihood problem for the corresponding normalized model is to maximize the conditional log-likelihood Σ_{x,y} p̃(x, y) log (q_λ(y|x) / Σ_{y'} q_λ(y'|x)).\n\n3 Boosting and Maximum Likelihood\n\nFor p, q ∈ P, define the extended KL divergence\n\nD(p, q) = Σ_x p̃(x) Σ_y ( p(y|x) log (p(y|x) / q(y|x)) − p(y|x) + q(y|x) ).\n\nIf p and q are normalized then this becomes the more familiar KL divergence for probabilities. Given the features f_j and a fixed default distribution q_0, let\n\nF(p̃, f) = { p ∈ P : Σ_x p̃(x) Σ_y p(y|x) (f_j(x, y) − f_j(x, ỹ(x))) = 0, j = 1, ..., m }. (1)\n\nConsider now the following two convex optimization problems:\n\n(P1) minimize D(p, q_0) subject to p ∈ F(p̃, f)\n\n(P2) minimize D(p, q_0) subject to p ∈ F(p̃, f) and Σ_y p(y|x) = 1 for each x ∈ X.\n\nThus, problem (P2) differs from (P1) only in that the solution is required to be normalized. As we'll show, the dual problem of (P1) corresponds to AdaBoost, and the dual problem of (P2) corresponds to maximum likelihood for exponential models.\n\nThis presentation of the constraints is the key to making the correspondence between AdaBoost and maximum likelihood. Note that the constraint Σ_x p̃(x) Σ_y p(y|x) f_j(x, y) = Σ_x p̃(x) f_j(x, ỹ(x)), which is the usual presentation of the constraints for maximum likelihood (as dual to maximum entropy), doesn't make sense for unnormalized models, since the two sides of the equation may not be \u201con the same scale.\u201d Note further that attempting to rescale by dividing by the mass of p would yield nonlinear constraints.\n\nWe now derive the dual problems formally; the following section gives a precise statement of the duality result. To derive the dual problem of (P1), we calculate the Lagrangian as\n\nL(p, λ) = D(p, q_0) − Σ_j λ_j Σ_x p̃(x) Σ_y p(y|x) (f_j(x, y) − f_j(x, ỹ(x))).\n\nFor λ ∈ R^m, the connecting equation, obtained by setting the derivative of the Lagrangian with respect to p(y|x) to zero, is p(y|x) = q_λ(y|x) with q_λ(y|x) = q_0(y|x) exp(Σ_j λ_j (f_j(x, y) − f_j(x, ỹ(x)))). Substituting back, the dual function is h_1(λ) = Σ_x p̃(x) Σ_y (q_0(y|x) − q_λ(y|x)). Thus, the dual problem is to determine\n\nλ* = argmin_λ Σ_x p̃(x) Σ_y q_0(y|x) exp(Σ_j λ_j (f_j(x, y) − f_j(x, ỹ(x)))). (2)
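The dual objective (2) can be checked numerically. The sketch below (synthetic data; the sizes, names, and constants are illustrative assumptions, not from the paper) minimizes (2) with q_0 = 1 by plain gradient descent, and verifies that at the minimum the gradient, which is exactly the moment residual of the primal constraints (1), vanishes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): n examples, k labels, m features
n, k, m = 20, 3, 5
f = 0.3 * rng.normal(size=(n, k, m))   # f[x, y, j] = f_j(x, y), synthetic
y_obs = rng.integers(0, k, size=n)     # observed labels ytilde(x)
p_tilde = np.full(n, 1.0 / n)          # empirical marginal p~(x)

# g[x, y, j] = f_j(x, y) - f_j(x, ytilde(x)), the constraint features in (1)
g = f - f[np.arange(n), y_obs][:, None, :]

def exp_loss(lam):
    # Dual objective (2) with q_0 = 1
    return float(np.sum(p_tilde[:, None] * np.exp(g @ lam)))

def moment_residual(lam):
    # Gradient of (2); it vanishes exactly when q_lam satisfies constraints (1)
    q = p_tilde[:, None] * np.exp(g @ lam)   # p~(x) q_lam(y|x)
    return np.einsum('xy,xyj->j', q, g)

lam = np.zeros(m)
for _ in range(2000):                  # plain gradient descent (parallel-style update)
    lam -= 0.5 * moment_residual(lam)
```

At convergence the moment residuals vanish, so the exponential loss minimizer is the member of the exponential family lying in the feasible set, as the duality argument above indicates.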
To derive the dual for (P2), we simply add additional Lagrange multipliers μ_x for the constraints Σ_y p(y|x) = 1.\n\n3.1 Special cases\n\nIt is now straightforward to derive various boosting and logistic regression problems as special cases of the above optimization problems.\n\nCase 1: AdaBoost.M2. Take q_0(y|x) = 1. Then the dual problem (2) is equivalent to computing λ* = argmin_λ Σ_x p̃(x) Σ_{y ≠ ỹ(x)} exp(Σ_j λ_j (f_j(x, y) − f_j(x, ỹ(x)))), which is the optimization problem of AdaBoost.M2.\n\nCase 2: Binary AdaBoost. In addition to the assumptions for the previous case, now assume that Y = {−1, +1}, and take f_j(x, y) = y h_j(x)/2 for weak learners h_j. Then the dual problem is given by λ* = argmin_λ Σ_x p̃(x) exp(−ỹ(x) Σ_j λ_j h_j(x)), which is the optimization problem of binary AdaBoost.\n\nCase 3: Maximum Likelihood for Exponential Models. In this case we take the same setup as for AdaBoost.M2 but add the additional normalization constraints Σ_y p(y|x) = 1, x ∈ X. If these constraints are satisfied, then the other constraints take the form Σ_x p̃(x) Σ_y p(y|x) f_j(x, y) = Σ_{x,y} p̃(x, y) f_j(x, y), j = 1, ..., m, and the connecting equation becomes q_λ(y|x) = q_0(y|x) exp(Σ_j λ_j f_j(x, y)) / Z_λ(x), which corresponds to setting the Lagrange multiplier μ_x to the appropriate value; here Z_λ(x) = Σ_y q_0(y|x) exp(Σ_j λ_j f_j(x, y)) is the normalizing term. In this case, after a simple calculation the dual problem is seen to be λ* = argmax_λ Σ_{x,y} p̃(x, y) log q_λ(y|x), which corresponds to maximum likelihood for a conditional exponential model with sufficient statistics f_j(x, y).\n\nCase 4: Logistic Regression. Returning to the case of binary AdaBoost, we see that when we add normalization constraints as above, the model is equivalent to binary logistic regression, since q_λ(1|x) = 1 / (1 + exp(−Σ_j λ_j h_j(x))). We note that it is not necessary to scale the features by a constant factor here, as in [7]; the correspondence between logistic regression and boosting is direct.\n\n3.2 Duality\n\nLet Q and Q_Δ be defined as the following exponential families:\n\nQ = { q ∈ P : q(y|x) = q_0(y|x) exp(Σ_j λ_j f_j(x, y)), λ ∈ R^m }\n\nQ_Δ = { q ∈ Δ : q(y|x) = q_0(y|x) exp(Σ_j λ_j f_j(x, y)) / Z_λ(x), λ ∈ R^m }.\n\nThus Q is unnormalized while Q_Δ is normalized.
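The normalized and unnormalized binary families differ only by the factor Z_λ(x), which is the difference between the Case 2 and Case 4 objectives. The sketch below (synthetic data; all names and sizes are illustrative assumptions) evaluates both objectives on the same margins and checks that the logistic loss is exactly the negative log-likelihood of the normalized model q_λ(1|x) = 1/(1 + exp(−Σ_j λ_j h_j(x))):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic binary problem (illustrative): weak-learner outputs and labels in {-1, +1}
n, m = 50, 4
h = rng.choice([-1.0, 1.0], size=(n, m))   # h[x, j] = h_j(x)
y = rng.choice([-1.0, 1.0], size=n)        # ytilde(x)
lam = rng.normal(size=m)

margins = y * (h @ lam)                    # margin of each example under lambda

# Case 2: binary AdaBoost objective (unnormalized model)
exp_loss = np.mean(np.exp(-margins))
# Case 4: logistic regression objective (normalized model)
log_loss = np.mean(np.log1p(np.exp(-margins)))

# Negative log-likelihood of the normalized binary model of Case 4
q_pos = 1.0 / (1.0 + np.exp(-(h @ lam)))   # q_lam(+1 | x)
nll = -np.mean(np.log(np.where(y > 0, q_pos, 1.0 - q_pos)))
```

Since log(1 + t) < t for t > 0, the exponential loss upper-bounds the logistic loss; both are monotone functions of the same margin, which is the sense in which the two problems share the same constraints and differ only in normalization.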
We now define the boosting solution q*_boost and the maximum likelihood solution q*_ml as\n\nq*_boost = argmin_{q ∈ cl(Q)} D(p̃, q) and q*_ml = argmin_{q ∈ cl(Q_Δ)} D(p̃, q),\n\nwhere cl(Q) denotes the closure of the set Q. The following theorem corresponds to Proposition 4 of [3] for the usual KL divergence; in [4] the duality theorem is proved for a general class of Bregman distances, including the extended KL divergence as a special case. Note that we do not work with divergences such as D(0, q) as in [2], but rather with D(p̃, q), which is more natural and interpretable from a statistical point of view.\n\nTheorem. Suppose that D(p̃, q_0) < ∞. Then q*_boost and q*_ml exist, are unique, and satisfy\n\nq*_boost = argmin_{p ∈ F(p̃, f)} D(p, q_0) and q*_ml = argmin_{p ∈ F(p̃, f) ∩ Δ} D(p, q_0).\n\nMoreover, q*_ml is computed in terms of q*_boost as q*_ml = argmin_{p ∈ F(p̃, f) ∩ Δ} D(p, q*_boost).\n\nFigure 1: Geometric view of the duality theorem. Minimizing the exponential loss finds the member of cl(Q) that intersects with the feasible set of measures satisfying the moment constraints (left). When we impose the additional constraint that each conditional distribution must be normalized, we introduce a Lagrange multiplier for each training example x, giving a higher-dimensional family Q_μ. By the duality theorem, projecting the exponential loss solution onto the intersection of the feasible set with the simplex gives the maximum likelihood solution.\n\nThis result has a simple geometric interpretation. 
The unnormalized exponential family cl(Q) intersects the feasible set F(p̃, f) of measures satisfying the constraints (1) at a single point. The algorithms presented in [2] determine this point, which is the exponential loss solution q*_boost = argmin_{q ∈ cl(Q)} D(p̃, q) (see Figure 1, left). On the other hand, maximum conditional likelihood estimation for an exponential model with the same features is equivalent to the problem q*_ml = argmin_{q ∈ cl(Q_μ) ∩ Δ} D(p̃, q), where Q_μ is the exponential family with additional Lagrange multipliers, one for each normalization constraint. The feasible set for this problem is F(p̃, f) ∩ Δ. Since q*_boost ∈ F(p̃, f), by the Pythagorean equality we have that q*_ml = argmin_{p ∈ F(p̃, f) ∩ Δ} D(p, q*_boost) (see Figure 1, right).\n\n4 Regularization\n\nMinimizing the exponential loss or the log loss on real data often fails to produce finite parameters. Specifically, this happens when for some feature f_j\n\nf_j(x, y) − f_j(x, ỹ(x)) ≤ 0 for all y and all x with p̃(x) > 0, or\nf_j(x, y) − f_j(x, ỹ(x)) ≥ 0 for all y and all x with p̃(x) > 0. (3)\n\nThis is especially harmful since often the features for which (3) holds are the most important for the purpose of discrimination. Of course, even when (3) does not hold, models trained by maximum likelihood or the exponential loss can overfit the training data. A standard regularization technique in the case of maximum likelihood employs parameter priors in a Bayesian framework. See [11] for non-Bayesian alternatives in the context of boosting.\n\nIn terms of convex duality, parameter priors for the dual problem correspond to \u201cpotentials\u201d on the constraint values in the primal problem. The case of a Gaussian prior on λ, for example, corresponds to a quadratic potential on the constraint values in the primal problem.\n\nWe now consider primal problems over pairs (p, c), where c ∈ R^m is a parameter vector that relaxes the original constraints. Define\n\nF(p̃, f, c) = { p ∈ P : Σ_x p̃(x) Σ_y p(y|x) (f_j(x, y) − f_j(x, ỹ(x))) = c_j, j = 1, ..., m }\n\nand consider the primal problem (P1,reg) given by\n\nminimize D(p, q_0) + U(c) subject to p ∈ F(p̃, f, c), (4)\n\nwhere U is a convex function whose minimum is at 0. To derive the dual problem, the Lagrangian is calculated as L(p, c, λ) = L_1(p, λ) + U(c) + Σ_j λ_j c_j, where L_1 is the Lagrangian of the unregularized problem (P1). Minimizing over c, the dual function becomes h_1,reg(λ) = h_1(λ) − U*(−λ), where U* is the convex conjugate of U. For a quadratic penalty U(c) = (1/2) Σ_j σ_j² c_j², we have U*(−λ) = Σ_j λ_j² / (2σ_j²), and the dual problem becomes\n\nλ* = argmin_λ ( Σ_x p̃(x) Σ_y q_0(y|x) exp(Σ_j λ_j (f_j(x, y) − f_j(x, ỹ(x)))) + Σ_j λ_j² / (2σ_j²) ), (5)\n\nwhich in the normalized case corresponds to penalized maximum likelihood with a Gaussian prior of variance σ_j² on λ_j. A sequential update rule for (5) incurs the small additional cost of solving a nonlinear equation by Newton-Raphson every iteration. 
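Such a sequential update for (5) can be sketched as follows (synthetic data; the sizes, names, and prior variances are illustrative assumptions, not the paper's implementation). Each coordinate update solves its one-dimensional stationarity condition by Newton-Raphson; feature 0 is constructed to satisfy condition (3), so its unregularized weight would diverge, while the quadratic penalty keeps it finite:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic binary data (illustrative); feature 0 perfectly predicts the label,
# i.e. it satisfies condition (3)
n, m = 40, 3
y = rng.choice([-1.0, 1.0], size=n)
h = rng.choice([-1.0, 1.0], size=(n, m))
h[:, 0] = y

sigma2 = np.ones(m)                    # Gaussian prior variances sigma_j^2

def penalized_loss(lam):
    # Objective (5): exponential loss plus quadratic penalty
    return np.mean(np.exp(-y * (h @ lam))) + 0.5 * np.sum(lam**2 / sigma2)

lam = np.zeros(m)
for _ in range(100):                   # sequential (coordinate-wise) sweeps
    for j in range(m):
        w = np.exp(-y * (h @ lam))     # current example weights
        s = y * h[:, j]
        d = 0.0
        for _ in range(20):            # Newton-Raphson on the 1-D update d
            grad = -np.mean(s * w * np.exp(-d * s)) + (lam[j] + d) / sigma2[j]
            curv = np.mean(s**2 * w * np.exp(-d * s)) + 1.0 / sigma2[j]
            d -= grad / curv
        lam[j] += d
```

Without the penalty, the update for feature 0 has no finite stationary point; with it, every one-dimensional update is well defined and the parameters stay bounded.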
See [1] for a discussion of this technique in the context of exponential models in statistical language modeling.\n\n5 Experiments\n\nWe performed experiments on some of the UC Irvine datasets in order to investigate the relationship between boosting and maximum likelihood empirically. The weak learner was the decision stump FindAttrTest as described in [6], and the training set consisted of a randomly chosen 90% of the data. Table 1 shows experiments with regularized boosting.\n\nTwo boosting models are compared. The first model, q_1, was trained for 10 features generated by FindAttrTest, excluding features satisfying condition (3). Training was carried out using the parallel update method described in [2]. The second model, q_2, was trained using the exponential loss with quadratic regularization. The performance was measured using the conditional log-likelihood of the (normalized) models over the training and test set, denoted ℓ_train and ℓ_test, as well as using the test error rate ε_test. The table entries were averaged by 10-fold cross validation.\n\nFor the weak learner FindAttrTest, only the Iris dataset produced features that satisfy (3). On average, 4 out of the 10 features were removed. As the flexibility of the weak learner is increased, (3) is expected to hold more often. On this dataset regularization improves both the test set log-likelihood and error rate considerably. In datasets where q_1 shows significant overfitting, regularization improves both the log-likelihood measure and the error rate. In cases of little overfitting (according to the log-likelihood measure), regularization only improves the test set log-likelihood at the expense of the training set log-likelihood, however without affecting test set error.\n\nNext we performed a set of experiments to test how much q*_boost differs from q*_ml, where the boosting model is normalized (after training) to form a conditional probability distribution. 
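The evaluation measures used here, the conditional log-likelihood ℓ of a normalized model and the error rate ε, can be computed as in the following sketch (the interface and the toy numbers are illustrative assumptions):

```python
import numpy as np

def eval_metrics(scores, y):
    # scores[i, y] = sum_j lam_j f_j(x_i, y); normalizing per row gives q(y|x_i).
    # Numerically stable log-normalization (log-sum-exp):
    z = scores - scores.max(axis=1, keepdims=True)
    logq = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ll = logq[np.arange(len(y)), y].mean()          # average conditional log-likelihood
    err = float((logq.argmax(axis=1) != y).mean())  # error rate
    return ll, err

# Toy scores for three examples and two labels (illustrative)
scores = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0, 1, 0])
ll, err = eval_metrics(scores, y)
```

The same routine evaluates either model: a boosting model is simply normalized per example before its log-likelihood is read off.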
For different experiments, FindAttrTest generated a different number of features (10\u2013100), and the training set was selected randomly. The top row in Figure 2 shows, for the Sonar dataset, the relationship between ℓ_train(q*_ml) and ℓ_train(q*_boost), as well as between ℓ_train(q*_ml) and D(q*_ml, q*_boost). As the number of features increases so that the training data is more closely fit (ℓ_train(q*_ml) → 0), the boosting and maximum likelihood models become more similar, as measured by the KL divergence. This result does not hold when the model is unidentifiable and the two models diverge in arbitrary directions.\n\nTable 1: Comparison of unregularized to regularized boosting. For both the regularized and unregularized cases, the first column shows training log-likelihood, the second column shows test log-likelihood, and the third column shows test error rate. Regularization reduces error rate in some cases while it consistently improves the test set log-likelihood measure on all datasets. All entries were averaged using 10-fold cross validation.\n\nData | Unregularized: ℓ_train, ℓ_test, ε_test | Regularized: ℓ_train, ℓ_test, ε_test\nPromoters | -0.29, -0.60, 0.28 | -0.32, -0.50, 0.26\nIris | -0.29, -1.16, 0.21 | -0.10, -0.20, 0.09\nSonar | -0.22, -0.58, 0.25 | -0.26, -0.48, 0.19\nGlass | -0.82, -0.90, 0.36 | -0.84, -0.90, 0.36\nIonosphere | -0.18, -0.36, 0.13 | -0.21, -0.28, 0.10\nHepatitis | -0.28, -0.42, 0.19 | -0.28, -0.39, 0.19\nBreast | -0.12, -0.14, 0.04 | -0.12, -0.14, 0.04\nPima | -0.48, -0.53, 0.26 | -0.48, -0.52, 0.25\n\nThe bottom row in Figure 2 shows the relationship between the test set log-likelihoods ℓ_test(q*_ml) and ℓ_test(q*_boost), together with the test set error rates ε_test(q*_ml) and ε_test(q*_boost). In these figures the testing set was chosen to be 50% of the total data. In order to indicate the number of points at each error rate, each circle was shifted by a small random value to avoid points falling on top of each other. While ℓ_train(q*_ml) ≥ ℓ_train(q*_boost), as expected, the plots in the bottom row of Figure 2 indicate that on the test data the linear trend is reversed, so that ℓ_test(q*_boost) ≥ ℓ_test(q*_ml). Identical experiments on Hepatitis, Glass and Promoters resulted in similar results and are omitted due to lack of space.\n\nThe duality result suggests a possible explanation for the higher performance of boosting with respect to ℓ_test. The boosting model is less constrained due to the lack of normalization constraints, and therefore has a smaller D-divergence to the uniform model. This may be interpreted as a higher extended entropy, or a less concentrated conditional model.\n\nHowever, as the difference between ℓ_train(q*_ml) and ℓ_train(q*_boost) goes to 0, the two models come to agree (up to identifiability). It is easy to show that for any exponential model q_λ in the normalized family, D(q*_ml, q_λ) = ℓ_train(q*_ml) − ℓ_train(q_λ). By taking q_λ to be the normalized boosting model, it is seen that as the difference between ℓ_train(q*_ml) and ℓ_train(q*_boost) gets smaller, the divergence between the two models also gets smaller. The empirical results are consistent with the theoretical analysis. As the number of features is increased so that the training data is fit more closely, the model matches the empirical distribution p̃ and the normalizing term Z_λ(x) becomes a constant. In this case, normalizing the boosting model q*_boost does not violate the constraints, and results in the maximum likelihood model.\n\nAcknowledgments\n\nWe thank Michael Collins, Michael Jordan, Andrew Ng, Fernando Pereira, Rob Schapire, and Yair Weiss for helpful comments on an early version of this paper. Part of this work was carried out while the second author was visiting the Department of Statistics, University of California at Berkeley.
Figure 2: Comparison of AdaBoost and maximum likelihood for the Sonar dataset. The top row compares ℓ_train(q*_ml) to ℓ_train(q*_boost) (left) and ℓ_train(q*_ml) to D(q*_ml, q*_boost) (right). The bottom row shows the relationship between ℓ_test(q*_ml) and ℓ_test(q*_boost) (left) and between ε_test(q*_ml) and ε_test(q*_boost) (right). The experimental results for other UCI datasets were very similar.\n\nReferences\n\n[1] S. Chen and R. Rosenfeld. A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8(1), 2000.\n\n[2] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, to appear.\n\n[3] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 1997.\n\n[4] S. Della Pietra, V. Della Pietra, and J. Lafferty. Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS-01-109, Carnegie Mellon University, 2001.\n\n[5] N. Duffy and D. Helmbold. Potential boosters? In Neural Information Processing Systems, 2000.\n\n[6] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning, 1996.\n\n[7] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2), 2000.\n\n[8] J. Kivinen and M. K. Warmuth. Boosting as entropy projection. In Computational Learning Theory, 1999.\n\n[9] J. Lafferty. Additive models, boosting, and inference for generalized divergences. In Computational Learning Theory, 1999.\n\n[10] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, 1999.\n\n[11] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 2001.\n", "award": [], "sourceid": 2042, "authors": [{"given_name": "Guy", "family_name": "Lebanon", "institution": null}, {"given_name": "John", "family_name": "Lafferty", "institution": null}]}