{"title": "Variance Penalizing AdaBoost", "book": "Advances in Neural Information Processing Systems", "page_first": 1908, "page_last": 1916, "abstract": "This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART).", "full_text": "Variance Penalizing AdaBoost

Pannagadatta K. Shivaswamy
Department of Computer Science
Cornell University, Ithaca NY
pannaga@cs.cornell.edu

Tony Jebara
Department of Computer Science
Columbia University, New York NY
jebara@cs.columbia.edu

Abstract

This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners.
Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART).

1 Introduction

Many machine learning algorithms implement empirical risk minimization or a regularized variant of it. For example, the popular AdaBoost [4] algorithm minimizes exponential loss on the training examples. Similarly, the support vector machine [11] minimizes hinge loss on the training examples. The convexity of these losses is helpful for computational as well as generalization reasons [2].

The goal of most learning problems, however, is not to obtain a function that performs well on training data, but rather to estimate a function (using training data) that performs well on future unseen test data. Therefore, empirical risk minimization on the training set is often performed while regularizing the complexity of the function classes being explored. The rationale behind this regularization approach is that it ensures that the empirical risk converges (uniformly) to the true unknown risk. Various concentration inequalities formalize the rate of convergence in terms of the function class complexity and the number of samples.

A key tool in obtaining such concentration inequalities is Hoeffding's inequality, which relates the empirical mean of a bounded random variable to its true mean. Bernstein's and Bennett's inequalities relate the true mean of a random variable to its empirical mean but also incorporate the true variance of the random variable. If the true variance of a random variable is small, these bounds can be significantly tighter than Hoeffding's bound.
Recently, empirical counterparts of Bernstein's inequality have been derived [1, 5]; these bounds incorporate the empirical variance of a random variable rather than its true variance. The advantage of these bounds is that the quantities they involve are empirical. Previously, these bounds have been applied in sampling procedures [6] and in multi-armed bandit problems [1]. An alternative to empirical risk minimization, called sample variance penalization [5], has been proposed and is motivated by empirical Bernstein bounds.

A new boosting algorithm is proposed in this paper which implements sample variance penalization. The algorithm minimizes the empirical risk on the training set as well as the empirical variance. The two quantities (the risk and the variance) are traded off through a scalar parameter. Moreover, the algorithm proposed in this article does not require exhaustive enumeration of the weak learners (unlike an earlier algorithm by [10]).

Assume that a training set $(X_i, y_i)_{i=1}^n$ is provided, where $X_i \in \mathcal{X}$ and $y_i \in \{\pm 1\}$ are drawn independently and identically distributed (iid) from a fixed but unknown distribution $D$. The goal is to learn a classifier, i.e., a function $f : \mathcal{X} \to \{\pm 1\}$, that performs well on test examples drawn from the same distribution $D$. In the rest of this article, $G : \mathcal{X} \to \{\pm 1\}$ denotes the so-called weak learner. The notation $G_s$ denotes the weak learner in a particular iteration $s$. Further, the two index sets $I_s$ and $J_s$ denote, respectively, the examples that the weak learner $G_s$ correctly classified and misclassified, i.e., $I_s := \{i \mid G_s(X_i) = y_i\}$ and $J_s := \{j \mid G_s(X_j) \neq y_j\}$.

Algorithm 1 AdaBoost
Require: (X_i, y_i)_{i=1}^n and weak learners H
  Initialize the weights: w_i <- 1/n for i = 1, ..., n; initialize f to predict zero on all inputs.
  for s <- 1 to S do
    Estimate a weak learner G_s(.) from training examples weighted by (w_i)_{i=1}^n
    alpha_s = (1/2) log( (sum_{i: G_s(X_i)=y_i} w_i) / (sum_{j: G_s(X_j)!=y_j} w_j) )
    if alpha_s <= 0 then break end if
    f(.) <- f(.) + alpha_s G_s(.)
    w_i <- w_i exp(-y_i G_s(X_i) alpha_s) / Z_s, where Z_s is such that sum_{i=1}^n w_i = 1
  end for

Algorithm 2 VadaBoost
Require: (X_i, y_i)_{i=1}^n, scalar parameter 0 <= lambda <= 1, and weak learners H
  Initialize the weights: w_i <- 1/n for i = 1, ..., n; initialize f to predict zero on all inputs.
  for s <- 1 to S do
    u_i <- lambda n w_i^2 + (1 - lambda) w_i
    Estimate a weak learner G_s(.) from training examples weighted by (u_i)_{i=1}^n
    alpha_s = (1/4) log( (sum_{i: G_s(X_i)=y_i} u_i) / (sum_{j: G_s(X_j)!=y_j} u_j) )
    if alpha_s <= 0 then break end if
    f(.) <- f(.) + alpha_s G_s(.)
    w_i <- w_i exp(-y_i G_s(X_i) alpha_s) / Z_s, where Z_s is such that sum_{i=1}^n w_i = 1
  end for

2 Algorithms

In this section, we briefly discuss AdaBoost [4] and then propose a new algorithm called VadaBoost. The derivation of VadaBoost will be provided in detail in the next section.

AdaBoost (Algorithm 1) assigns a weight $w_i$ to each training example. In each step of AdaBoost, a weak learner $G_s(\cdot)$ is obtained on the weighted examples and a weight $\alpha_s$ is assigned to it. Thus, AdaBoost iteratively builds $\sum_{s=1}^S \alpha_s G_s(\cdot)$. If a training example is correctly classified, its weight is exponentially decreased; if it is misclassified, its weight is exponentially increased. The process is repeated until a stopping criterion is met.
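To make the updates concrete, the following is a minimal sketch of Algorithm 2 using decision stumps as the weak learners. The function names, the brute-force stump search, and the small epsilon guard inside the logarithm are our own illustrative choices, not part of the paper's specification:

```python
import numpy as np

def stump(X, y, u):
    """Pick the (feature, threshold, polarity) stump minimizing weighted error under u."""
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(X[:, f] <= t, pol, -pol)
                err = u[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, f, t, pol)
    _, f, t, pol = best
    return lambda Z: np.where(Z[:, f] <= t, pol, -pol)

def vadaboost(X, y, lam=0.5, S=20):
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []                                  # list of (alpha, weak learner) pairs
    for _ in range(S):
        u = lam * n * w**2 + (1 - lam) * w         # VadaBoost per-example weights
        G = stump(X, y, u)
        pred = G(X)
        correct = pred == y
        # epsilon guards against a perfect weak learner (zero weighted error)
        alpha = 0.25 * np.log((u[correct].sum() + 1e-12) / (u[~correct].sum() + 1e-12))
        if alpha <= 0:
            break
        ensemble.append((alpha, G))
        w = w * np.exp(-y * pred * alpha)
        w /= w.sum()                               # renormalize so the weights sum to one
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * G(X) for a, G in ensemble))
```

Setting lam=0 recovers (the squared-cost form of) AdaBoost's reweighting, while lam=1 weights examples purely by the variance term.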
AdaBoost essentially performs empirical risk minimization, $\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n e^{-y_i f(X_i)}$, by greedily constructing the function $f(\cdot)$ via $\sum_{s=1}^S \alpha_s G_s(\cdot)$.

Recently an alternative to empirical risk minimization has been proposed. This new criterion, known as sample variance penalization [5], trades off the empirical risk with the empirical variance:

$$\arg\min_{f \in \mathcal{F}} \; \frac{1}{n}\sum_{i=1}^n l(f(X_i), y_i) + \tau\sqrt{\frac{\hat{V}[l(f(X), y)]}{n}}, \qquad (1)$$

where $\tau \ge 0$ explores the trade-off between the two quantities. The motivation for sample variance penalization comes from the following theorem [5]:

Theorem 1 Let $(X_i, y_i)_{i=1}^n$ be drawn iid from a distribution $D$. Let $\mathcal{F}$ be a class of functions $f : \mathcal{X} \to \mathbb{R}$. Then, for a loss $l : \mathbb{R} \times \mathcal{Y} \to [0, 1]$ and any $\delta > 0$, with probability at least $1 - \delta$, $\forall f \in \mathcal{F}$,

$$\mathbb{E}[l(f(X), y)] \le \frac{1}{n}\sum_{i=1}^n l(f(X_i), y_i) + \sqrt{\frac{18\,\hat{V}[l(f(X), y)]\ln(\mathcal{M}(n)/\delta)}{n}} + \frac{15\ln(\mathcal{M}(n)/\delta)}{n-1}, \qquad (2)$$

where $\mathcal{M}(n)$ is a complexity measure.

From the above uniform convergence result, it can be argued that future loss can be minimized by minimizing the right hand side of the bound on training examples. Since the variance term $\hat{V}[l(f(X), y)]$ has a multiplicative factor involving $\mathcal{M}(n)$, $\delta$ and $n$, for a given problem it is difficult to specify the relative importance between empirical risk and empirical variance a priori. Hence, sample variance penalization (1) necessarily involves a trade-off parameter $\tau$.

Empirical risk minimization or sample variance penalization on the 0-1 loss is a hard problem; this problem is often circumvented by minimizing a convex upper bound on the 0-1 loss. In this paper, we consider the exponential loss $l(f(X), y) := e^{-y f(X)}$.
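The trade-off in (1) is easy to compute directly from a vector of losses. A small illustrative snippet (the function name is ours): two classifiers with identical empirical risk are separated by the variance term for any $\tau > 0$.

```python
import numpy as np

def svp_objective(losses, tau):
    # Empirical risk plus tau * sqrt(sample variance / n), as in (1).
    losses = np.asarray(losses, dtype=float)
    n = len(losses)
    return losses.mean() + tau * np.sqrt(np.var(losses, ddof=1) / n)
```

For example, losses [0.5, 0.5, 0.5, 0.5] and [0.0, 1.0, 0.0, 1.0] have the same mean, so they tie at tau = 0, but sample variance penalization prefers the first, lower-variance profile whenever tau > 0.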
With the above loss, it was shown by [10] that sample variance penalization is equivalent to minimizing the following cost:

$$\left(\sum_{i=1}^n e^{-y_i f(X_i)}\right)^2 + \lambda\left(n\sum_{i=1}^n e^{-2y_i f(X_i)} - \left(\sum_{i=1}^n e^{-y_i f(X_i)}\right)^2\right). \qquad (3)$$

Theorem 1 requires that the loss function be bounded. Even though the exponential loss is unbounded, boosting is typically performed only for a finite number of iterations in most practical applications. Moreover, since weak learners typically perform only slightly better than random guessing, each $\alpha_s$ in AdaBoost (or in VadaBoost) is typically small, thus limiting the range of the function learned. Furthermore, experiments will confirm that sample variance penalization results in a significant empirical performance improvement over empirical risk minimization.

Our proposed algorithm is called VadaBoost(1) and is described in Algorithm 2. VadaBoost iteratively performs sample variance penalization (i.e., it minimizes the cost (3) iteratively). Clearly, VadaBoost shares the simplicity and ease of implementation found in AdaBoost.

(1) The V in VadaBoost emphasizes the fact that Algorithm 2 penalizes the empirical variance.

3 Derivation of VadaBoost

In the $s$th iteration, our objective is to choose a weak learner $G_s$ and a weight $\alpha_s$ such that $\sum_{t=1}^{s-1} \alpha_t G_t(\cdot) + \alpha_s G_s(\cdot)$ reduces the cost (3). Denote by $w_i$ the quantity $e^{-y_i \sum_{t=1}^{s-1} \alpha_t G_t(X_i)}/Z_s$. Given a candidate weak learner $G_s(\cdot)$, the cost (3) for the function $\sum_{t=1}^{s-1} \alpha_t G_t(\cdot) + \alpha G_s(\cdot)$ can be expressed, up to a multiplicative factor, as a function of $\alpha$:

$$V(\alpha; \mathbf{w}, \lambda, I, J) := \left(\sum_{i \in I} w_i e^{-\alpha} + \sum_{j \in J} w_j e^{\alpha}\right)^2 + \lambda\left(n\sum_{i \in I} w_i^2 e^{-2\alpha} + n\sum_{j \in J} w_j^2 e^{2\alpha} - \left(\sum_{i \in I} w_i e^{-\alpha} + \sum_{j \in J} w_j e^{\alpha}\right)^2\right). \qquad (4)$$

In the quantity above, $I$ and $J$ are the two index sets (of correctly classified and incorrectly classified examples) over $G_s$. Let the vector $\mathbf{w}$, whose $i$th component is $w_i$, denote the current set of weights on the training examples. Here, we have dropped the subscripts/superscripts $s$ for brevity.

Lemma 2 The update of $\alpha_s$ in Algorithm 2 minimizes the cost

$$U(\alpha; \mathbf{w}, \lambda, I, J) := \left(\sum_{i \in I} \big(\lambda n w_i^2 + (1-\lambda) w_i\big)\right) e^{-2\alpha} + \left(\sum_{j \in J} \big(\lambda n w_j^2 + (1-\lambda) w_j\big)\right) e^{2\alpha}. \qquad (5)$$

Proof By obtaining the second derivative of the above expression (with respect to $\alpha$), it is easy to see that it is convex in $\alpha$. Thus, setting the derivative with respect to $\alpha$ to zero gives the optimal choice of $\alpha$ as shown in Algorithm 2.

Theorem 3 Assume that $0 \le \lambda \le 1$ and $\sum_{i=1}^n w_i = 1$ (i.e., normalized weights). Then $V(\alpha; \mathbf{w}, \lambda, I, J) \le U(\alpha; \mathbf{w}, \lambda, I, J)$ and $V(0; \mathbf{w}, \lambda, I, J) = U(0; \mathbf{w}, \lambda, I, J)$. That is, $U$ is an upper bound on $V$ and the bound is exact at $\alpha = 0$.

Proof Denote $1 - \lambda$ by $\bar{\lambda}$, and write $A := \sum_{i \in I} w_i$ and $B := \sum_{j \in J} w_j$, so that $A + B = 1$. Regrouping the terms of (4),

$$V(\alpha; \mathbf{w}, \lambda, I, J) = \lambda\Big(n\sum_{i \in I} w_i^2 e^{-2\alpha} + n\sum_{j \in J} w_j^2 e^{2\alpha}\Big) + \bar{\lambda}\big(A e^{-\alpha} + B e^{\alpha}\big)^2.$$

Expanding the square gives $\bar{\lambda}\big(A^2 e^{-2\alpha} + B^2 e^{2\alpha} + 2AB\big)$. Using $A + B = 1$ (so $1 - A = B$ and $1 - B = A$),

$$A^2 e^{-2\alpha} + B^2 e^{2\alpha} + 2AB = A e^{-2\alpha} + B e^{2\alpha} - AB\big(e^{-2\alpha} + e^{2\alpha} - 2\big).$$

Therefore,

$$V(\alpha; \mathbf{w}, \lambda, I, J) = \sum_{i \in I}\big(\lambda n w_i^2 + \bar{\lambda} w_i\big) e^{-2\alpha} + \sum_{j \in J}\big(\lambda n w_j^2 + \bar{\lambda} w_j\big) e^{2\alpha} - \bar{\lambda} A B \big(e^{\alpha} - e^{-\alpha}\big)^2 \le U(\alpha; \mathbf{w}, \lambda, I, J),$$

since $\bar{\lambda}, A, B \ge 0$ and $(e^{\alpha} - e^{-\alpha})^2 \ge 0$. The discarded term vanishes when $\alpha = 0$; hence the bound is exact at $\alpha = 0$.

Corollary 4 VadaBoost monotonically decreases the cost (3).

The above corollary follows from:

$$V(\alpha_s; \mathbf{w}, \lambda, I, J) \le U(\alpha_s; \mathbf{w}, \lambda, I, J) < U(0; \mathbf{w}, \lambda, I, J) = V(0; \mathbf{w}, \lambda, I, J).$$

In the above, the first inequality follows from Theorem 3. The second strict inequality holds because $\alpha_s$ is a minimizer of $U$ from Lemma 2; it is not hard to show that $U(\alpha_s; \mathbf{w}, \lambda, I, J)$ is strictly less than $U(0; \mathbf{w}, \lambda, I, J)$ from the termination criterion of VadaBoost. The third equality again follows from Theorem 3. Finally, we notice that $V(0; \mathbf{w}, \lambda, I, J)$ merely corresponds to the cost (3) at $\sum_{t=1}^{s-1} \alpha_t G_t(\cdot)$.
Thus, we have shown that taking a step $\alpha_s$ decreases the cost (3).

Figure 1: Typical upper bound $U(\alpha; \mathbf{w}, \lambda, I, J)$ and actual cost function $V(\alpha; \mathbf{w}, \lambda, I, J)$ values under varying $\alpha$. The bound is exact at $\alpha = 0$ and gets closer to the actual function value as $\lambda$ grows. The left plot shows the bound for $\lambda = 0$ and the right plot shows it for $\lambda = 0.9$.

We point out that we use a different upper bound in each iteration, since $V$ and $U$ are parameterized by the current weights in the VadaBoost algorithm. Also note that our upper bound holds only for $0 \le \lambda \le 1$. Although this choice seems restrictive, intuitively it is natural to place a higher penalty on the empirical mean than on the empirical variance during minimization. Moreover, a closer look at the empirical Bernstein inequality in [5] shows that the empirical variance term is multiplied by $\sqrt{1/n}$ while the empirical mean is multiplied by one. Thus, for large values of $n$, the weight on the sample variance is small. Furthermore, our experiments suggest that restricting $\lambda$ to this range does not significantly change the results.

4 How good is the upper bound?

First, we observe that our upper bound is exact when $\lambda = 1$, and loosest when $\lambda = 0$. We visualize the upper bound and the true cost for these two settings of $\lambda$ in Figure 1. Since the cost (4) is minimized via an upper bound (5), a natural question is: how good is this approximation? We evaluate the tightness of this upper bound by considering its impact on learning efficiency. As is clear from Figure 1, when $\lambda = 1$ the upper bound is exact and incurs no inefficiency. In the other extreme, when $\lambda = 0$, the cost of VadaBoost coincides with that of AdaBoost and the bound is effectively at its loosest.
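The relationship between the two costs in Theorem 3 can also be sanity-checked numerically. A sketch (function names are ours): for normalized weights and $0 \le \lambda \le 1$, $U$ should dominate $V$ everywhere and touch it at $\alpha = 0$.

```python
import numpy as np

def cost_V(alpha, w, lam, I, J):
    # Cost (4): squared loss mean plus lam times the scaled variance surrogate.
    n = len(w)
    mean_term = w[I].sum() * np.exp(-alpha) + w[J].sum() * np.exp(alpha)
    sq_term = n * ((w[I] ** 2).sum() * np.exp(-2 * alpha)
                   + (w[J] ** 2).sum() * np.exp(2 * alpha))
    return mean_term ** 2 + lam * (sq_term - mean_term ** 2)

def cost_U(alpha, w, lam, I, J):
    # Upper bound (5), built from the per-example weights u_i = lam*n*w_i^2 + (1-lam)*w_i.
    u = lam * len(w) * w ** 2 + (1 - lam) * w
    return u[I].sum() * np.exp(-2 * alpha) + u[J].sum() * np.exp(2 * alpha)
```

Evaluating both on a grid of alpha values for random normalized weights reproduces the picture in Figure 1: the gap closes as lam approaches 1 and is zero at alpha = 0 for every lam.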
Even in this extreme case, VadaBoost derived through an upper bound requires at most twice the number of iterations of AdaBoost to achieve a particular cost. The following theorem shows that our algorithm remains efficient even in this worst-case scenario.

Theorem 5 Let $O_A$ denote the squared cost obtained by AdaBoost after $S$ iterations. For weak learners in any iteration achieving a fixed error rate $\epsilon < 0.5$, VadaBoost with the setting $\lambda = 0$ attains a cost at least as low as $O_A$ in no more than $2S$ iterations.

Proof Denote the weight on example $i$ in the $s$th iteration by $w_i^s$. The weighted error rate of the $s$th classifier is $\epsilon_s = \sum_{j \in J_s} w_j^s$. For both algorithms we have

$$w_i^{S+1} = \frac{w_i^S \exp(-y_i \alpha_S G_S(X_i))}{Z_S} = \frac{\exp\big(-y_i \sum_{s=1}^S \alpha_s G_s(X_i)\big)}{n\prod_{s=1}^S Z_s}. \qquad (6)$$

The value of the normalization factor in the case of AdaBoost is

$$Z_s^a = \sum_{j \in J_s} w_j^s e^{\alpha_s} + \sum_{i \in I_s} w_i^s e^{-\alpha_s} = 2\sqrt{\epsilon_s(1 - \epsilon_s)}. \qquad (7)$$

Similarly, the value of the normalization factor for VadaBoost is given by

$$Z_s^v = \sum_{j \in J_s} w_j^s e^{\alpha_s} + \sum_{i \in I_s} w_i^s e^{-\alpha_s} = \big(\epsilon_s(1 - \epsilon_s)\big)^{\frac{1}{4}}\big(\sqrt{\epsilon_s} + \sqrt{1 - \epsilon_s}\big). \qquad (8)$$

The squared cost function of AdaBoost after $S$ steps is given by

$$O_A = \left(\sum_{i=1}^n \exp\Big(-y_i \sum_{s=1}^S \alpha_s G_s(X_i)\Big)\right)^2 = \left(n\prod_{s=1}^S Z_s^a \sum_{i=1}^n w_i^{S+1}\right)^2 = n^2\prod_{s=1}^S 4\epsilon_s(1 - \epsilon_s).$$

We used (6), (7) and the fact that $\sum_{i=1}^n w_i^{S+1} = 1$ to derive the above expression. Similarly, for $\lambda = 0$ the cost of VadaBoost satisfies(2)

$$O_V = \left(n\prod_{s=1}^S Z_s^v \sum_{i=1}^n w_i^{S+1}\right)^2 = n^2\prod_{s=1}^S\Big(2\epsilon_s(1 - \epsilon_s) + \sqrt{\epsilon_s(1 - \epsilon_s)}\Big).$$

Now suppose that $\epsilon_s = \epsilon$ for all $s$. Then the squared cost achieved by AdaBoost is $n^2(4\epsilon(1-\epsilon))^S$. To achieve the same cost value, VadaBoost with weak learners of the same error rate needs at most

$$S\,\frac{\log\big(4\epsilon(1 - \epsilon)\big)}{\log\big(2\epsilon(1 - \epsilon) + \sqrt{\epsilon(1 - \epsilon)}\big)}$$

iterations. Within the range of interest for $\epsilon$, the term multiplying $S$ above is at most 2.

Although the above worst-case bound achieves a factor of two, for $\epsilon > 0.4$ VadaBoost requires only about 33% more iterations than AdaBoost. To summarize, even in the worst possible scenario where $\lambda = 0$ (when the variational bound is at its loosest), the VadaBoost algorithm takes no more than double (a small constant factor) the number of iterations of AdaBoost to achieve the same cost.

Algorithm 3 EBBoost
Require: (X_i, y_i)_{i=1}^n, scalar parameter lambda >= 0, and weak learners H
  Initialize the weights: w_i <- 1/n for i = 1, ..., n; initialize f to predict zero on all inputs.
  for s <- 1 to S do
    Get a weak learner G_s(.) that minimizes (3) with the following choice of alpha_s:
    alpha_s = (1/4) log( [(1-lambda)(sum_{i in I_s} w_i)^2 + lambda n sum_{i in I_s} w_i^2]
                       / [(1-lambda)(sum_{i in J_s} w_i)^2 + lambda n sum_{i in J_s} w_i^2] )
    if alpha_s < 0 then break end if
    f(.) <- f(.) + alpha_s G_s(.)
    w_i <- w_i exp(-y_i G_s(X_i) alpha_s) / Z_s, where Z_s is such that sum_{i=1}^n w_i = 1
  end for

5 A limitation of the EBBoost algorithm

A sample variance penalization algorithm known as EBBoost was previously explored [10]. While this algorithm was simple to implement and showed significant improvements over AdaBoost, it suffers from a severe limitation: it requires enumeration and evaluation of every possible weak learner per iteration. Recall the steps implementing EBBoost in Algorithm 3. An implementation of EBBoost requires exhaustive enumeration of weak learners in search of the one that minimizes cost (3). It is preferable, instead, to find the best weak learner by providing weights on the training examples and efficiently computing the rule whose performance on that weighted set of examples is guaranteed to be better than random guessing. However, with the EBBoost algorithm, the weight on all the misclassified examples is $\lambda n\sum_{j \in J_s} w_j^2 + (1-\lambda)\big(\sum_{j \in J_s} w_j\big)^2$ and the weight on correctly classified examples is $\lambda n\sum_{i \in I_s} w_i^2 + (1-\lambda)\big(\sum_{i \in I_s} w_i\big)^2$; these aggregate weights on misclassified and correctly classified examples do not translate into weights on the individual examples. Thus, it becomes necessary to exhaustively enumerate weak learners in Algorithm 3. While enumeration of weak learners is possible in the case of decision stumps, it poses serious difficulties for weak learners such as decision trees, ridge regression, etc.
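EBBoost's per-candidate step size can be sketched as follows (a sketch with our own function name; the trade-off value passed in is a hyperparameter, not a fixed constant from the paper). Note that the squared sums over each index set are what prevent the score from decomposing into fixed per-example weights, which is why Algorithm 3 must enumerate candidate weak learners:

```python
import numpy as np

def ebboost_alpha(w, correct, lam=0.5):
    # Step size from Algorithm 3, given normalized weights w and a boolean
    # mask of which examples the candidate weak learner classifies correctly.
    n = len(w)
    wc, wm = w[correct], w[~correct]
    num = (1 - lam) * wc.sum() ** 2 + lam * n * (wc ** 2).sum()
    den = (1 - lam) * wm.sum() ** 2 + lam * n * (wm ** 2).sum()
    return 0.25 * np.log(num / den)
```

Because num and den depend on (sum w)^2 over each set, two weak learners misclassifying different examples of equal total weight can still receive different scores, so no single reweighting of individual examples reproduces this criterion.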
Thus, VadaBoost is the more versatile boosting algorithm for sample variance penalization.

(2) The cost which VadaBoost minimizes at $\lambda = 0$ is the squared cost of AdaBoost; we do not square it again.

Table 1: Mean and standard errors with decision stump as the weak learner.

Dataset   | AdaBoost    | EBBoost     | VadaBoost   | RLP-Boost   | RQP-Boost
a5a       | 16.15 ± 0.1 | 16.05 ± 0.1 | 16.22 ± 0.1 | 16.21 ± 0.1 | 16.04 ± 0.1
abalone   | 21.64 ± 0.2 | 21.52 ± 0.2 | 21.63 ± 0.2 | 22.29 ± 0.2 | 21.79 ± 0.2
image     |  3.37 ± 0.1 |  3.14 ± 0.1 |  3.14 ± 0.1 |  3.18 ± 0.1 |  3.09 ± 0.1
mushrooms |  0.02 ± 0.0 |  0.02 ± 0.0 |  0.01 ± 0.0 |  0.01 ± 0.0 |  0.00 ± 0.0
musk      |  3.84 ± 0.1 |  3.51 ± 0.1 |  3.59 ± 0.1 |  3.60 ± 0.1 |  3.41 ± 0.1
mnist09   |  0.89 ± 0.0 |  0.85 ± 0.0 |  0.84 ± 0.0 |  0.98 ± 0.0 |  0.88 ± 0.0
mnist14   |  0.64 ± 0.0 |  0.58 ± 0.0 |  0.60 ± 0.0 |  0.68 ± 0.0 |  0.63 ± 0.0
mnist27   |  2.11 ± 0.1 |  1.86 ± 0.1 |  2.01 ± 0.1 |  2.06 ± 0.1 |  1.95 ± 0.1
mnist38   |  4.45 ± 0.1 |  4.12 ± 0.1 |  4.32 ± 0.1 |  4.51 ± 0.1 |  4.25 ± 0.1
mnist56   |  2.79 ± 0.1 |  2.56 ± 0.1 |  2.62 ± 0.1 |  2.77 ± 0.1 |  2.72 ± 0.1
ringnorm  | 13.16 ± 0.6 | 11.74 ± 0.6 | 12.46 ± 0.6 | 13.02 ± 0.6 | 12.86 ± 0.6
spambase  |  5.90 ± 0.1 |  5.64 ± 0.1 |  5.78 ± 0.1 |  5.81 ± 0.1 |  5.75 ± 0.1
splice    |  8.83 ± 0.2 |  8.33 ± 0.1 |  8.48 ± 0.1 |  8.55 ± 0.2 |  8.47 ± 0.1
twonorm   |  3.16 ± 0.1 |  2.98 ± 0.1 |  3.09 ± 0.1 |  3.29 ± 0.1 |  3.07 ± 0.1
w4a       |  2.60 ± 0.1 |  2.38 ± 0.1 |  2.50 ± 0.1 |  2.44 ± 0.1 |  2.36 ± 0.1
waveform  | 10.99 ± 0.1 | 10.96 ± 0.1 | 10.75 ± 0.1 | 10.95 ± 0.1 | 10.60 ± 0.1
wine      | 23.62 ± 0.2 | 23.52 ± 0.2 | 23.41 ± 0.1 | 24.16 ± 0.1 | 23.61 ± 0.1
wisc      |  5.32 ± 0.3 |  4.38 ± 0.2 |  5.00 ± 0.2 |  4.96 ± 0.3 |  4.72 ± 0.3

Table 2: Mean and standard errors with CART as the weak learner.

Dataset   | AdaBoost    | VadaBoost   | RLP-Boost   | RQP-Boost
a5a       | 17.59 ± 0.2 | 17.16 ± 0.1 | 18.24 ± 0.1 | 17.99 ± 0.1
abalone   | 21.87 ± 0.2 | 21.30 ± 0.2 | 22.16 ± 0.2 | 21.84 ± 0.2
image     |  1.93 ± 0.1 |  1.98 ± 0.1 |  1.99 ± 0.1 |  1.95 ± 0.1
mushrooms |  0.01 ± 0.0 |  0.01 ± 0.0 |  0.02 ± 0.0 |  0.01 ± 0.0
musk      |  2.36 ± 0.1 |  2.07 ± 0.1 |  2.40 ± 0.1 |  2.29 ± 0.1
mnist09   |  0.73 ± 0.0 |  0.72 ± 0.0 |  0.76 ± 0.0 |  0.71 ± 0.0
mnist14   |  0.52 ± 0.0 |  0.50 ± 0.0 |  0.55 ± 0.0 |  0.52 ± 0.0
mnist27   |  1.31 ± 0.0 |  1.24 ± 0.0 |  1.32 ± 0.0 |  1.29 ± 0.0
mnist38   |  1.89 ± 0.1 |  1.72 ± 0.1 |  1.88 ± 0.1 |  1.87 ± 0.1
mnist56   |  1.23 ± 0.1 |  1.17 ± 0.0 |  1.20 ± 0.0 |  1.19 ± 0.1
ringnorm  |  7.94 ± 0.4 |  7.78 ± 0.4 |  8.60 ± 0.5 |  7.84 ± 0.4
spambase  |  6.14 ± 0.1 |  5.76 ± 0.1 |  6.25 ± 0.1 |  6.03 ± 0.1
splice    |  4.02 ± 0.1 |  3.67 ± 0.1 |  4.03 ± 0.1 |  3.97 ± 0.1
twonorm   |  3.40 ± 0.1 |  3.27 ± 0.1 |  3.50 ± 0.1 |  3.38 ± 0.1
w4a       |  2.90 ± 0.1 |  2.90 ± 0.1 |  2.90 ± 0.1 |  2.90 ± 0.1
waveform  | 11.09 ± 0.1 | 10.59 ± 0.1 | 11.11 ± 0.1 | 10.82 ± 0.1
wine      | 21.94 ± 0.2 | 21.18 ± 0.2 | 22.44 ± 0.2 | 22.18 ± 0.2
wisc      |  4.61 ± 0.2 |  4.18 ± 0.2 |  4.63 ± 0.2 |  4.37 ± 0.2

6 Experiments

In this section, we evaluate the empirical performance of the VadaBoost algorithm with respect to several other algorithms.
The primary purpose of our experiments is to compare sample variance penalization with empirical risk minimization and to show that we can efficiently perform sample variance penalization for weak learners beyond decision stumps. We compared VadaBoost against EBBoost, AdaBoost, and the regularized LP and QP boost algorithms [7]. All the algorithms except AdaBoost have one extra parameter to tune.

Experiments were performed on benchmark datasets that have been previously used in [10]. These datasets cover a variety of tasks, including all digits from the MNIST dataset. Each dataset was divided into three parts: 50% for training, 25% for validation and 25% for test. The total number of examples was restricted to 5000 in the case of the MNIST and musk datasets due to the computational restrictions of solving the LP/QP.

The first set of experiments uses decision stumps as the weak learners. The second set uses Classification and Regression Trees (CART) [3] as weak learners. A standard MATLAB implementation of CART was used without modification. For all the datasets, in both experiments, AdaBoost, VadaBoost and EBBoost (in the case of stumps) were run until there was no drop in the error rate on the validation set for 100 consecutive iterations. The values of the parameters for VadaBoost and EBBoost were chosen to minimize the validation error upon termination. RLP-Boost and RQP-Boost were given the predictions obtained by AdaBoost; their regularization parameter was also chosen to minimize the error rate on the validation set. Once the parameter values were fixed via the validation set, we noted the test set error corresponding to that parameter value. The entire experiment was repeated 50 times by randomly selecting train, test and validation sets. The numbers reported here are averages over these runs.

The results for the decision stump and CART experiments are reported in Tables 1 and 2.
For each dataset, the algorithm with the best percentage test error is represented by a dark shaded cell. All lightly shaded entries in a row denote results that are not significantly different from the minimum error (according to a paired t-test at a 1% significance level). With decision stumps, EBBoost and VadaBoost have comparable performance and both significantly outperform AdaBoost. With CART as the weak learner, VadaBoost is once again significantly better than AdaBoost.

We gave a guarantee on the number of iterations required by VadaBoost in the worst case (approximately matching the squared AdaBoost cost, Theorem 5). An assumption in that theorem was that the error rate of each weak learner was fixed. In practice, however, the error rates of the weak learners are not constant over the iterations. To see the behavior in practice, we show results for the MNIST 3 versus 8 classification experiment. Figure 2 plots the cost (plus 1) for each algorithm (the AdaBoost cost has been squared) against the number of iterations, using a logarithmic scale on the Y-axis. Since EBBoost reduces to AdaBoost at $\lambda = 0$, we omit its plot at that setting. From the figure, it can be seen that the number of iterations required by VadaBoost is roughly twice that required by AdaBoost. At $\lambda = 0.5$, there is only a minor difference in the number of iterations required by EBBoost and VadaBoost.

Figure 2: 1 + cost vs. the number of iterations.

7 Conclusions

This paper identified a key weakness in the EBBoost algorithm and proposed a novel algorithm that efficiently overcomes its limitation to enumerable weak learners. VadaBoost reduces a well motivated cost by iteratively minimizing an upper bound which, unlike EBBoost, allows the boosting method to handle any weak learner by estimating weights on the data.
The update rule of VadaBoost has a simplicity that is reminiscent of AdaBoost. Furthermore, despite the use of an upper bound, the novel boosting method remains efficient: even when the bound is at its loosest, the number of iterations required by VadaBoost is only a small constant factor more than the number required by AdaBoost. Experimental results showed that VadaBoost outperforms AdaBoost in classification accuracy while applying efficiently to any family of weak learners. The effectiveness of boosting has been explained via margin theory [9], though it has taken a number of years to settle certain open questions [8]. Considering the simplicity and effectiveness of VadaBoost, one natural future research direction is to study the margin distributions it obtains. Another is to design efficient sample variance penalization algorithms for other problems such as multi-class classification, ranking, and so on.

Acknowledgements This material is based upon work supported by the National Science Foundation under Grant No. 1117631, by a Google Research Award, and by the Department of Homeland Security under Grant No. N66001-09-C-0080.

References

[1] J-Y. Audibert, R. Munos, and C. Szepesvári. Tuning bandit algorithms in stochastic environments. In ALT, 2007.
[2] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.
[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman and Hall, New York, 1984.
[4] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.
Journal of Computer and System Sciences, 55(1):119-139, 1997.
[5] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. In COLT, 2009.
[6] V. Mnih, C. Szepesvári, and J-Y. Audibert. Empirical Bernstein stopping. In COLT, 2008.
[7] G. Raetsch, T. Onoda, and K.-R. Muller. Soft margins for AdaBoost. Machine Learning, 43:287-320, 2001.
[8] L. Reyzin and R. Schapire. How boosting the margin can also boost classifier complexity. In ICML, 2006.
[9] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651-1686, 1998.
[10] P. K. Shivaswamy and T. Jebara. Empirical Bernstein boosting. In AISTATS, 2010.
[11] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, NY, 1995.
", "award": [], "sourceid": 1075, "authors": [{"given_name": "Pannagadatta", "family_name": "Shivaswamy", "institution": null}, {"given_name": "Tony", "family_name": "Jebara", "institution": null}]}