{"title": "A Boosting Framework on Grounds of Online Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2267, "page_last": 2275, "abstract": "By exploiting the duality between boosting and online learning, we present a boosting framework which proves to be extremely powerful thanks to employing the vast knowledge available in the online learning area. Using this framework, we develop various algorithms to address multiple practically and theoretically interesting questions including sparse boosting, smooth-distribution boosting, agnostic learning and, as a by-product, some generalization to double-projection online learning algorithms.", "full_text": "A Boosting Framework on Grounds of Online\n\nLearning\n\nTo\ufb01gh Naghibi, Beat P\ufb01ster\n\nComputer Engineering and Networks Laboratory\n\nETH Zurich, Switzerland\n\nnaghibi@tik.ee.ethz.ch, pfister@tik.ee.ethz.ch\n\nAbstract\n\nBy exploiting the duality between boosting and online learning, we present a\nboosting framework which proves to be extremely powerful thanks to employing\nthe vast knowledge available in the online learning area. Using this framework,\nwe develop various algorithms to address multiple practically and theoretically\ninteresting questions including sparse boosting, smooth-distribution boosting, ag-\nnostic learning and, as a by-product, some generalization to double-projection\nonline learning algorithms.\n\n1 Introduction\n\nA boosting algorithm can be seen as a meta-algorithm that maintains a distribution over the sample\nspace. At each iteration a weak hypothesis is learned and the distribution is updated, accordingly.\nThe output (strong hypothesis) is a convex combination of the weak hypotheses. 
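The protocol just described can be made concrete with a short sketch. This is an illustrative implementation of the generic meta-loop, not code from the paper: the reweighting rule shown is AdaBoost's exponential update (one possible choice of distribution update), and `stump_learner` is a toy weak learner of our own for 1-D data.

```python
import numpy as np

def boost(X, a, weak_learn, n_rounds):
    """Generic boosting meta-loop: maintain a distribution w over the
    samples, train a weak hypothesis, update w, and output a weighted
    combination of the weak hypotheses.  The update rule (AdaBoost's
    exponential reweighting in this sketch) is the part that
    distinguishes individual boosting algorithms."""
    N = len(a)
    w = np.full(N, 1.0 / N)                  # uniform initial distribution
    hyps, coefs = [], []
    for _ in range(n_rounds):
        h = weak_learn(X, a, w)              # weak hypothesis h: X -> {-1, +1}
        pred = h(X)
        gamma = np.sum(w * a * pred)         # edge of h under w
        gamma = min(gamma, 1.0 - 1e-12)      # guard against a perfect stump
        eta = 0.5 * np.log((1.0 + gamma) / (1.0 - gamma))
        w = w * np.exp(-eta * a * pred)      # up-weight mistakes
        w /= w.sum()                         # renormalize onto the simplex
        hyps.append(h)
        coefs.append(eta)
    return lambda X: np.sign(sum(c * h(X) for c, h in zip(coefs, hyps)))

def stump_learner(X, a, w):
    """Toy weak learner on 1-D data: best threshold stump s * sign(x - thr)."""
    best, best_edge = (X[0], 1.0), -1.0
    for thr in X:
        for s in (-1.0, 1.0):
            edge = np.sum(w * a * s * np.sign(X - thr + 1e-12))
            if edge > best_edge:
                best_edge, best = edge, (thr, s)
    thr, s = best
    return lambda X, thr=thr, s=s: s * np.sign(X - thr + 1e-12)
```

On a toy sample such as X = [0, 1, 2, 3] with labels [-1, 1, 1, -1] (not separable by any single stump), three rounds of this loop already drive the training error to zero.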
Two dominant views to describe and design boosting algorithms are "weak to strong learner" (WTSL), which is the original viewpoint presented in [1, 2], and boosting by "coordinate-wise gradient descent in the functional space" (CWGD), appearing in later works [3, 4, 5]. A boosting algorithm adhering to the first view guarantees that it requires only a finite number of iterations (equivalently, a finite number of weak hypotheses) to learn a (1 − ε)-accurate hypothesis. In contrast, an algorithm resulting from the CWGD viewpoint (usually called a potential booster) may not necessarily be a boosting algorithm in the probably approximately correct (PAC) learning sense. However, while it is rather difficult to construct a boosting algorithm based on the first view, the algorithmic frameworks resulting from the second viewpoint, e.g., AnyBoost [4], have proven to be particularly prolific when it comes to developing new boosting algorithms. Under the CWGD view, the choice of the convex loss function to be minimized is (arguably) the cornerstone of designing a boosting algorithm. This, however, is a severe disadvantage in some applications.

In CWGD, the weights are not directly controllable (designable) and are only viewed as the values of the gradient of the loss function. In many applications, some characteristics of the desired distribution are known or given as problem requirements, while finding a loss function that generates such a distribution is likely to be difficult. 
For instance, what loss functions can generate sparse distributions?1 What family of loss functions results in a smooth distribution?2 We can even go further and imagine scenarios in which a loss function needs to put more weight on a given subset of examples than on others, either because that subset has more reliable labels or because it is a problem requirement to have a more accurate hypothesis for that part of the sample space. Then, what loss function can generate such a customized distribution? Moreover, does it result in a provable boosting algorithm? In general, how can we characterize the accuracy of the final hypothesis?

Although, to be fair, the so-called loss function hunting approach has given rise to useful boosting algorithms such as LogitBoost, FilterBoost, GiniBoost and MadaBoost [5, 8, 9, 10] which (to some extent) answer some of the above questions, it is an inflexible and relatively unsuccessful approach to addressing boosting problems with distribution constraints.

Another approach to designing a boosting algorithm is to directly follow the WTSL viewpoint [11, 6, 12]. The immediate advantages of such an approach are, first, that the resultant algorithms are provable boosting algorithms, i.e., they output a hypothesis of arbitrary accuracy. Second, the booster has direct control over the weights, making it more suitable for boosting problems subject to some distribution constraints. 

1In the boosting terminology, sparsity usually refers to the greedy hypothesis-selection strategy of boosting methods in the functional space. However, sparsity in this paper refers to the sparsity of the distribution (weights) over the sample space.

2A smooth distribution is a distribution that does not put too much weight on any single sample or, in other words, a distribution emulated by the booster that does not dramatically diverge from the target distribution [6, 7].
However, since the WTSL view does not offer any algorithmic framework (as opposed to the CWGD view), it is rather difficult to come up with a distribution update mechanism resulting in a provable boosting algorithm. There are, however, a few useful, albeit fairly limited, algorithmic frameworks such as TotalBoost [13] that can be used to derive other provable boosting algorithms. The TotalBoost algorithm can maximize the margin by iteratively solving a convex problem with the totally corrective constraint. A more general family of boosting algorithms was later proposed by Shalev-Shwartz et al. [14], where it was shown that weak learnability and linear separability are equivalent, a result following from von Neumann's minmax theorem. Using this theorem, they constructed a family of algorithms that maintain smooth distributions over the sample space, and consequently are noise tolerant. Their proposed algorithms find a (1 − ε)-accurate solution after performing at most O(log(N)/ε²) iterations, where N is the number of training examples.

1.1 Our Results

We present a family of boosting algorithms that can be derived from well-known online learning algorithms, including projected gradient descent [15] and its generalization, mirror descent (both active and lazy updates, see [16]), and composite objective mirror descent (COMID) [17]. We prove the PAC learnability of the algorithms derived from this framework and we show that this framework in fact generates maximum-margin algorithms. That is, given a desired accuracy level ν, it outputs a hypothesis of margin γ_min − ν, with γ_min being the minimum edge that the weak classifier guarantees to return.

The duality between (linear) online learning and boosting is by no means new. This duality was first pointed out in [2] and was later elaborated and formalized using von Neumann's minmax theorem [18]. 
Following this line, we provide several proof techniques required to show the PAC\nlearnability of the derived boosting algorithms. These techniques are fairly versatile and can be used\nto translate many other online learning methods into our boosting framework. To motivate our boost-\ning framework, we derive two practically and theoretically interesting algorithms: (I) SparseBoost\nalgorithm which by maintaining a sparse distribution over the sample space tries to reduce the space\nand the computation complexity. In fact this problem, i.e., applying batch boosting on the successive\nsubsets of data when there is not suf\ufb01cient memory to store an entire dataset, was \ufb01rst discussed by\nBreiman in [19], though no algorithm with theoretical guarantee was suggested. SparseBoost is the\n\ufb01rst provable batch booster that can (partially) address this problem. By analyzing this algorithm,\nwe show that the tuning parameter of the regularization term \u21131 at each round t should not exceed\n\u03b3t\n2 \u03b7t to still have a boosting algorithm, where \u03b7t is the coef\ufb01cient of the tth weak hypothesis and \u03b3t\nis its edge. Exploiting sparsity in example domain has also been investigated in [20] by Hatano and\nTakimoto. (II) A smooth boosting algorithm that requires only O(log 1/\u01eb) number of rounds to learn\na (1 \u2212 \u01eb)-accurate hypothesis. This algorithm can also be seen as an agnostic boosting algorithm3\ndue to the fact that smooth distributions provide a theoretical guarantee for noise tolerance in various\nnoisy learning settings, such as agnostic boosting [22, 23].\n\nFurthermore, we provide an interesting theoretical result about MadaBoost [10]. 
We give a proof\n(to the best of our knowledge the only available unconditional proof) for the boosting property of\n(a variant of) MadaBoost and show that, unlike the common presumption, its convergence rate is of\nO(1/\u01eb2) rather than O(1/\u01eb).\n\n3Unlike the PAC model, the agnostic learning model allows an arbitrary target function (labeling function)\n\nthat may not belong to the class studied, and hence, can be viewed as a noise tolerant learning model [21].\n\n2\n\n\fFinally, we show our proof technique can be employed to generalize some of the known online\nlearning algorithms. Speci\ufb01cally, consider the Lazy update variant of the online Mirror Descent\n(LMD) algorithm (see for instance [16]). The standard proof to show that the LMD update scheme\nachieves vanishing regret bound is through showing its equivalence to the FTRL algorithm [16] in\nthe case that they are both linearized, i.e., the cost function is linear. However, this indirect proof is\nfairly restrictive when it comes to generalizing the LMD-type algorithms. Here, we present a direct\nproof for it, which can be easily adopted to generalize the LMD-type algorithms.\n\n2 Preliminaries\n\nLet {(xi, ai)}, 1 \u2264 i \u2264 N , be N training samples, where xi \u2208 X and ai \u2208 {\u22121, +1}. Assume\nh \u2208 H is a real-valued function mapping X into [\u22121, 1]. Denote a distribution over the training data\nby w = [w1, . . . , wN ]\u22a4 and de\ufb01ne a loss vector d = [\u2212a1h(x1), . . . , \u2212aN h(xN )]\u22a4. We de\ufb01ne\n\u03b3 = \u2212w\u22a4d as the edge of the hypothesis h under the distribution w and it is assumed to be positive\nwhen h is returned by a weak learner. In this paper we do not consider the branching program based\nboosters and adhere to the typical boosting protocol (described in Section 1).\n\nSince a central notion throughout this paper is that of Bregman divergences, we brie\ufb02y revisit some\nof their properties. 
A Bregman divergence is defined with respect to a convex function R as

B_R(x, y) = R(x) − R(y) − ∇R(y)⊤(x − y)   (1)

and can be interpreted as a distance measure between x and y. Due to the convexity of R, a Bregman divergence is always non-negative, i.e., B_R(x, y) ≥ 0. In this work we consider R to be a β-strongly convex function4 with respect to a norm ||.||. With this choice of R, the Bregman divergence satisfies B_R(x, y) ≥ (β/2)||x − y||². As an example, if R(x) = (1/2)x⊤x (which is 1-strongly convex with respect to ||.||₂), then B_R(x, y) = (1/2)||x − y||₂² is the Euclidean distance. Another example is the negative entropy function R(x) = Σ_{i=1}^N x_i log x_i (resulting in the KL-divergence), which is known to be 1-strongly convex over the probability simplex with respect to the ℓ₁ norm.

The Bregman projection is another fundamental concept of our framework.

Definition 1 (Bregman Projection). The Bregman projection of a vector y onto a convex set S with respect to a Bregman divergence B_R is

Π_S(y) = argmin_{x ∈ S} B_R(x, y)   (2)

Moreover, the following generalized Pythagorean theorem holds for Bregman projections.

Lemma 1 (Generalized Pythagorean) [24, Lemma 11.3]. Given a point y ∈ R^N, a convex set S and ŷ = Π_S(y) as the Bregman projection of y onto S, for all x ∈ S we have

Exact:   B_R(x, y) ≥ B_R(x, ŷ) + B_R(ŷ, y)   (3)
Relaxed: B_R(x, y) ≥ B_R(x, ŷ)   (4)

The relaxed version follows from the fact that B_R(ŷ, y) ≥ 0 and thus can be ignored.

Lemma 2. For any vectors x, y, z, we have

(x − y)⊤(∇R(z) − ∇R(y)) = B_R(x, y) − B_R(x, z) + B_R(y, z)   (5)

The above lemma follows directly from the Bregman divergence definition in (1). 
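The two running examples of R above, together with the entropy Bregman projection onto the simplex (which reduces to plain normalization for positive vectors), can be written out in a few lines of NumPy. This is an illustrative sketch in our own notation, not code from the paper; note that for this particular projection the generalized Pythagorean theorem holds with equality.

```python
import numpy as np

def bregman_euclid(x, y):
    # R(x) = 0.5 * ||x||_2^2  ->  B_R(x, y) = 0.5 * ||x - y||_2^2
    return 0.5 * np.sum((x - y) ** 2)

def bregman_kl(x, y):
    # R(x) = sum_i x_i log x_i (negative entropy) -> the unnormalized
    # KL divergence, valid for strictly positive vectors x, y.
    return np.sum(x * np.log(x / y) - x + y)

def project_simplex_kl(y):
    # The entropy Bregman projection of a positive vector y onto the
    # probability simplex S is simply normalization:
    # argmin_{x in S} B_R(x, y) = y / ||y||_1
    return y / np.sum(y)
```

A quick numerical check: for any x on the simplex and positive y, the relaxed inequality (4) holds, and here (3) is in fact an identity because the simplex is an affine slice of the positive orthant.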
Additionally, the following definitions from convex analysis are useful throughout the paper.

Definition 2 (Norm & dual norm). Let ||.||_A be a norm. Then its dual norm is defined as

||y||_{A*} = sup{y⊤x : ||x||_A ≤ 1}   (6)

For instance, the dual norm of the ℓ₂ norm is the ℓ₂ norm itself, and the dual norm of the ℓ₁ norm is the ℓ∞ norm. Further,

Lemma 3. For any vectors x, y and any norm ||.||_A, the following inequality holds:

x⊤y ≤ ||x||_A ||y||_{A*} ≤ (1/2)||x||_A² + (1/2)||y||_{A*}²   (7)

4That is, its second derivative (Hessian in higher dimensions) is bounded away from zero by at least β.

Throughout this paper, we use the shorthands ||.||_A = ||.|| and ||.||_{A*} = ||.||* for the norm and its dual, respectively.

Finally, before continuing, we establish our notation. Vectors are lower-case bold letters and their entries are non-bold letters with subscripts, such as x_i of x, or non-bold letters with superscripts if the vector already has a subscript, such as x_t^i of x_t. Moreover, the N-dimensional probability simplex is denoted by S = {w | Σ_{i=1}^N w_i = 1, w_i ≥ 0}. The proofs of the theorems and lemmas can be found in the Supplement.

3 Boosting Framework

Let R(x) be a 1-strongly convex function with respect to a norm ||.|| and denote its associated Bregman divergence by B_R. Moreover, let the dual norm of a loss vector d_t be upper bounded, i.e., ||d_t||* ≤ L. It is easy to verify that for d_t as defined in MABoost, L = 1 when ||.||* = ℓ∞ and L = N when ||.||* = ℓ₂. The following Mirror Ascent Boosting (MABoost) algorithm is our boosting framework.

Algorithm 1: Mirror Ascent Boosting (MABoost)

Input: a 1-strongly convex function R(x), w₁ = [1/N, . . . , 1/N]⊤ and z₁ = [1/N, . . . , 1/N]⊤

For t = 1, . . . , T do
(a) Train the classifier with w_t and get h_t; let d_t = [−a₁h_t(x₁), . . . , −a_N h_t(x_N)]⊤ and γ_t = −w_t⊤d_t.
(b) Set η_t = γ_t / L
(c) Update the weights:
    ∇R(z_{t+1}) = ∇R(z_t) + η_t d_t   (lazy update)
    ∇R(z_{t+1}) = ∇R(w_t) + η_t d_t   (active update)
(d) Project onto S: w_{t+1} = argmin_{w ∈ S} B_R(w, z_{t+1})
End

Output: the final hypothesis f(x) = sign(Σ_{t=1}^T η_t h_t(x)).

This algorithm is a variant of the mirror descent algorithm [16], modified to work as a boosting algorithm. The basic principle of the algorithm is quite clear. As in AdaBoost, the weight of a wrongly (correctly) classified sample increases (decreases). The weight vector is then projected onto the probability simplex in order to keep the weight sum equal to 1. The distinction between the active and lazy update versions, and the fact that the algorithm may behave quite differently under the two update strategies, should be emphasized. In the lazy update version, the norm of the auxiliary variable z_t is unbounded, which makes the lazy update inappropriate in some situations. In the active update version, on the other hand, the algorithm always needs to access (compute) the previous projected weight w_t to update the weight at round t, and this may not be possible in some applications (such as boosting-by-filtering).

Due to the duality between online learning and boosting, it is not surprising that MABoost (both the active and lazy versions) is a boosting algorithm. The proof of its boosting property, however, reveals some interesting properties which enable us to generalize the MABoost framework. In the following, only the proof of the active update is given; the lazy update is left to Section 3.4.

Theorem 1. Suppose that MABoost generates weak hypotheses h₁, . . . , h_T whose edges are γ₁, . . . , γ_T. 
Then the error ε of the combined hypothesis f on the training set is bounded as:

R(w) = (1/2)||w||₂²:          ε ≤ 1 / (1 + Σ_{t=1}^T γ_t²)   (8)
R(w) = Σ_{i=1}^N w_i log w_i:   ε ≤ e^{−(1/2) Σ_{t=1}^T γ_t²}   (9)

In fact, the first bound (8) holds for any 1-strongly convex R, though for some R (e.g., negative entropy) a much tighter bound as in (9) can be achieved.

Proof: Assume w* = [w*₁, . . . , w*_N]⊤ is a distribution vector where w*_i = 1/(Nε) if f(x_i) ≠ a_i, and 0 otherwise. w* can be seen as a uniform distribution over the samples wrongly classified by the ensemble hypothesis f. Using this vector and following the approach in [16], we derive an upper bound on Σ_{t=1}^T η_t(w*⊤d_t − w_t⊤d_t), where d_t = [d_t¹, . . . , d_t^N]⊤ is the loss vector as defined in Algorithm 1.

(w* − w_t)⊤ η_t d_t = (w* − w_t)⊤(∇R(z_{t+1}) − ∇R(w_t))   (10a)
 = B_R(w*, w_t) − B_R(w*, z_{t+1}) + B_R(w_t, z_{t+1})   (10b)
 ≤ B_R(w*, w_t) − B_R(w*, w_{t+1}) + B_R(w_t, z_{t+1})   (10c)

where equality (10b) follows from Lemma 2 and inequality (10c) results from the relaxed version of Lemma 1. Note that Lemma 1 can be applied here because w* ∈ S.

Further, the B_R(w_t, z_{t+1}) term is bounded. By applying Lemma 3,

B_R(w_t, z_{t+1}) + B_R(z_{t+1}, w_t) = (z_{t+1} − w_t)⊤ η_t d_t ≤ (1/2)||z_{t+1} − w_t||² + (1/2)η_t²||d_t||*²   (11)

and since B_R(z_{t+1}, w_t) ≥ (1/2)||z_{t+1} − w_t||² due to the 1-strong convexity of R, we have

B_R(w_t, z_{t+1}) ≤ (1/2)η_t²||d_t||*²   (12)

Now, substituting (12) into (10c) and summing from t = 1 to T yields

Σ_{t=1}^T (w*⊤η_t d_t − w_t⊤η_t d_t) ≤ Σ_{t=1}^T (1/2)η_t²||d_t||*² + B_R(w*, w₁) − B_R(w*, w_{T+1})   (13)

Moreover, it is evident from the algorithm description that for mistakenly classified samples

−a_i f(x_i) = −a_i sign(Σ_{t=1}^T η_t h_t(x_i)) = sign(Σ_{t=1}^T η_t d_t^i) ≥ 0   ∀x_i ∈ {x | f(x_i) ≠ a_i}   (14)

Following (14), the first term in (13) satisfies w*⊤ Σ_{t=1}^T η_t d_t ≥ 0 and thus can be ignored. Moreover, by the definition of γ, the second term is Σ_{t=1}^T −w_t⊤η_t d_t = Σ_{t=1}^T η_t γ_t. Putting all these together, ignoring the last term in (13) and replacing ||d_t||*² with its upper bound L, yields

−B_R(w*, w₁) ≤ L Σ_{t=1}^T (1/2)η_t² − Σ_{t=1}^T η_t γ_t   (15)

Replacing the left-hand side with −B_R = −(1/2)||w* − w₁||² = (ε − 1)/(2Nε) for the case of quadratic R, and with −B_R = log(ε) when R is the negative entropy function, then taking the derivative w.r.t. η_t and equating it to zero (which yields η_t = γ_t/L), we obtain the error bounds in (8) and (9). Note that in the case of R being the negative entropy function, Algorithm 1 degenerates into AdaBoost with a different choice of η_t.

Before continuing our discussion, it is important to mention that the cornerstone concept of the proof is the choice of w*. 
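To make the degenerate entropy case concrete, here is a minimal runnable sketch of the active update with R the negative entropy, so that step (c) becomes z_{t+1} = w_t · exp(η_t d_t) and the projection in step (d) is normalization. The toy stump weak learner and all names are our own illustrative assumptions, not the paper's code; the function also returns the right-hand side of bound (9) so the theorem can be sanity-checked numerically.

```python
import numpy as np

def stump_learner(X, a, w):
    # Toy weak learner on 1-D data: best threshold stump s * sign(x - thr).
    best, best_edge = (X[0], 1.0), -1.0
    for thr in X:
        for s in (-1.0, 1.0):
            edge = np.sum(w * a * s * np.sign(X - thr + 1e-12))
            if edge > best_edge:
                best_edge, best = edge, (thr, s)
    thr, s = best
    return lambda X, thr=thr, s=s: s * np.sign(X - thr + 1e-12)

def maboost_entropy(X, a, weak_learn, T):
    """Active-update MABoost with R = negative entropy: since
    nabla R(w) = 1 + log w, step (c) reads z_{t+1} = w_t * exp(eta_t d_t)
    and the Bregman projection in step (d) is normalization.  With
    ||.||_* = l_inf we have L = 1, hence eta_t = gamma_t."""
    N = len(a)
    w = np.full(N, 1.0 / N)
    etas, hyps, gammas = [], [], []
    for _ in range(T):
        h = weak_learn(X, a, w)
        d = -a * h(X)                    # loss vector d_t
        gamma = -w @ d                   # edge gamma_t = -w_t^T d_t
        eta = gamma                      # step (b) with L = 1
        w = w * np.exp(eta * d)          # step (c), active entropy update
        w /= w.sum()                     # step (d), KL projection onto S
        etas.append(eta); hyps.append(h); gammas.append(gamma)
    f = lambda X: np.sign(sum(e * h(X) for e, h in zip(etas, hyps)))
    bound = np.exp(-0.5 * np.sum(np.square(gammas)))  # right side of (9)
    return f, bound
```

On X = [0, 1, 2, 3] with labels [-1, 1, 1, -1], three rounds suffice for zero training error, comfortably inside the bound (9).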
For instance, a different choice of w* results in the following max-margin theorem.

Theorem 2. Setting η_t = γ_t/(L√t), MABoost outputs a hypothesis of margin at least γ_min − ν, where ν is a desired accuracy level and tends to zero in O(log T/√T) rounds of boosting.

Observations: Two observations follow immediately from the proof of Theorem 1. First, the requirement for using Lemma 1 is w* ∈ S, so in the case of projecting onto a smaller convex set S_k ⊆ S, as long as w* ∈ S_k holds, the proof is intact. Second, only the relaxed version of Lemma 1 is required in the proof (to obtain inequality (10c)). Hence, if there is an approximate projection operator Π̂_S that satisfies the inequality B_R(w*, z_{t+1}) ≥ B_R(w*, Π̂_S(z_{t+1})), it can be substituted for the exact projection operator Π_S and the active update version of the algorithm still works. A practical approximate operator of this type can be obtained through a double-projection strategy.

Lemma 4. Consider the convex sets K and S, where S ⊆ K. Then for any x ∈ S and y ∈ R^N, Π̂_S(y) = Π_S(Π_K(y)) is an approximate projection that satisfies B_R(x, y) ≥ B_R(x, Π̂_S(y)).

These observations are employed to generalize Algorithm 1. However, we want to emphasize that the approximate Bregman projection is only valid for the active update version of MABoost.

3.1 Smooth Boosting

Let k > 0 be a smoothness parameter. A distribution w is smooth w.r.t. a given distribution D if w_i ≤ kD_i for all 1 ≤ i ≤ N. Here, we consider smoothness w.r.t. the uniform distribution, i.e., D_i = 1/N. Then, given a desired smoothness parameter k, we require a boosting algorithm that only constructs distributions w such that w_i ≤ k/N, while guaranteeing to output a (1 − 1/k)-accurate hypothesis. 
To this end, we only need to replace the probability simplex S with S_k = {w | Σ_{i=1}^N w_i = 1, 0 ≤ w_i ≤ k/N} in MABoost to obtain a smooth distribution boosting algorithm, called smooth-MABoost. That is, the update rule is: w_{t+1} = argmin_{w ∈ S_k} B_R(w, z_{t+1}).

Note that the proof of Theorem 1 holds for smooth-MABoost as well. As long as ε ≥ 1/k, the error distribution w* (w*_i = 1/(Nε) if f(x_i) ≠ a_i, and 0 otherwise) is in S_k because 1/(Nε) ≤ k/N. Thus, based on the first observation, the error bounds achieved in Theorem 1 hold for ε ≥ 1/k. In particular, ε = 1/k is reached after a finite number of iterations. This projection problem has already appeared in the literature. An entropic projection algorithm (R is negative entropy), for instance, was proposed in [14]. Using negative entropy and their suggested projection algorithm results in a fast smooth boosting algorithm with the following convergence rate.

Theorem 3. Given R(w) = Σ_{i=1}^N w_i log w_i and a desired ε, smooth-MABoost finds a (1 − ε)-accurate hypothesis in O(log(1/ε)/γ²) iterations.

3.2 Combining Datasets

Let's assume we have two sets of data: a primary dataset A and a secondary dataset B. The goal is to train a classifier that achieves (1 − ε) accuracy on A while limiting the error on dataset B to ε_B ≤ 1/k. This scenario has many potential applications, including transfer learning [25], weighted combination of datasets based on their noise level, and emphasizing a particular region of the sample space as a problem requirement (e.g., a medical diagnostic test that should not make a wrong diagnosis when the sample is a pregnant woman). To address this problem, we only need to replace S in MABoost with S_c = {w | Σ_{i=1}^N w_i = 1, 0 ≤ w_i ∀i ∈ A ∧ 0 ≤ w_i ≤ k/N ∀i ∈ B}, where i ∈ A shorthands the indices of samples in A. By generating smooth distributions on B, this algorithm limits the weight of the secondary dataset, which intuitively results in limiting its effect on the final hypothesis. The proof of its boosting property is quite similar to Theorem 1 (see supplement).

3.3 Sparse Boosting

Let R(w) = (1/2)||w||₂². Since in this case the projection onto the simplex is in fact an ℓ₁-constrained optimization problem, it is plausible that some of the weights are zero (sparse distribution), which is already a useful observation. To promote the sparsity of the weight vector, we want to directly regularize the projection with the ℓ₁ norm, i.e., add ||w||₁ to the objective function in the projection step. This is, however, not possible in MABoost, since ||w||₁ is trivially constant on the simplex. Therefore, we split the projection step into two consecutive steps. The first projection is onto R^N₊ = {y | 0 ≤ y_i}.

Surprisingly, projection onto R^N₊ implicitly regularizes the weights of the correctly classified samples with a weighted ℓ₁ norm term (see supplement). To further enhance sparsity, we may introduce an explicit ℓ₁ norm regularization term into the projection step, with a regularization factor at round t denoted by α_t η_t. The solution of the projection step is then normalized to obtain a feasible point on the probability simplex. This algorithm is listed in Algorithm 2. Note that the dominant regularization factor is η_t d_t^i, which only pushes the weights of the correctly classified samples to zero, i.e., when d_t^i < 0. 
This can become evident by substituting the update step into the projection step for z_{t+1}. For simplicity we consider two cases: when α_t = min(1, (1/2)γ_t||y_t||₁) and when α_t = 0. The following theorem bounds the training error.

Theorem 4. Suppose that SparseBoost generates weak hypotheses h₁, . . . , h_T whose edges are γ₁, . . . , γ_T. Then the error ε of the combined hypothesis f on the training set is bounded as follows:

ε ≤ 1 / (1 + c Σ_{t=1}^T γ_t² ||y_t||₁²)   (16)

Note that this bound holds for any choice of α_t ∈ [0, min(1, γ_t||y_t||₁)). In particular, in our two cases the constant c is 1 for α_t = 0, and 1/4 when α_t = min(1, (1/2)γ_t||y_t||₁).

For α_t = 0, the ℓ₁ norm of the weights ||y_t||₁ can be bounded away from zero by 1/N (see supplement). Thus, the error ε tends to zero at rate O(N²/(γ²T)). That is, in this case SparseBoost is a provable boosting algorithm. However, for α_t ≠ 0, the ℓ₁ norm ||y_t||₁ may rapidly go to zero, which consequently results in a non-vanishing upper bound (as T increases) for the training error in (16). In this case, it may not be possible to conclude that the algorithm is in fact a boosting algorithm5. It is noteworthy that SparseBoost can be seen as a variant of the COMID algorithm in [17].

Algorithm 2: SparseBoost

Let R^N₊ = {y | 0 ≤ y_i}; set y₁ = [1/N, . . . , 1/N]⊤;
At t = 1, . . . , T: train h_t, set (η_t = γ_t||y_t||₁/N, α_t = 0) or (η_t = γ_t||y_t||₁/(2N), α_t = (1/2)γ_t||y_t||₁), and update

z_{t+1} = y_t + η_t d_t
y_{t+1} = argmin_{y ∈ R^N₊} (1/2)||y − z_{t+1}||² + α_t η_t ||y||₁   →   y_{t+1}^i = max(0, y_t^i + η_t d_t^i − α_t η_t)
w_{t+1} = y_{t+1} / Σ_{i=1}^N y_{t+1}^i

Output the final hypothesis f(x) = sign(Σ_{t=1}^T η_t h_t(x)).

3.4 Lazy Update Boosting

In this section, we present the proof of Theorem 1 for the lazy update version of MABoost (LAMABoost). The proof technique is novel and can be used to generalize several known online learning algorithms, such as OMDA in [26] and the Meta algorithm in [27]. Moreover, we show that MadaBoost [10] can be presented in the LAMABoost setting. This gives a simple proof for MadaBoost without making the assumption that the edge sequence is monotonically decreasing (as in [10]).

Proof: Assume w* = [w*₁, . . . , w*_N]⊤ is a distribution vector where w*_i = 1/(Nε) if f(x_i) ≠ a_i, and 0 otherwise. Then,

(w* − w_t)⊤η_t d_t = (w_{t+1} − w_t)⊤(∇R(z_{t+1}) − ∇R(z_t))
  + (z_{t+1} − w_{t+1})⊤(∇R(z_{t+1}) − ∇R(z_t)) + (w* − z_{t+1})⊤(∇R(z_{t+1}) − ∇R(z_t))
 ≤ (1/2)||w_{t+1} − w_t||² + (1/2)η_t²||d_t||*² + B_R(w_{t+1}, z_{t+1}) − B_R(w_{t+1}, z_t) + B_R(z_{t+1}, z_t)
  − B_R(w*, z_{t+1}) + B_R(w*, z_t) − B_R(z_{t+1}, z_t)
 ≤ (1/2)||w_{t+1} − w_t||² + (1/2)η_t²||d_t||*² − B_R(w_{t+1}, w_t)
  + B_R(w_{t+1}, z_{t+1}) − B_R(w_t, z_t) − B_R(w*, z_{t+1}) + B_R(w*, z_t)   (17)

where the first inequality follows from applying Lemma 3 to the first term and Lemma 2 to the remaining terms, and the second inequality results from applying the exact version of Lemma 1 to B_R(w_{t+1}, z_t). 
Moreover, since B_R(w_{t+1}, w_t) − (1/2)||w_{t+1} − w_t||² ≥ 0, these terms can be ignored in (17). Summing inequality (17) from t = 1 to T yields

−B_R(w*, z₁) ≤ L Σ_{t=1}^T (1/2)η_t² − Σ_{t=1}^T η_t γ_t   (18)

where we used the facts that w*⊤ Σ_{t=1}^T η_t d_t ≥ 0 and Σ_{t=1}^T −w_t⊤η_t d_t = Σ_{t=1}^T η_t γ_t. The above inequality is exactly the same as (15), and replacing −B_R with (ε − 1)/(2Nε) or log(ε) yields the same error bounds as in Theorem 1. Note that, since the exact version of Lemma 1 is required to obtain (17), this proof does not reveal whether LAMABoost can be generalized to employ the double-projection strategy. In some particular cases, however, we may show that a double-projection variant of LAMABoost is still a provable boosting algorithm.

In the following, we briefly show that MadaBoost can be seen as a double-projection LAMABoost.

Algorithm 3: Variant of MadaBoost

Let R(w) be the negative entropy and K the unit hypercube; set z₁ = [1, . . . , 1]⊤;
At t = 1, . . . , T: train h_t with w_t, set f_t(x) = sign(Σ_{t'=1}^t η_{t'} h_{t'}(x)), calculate ε_t = (1/N) Σ_{i=1}^N (1/2)|f_t(x_i) − a_i|, set η_t = ε_t γ_t and update

∇R(z_{t+1}) = ∇R(z_t) + η_t d_t   →   z_{t+1}^i = z_t^i e^{η_t d_t^i}
y_{t+1} = argmin_{y ∈ K} B_R(y, z_{t+1})   →   y_{t+1}^i = min(1, z_{t+1}^i)
w_{t+1} = argmin_{w ∈ S} B_R(w, y_{t+1})   →   w_{t+1}^i = y_{t+1}^i / ||y_{t+1}||₁

Output the final hypothesis f(x) = sign(Σ_{t=1}^T η_t h_t(x)).

Algorithm 3 is essentially MadaBoost, only with a different choice of η_t. 

5Nevertheless, for some choices of α_t ≠ 0, such as α_t ∝ 1/t², the boosting property of the algorithm is still provable.
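The two projection steps of Algorithm 3 are cheap to write down. The sketch below uses our own names (not code from the paper): `kl_div` is the Bregman divergence of negative entropy, the hypercube projection is an elementwise clip, and the simplex projection is normalization; the approximate-projection inequality of Lemma 4 can be checked numerically with it.

```python
import numpy as np

def kl_div(x, y):
    # Unnormalized KL divergence: the Bregman divergence of negative
    # entropy, valid for strictly positive vectors x, y.
    return np.sum(x * np.log(x / y) - x + y)

def entropy_proj_hypercube(z):
    # Entropy Bregman projection onto the unit hypercube K = [0, 1]^N:
    # the elementwise clip min(1, z_i).
    return np.minimum(1.0, z)

def double_projection(z):
    # Approximate projection onto the simplex S as in Lemma 4:
    # first project onto K, then project the result onto S (normalization).
    y = entropy_proj_hypercube(z)
    return y / np.sum(y)
```

For any x in the simplex and any positive z, Lemma 4 guarantees kl_div(x, z) ≥ kl_div(x, double_projection(z)), since S is contained in the hypercube K.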
It is well-known that the entropy projection onto the probability simplex results in normalization, and hence the second projection step of Algorithm 3. The entropy projection onto the unit hypercube, however, may be less known, and thus its proof is given in the Supplement.

Theorem 5. Algorithm 3 yields a (1 − ε)-accurate hypothesis after at most T = O(1/(ε²γ²)) rounds.

This is an important result since it shows that MadaBoost is, at least in theory, slower than what we hoped, namely O(1/(εγ²)).

4 Conclusion and Discussion

In this work, we provided a boosting framework that can produce provable boosting algorithms. This framework is mainly suitable for designing boosting algorithms with distribution constraints. A sparse boosting algorithm that samples only a fraction of examples at each round was derived from this framework. However, since our proposed algorithm cannot control the exact number of zeros in the weight vector, a natural extension is a boosting algorithm that receives the sparsity level as an input. This immediately raises the question: what is the maximum number of examples that can be removed from the dataset at each round, while still achieving a (1 − ε)-accurate hypothesis?

The boosting framework derived in this work is essentially the dual of the online mirror descent algorithm. This framework can be generalized in different ways. Here, we showed that replacing the Bregman projection step with the double-projection strategy, or as we call it, approximate Bregman projection, still results in a boosting algorithm in the active version of MABoost, though this may not hold for the lazy version. In some special cases (MadaBoost, for instance), however, it can be shown that this double-projection strategy works for the lazy version as well. 
Our conjecture is that under some conditions on the first convex set, the lazy version can also be generalized to work with the approximate projection operator. Finally, we provided a new error bound for the MadaBoost algorithm that does not depend on any assumption. Contrary to the common conjecture, the convergence rate of MadaBoost (at least with our choice of $\eta$) is $O(1/\epsilon^2)$.

Acknowledgments

This work was partially supported by SNSF. We would like to thank Professor Rocco Servedio for an inspiring email conversation and our colleague Hui Liang for his helpful comments.

References

[1] R. E. Schapire. The strength of weak learnability. Machine Learning, 1990.

[2] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.

[3] L. Breiman. Prediction games and arcing algorithms. Neural Computation, 1999.

[4] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent. In NIPS, 1999.

[5] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 1998.

[6] R. A. Servedio. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 2003.

[7] D. Gavinsky. Optimally-smooth adaptive boosting and application to agnostic learning. Journal of Machine Learning Research, 2003.

[8] J. K. Bradley and R. E. Schapire. FilterBoost: Regression and classification on large datasets. In NIPS, 2008.

[9] K. Hatano. Smooth boosting using an information-based criterion. In Algorithmic Learning Theory, 2006.

[10] C. Domingo and O. Watanabe. MadaBoost: A modification of AdaBoost. In COLT, 2000.

[11] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 1995.

[12] N. H. Bshouty, D. Gavinsky, and M. Long. On boosting with polynomially bounded distributions. Journal of Machine Learning Research, 2002.

[13] M. K. Warmuth, J. Liao, and G. Rätsch. Totally corrective boosting algorithms that maximize the margin. In ICML, 2006.

[14] S. Shalev-Shwartz and Y. Singer. On the equivalence of weak learnability and linear separability: new relaxations and efficient boosting algorithms. In COLT, 2008.

[15] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

[16] E. Hazan. A survey: The convex optimization approach to regret minimization. Working draft, 2009.

[17] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In COLT, 2010.

[18] Y. Freund and R. E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996.

[19] L. Breiman. Pasting bites together for prediction in large data sets and on-line. Technical report, Dept. Statistics, Univ. California, Berkeley, 1997.

[20] K. Hatano and E. Takimoto. Linear programming boosting by column and row generation. In Proceedings of Discovery Science, 2009.

[21] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. In COLT, 1992.

[22] A. Kalai and V. Kanade. Potential-based agnostic boosting. In NIPS, 2009.

[23] S. Ben-David, P. Long, and Y. Mansour. Agnostic boosting. In COLT, 2001.

[24] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[25] W. Dai, Q. Yang, G. Xue, and Y. Yu. Boosting for transfer learning. In ICML, 2007.

[26] A. Rakhlin and K. Sridharan. Online learning with predictable sequences. In COLT, 2013.

[27] C. Chiang, T. Yang, C. Lee, M. Mahdavi, C. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. In COLT, 2012.