{"title": "Potential-Based Agnostic Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 880, "page_last": 888, "abstract": "We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent non-trivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). Experiments suggest that the algorithm performs similarly to Madaboost.", "full_text": "Potential-Based Agnostic Boosting\n\nAdam Tauman Kalai\nMicrosoft Research\n\nadum@microsoft.com\n\nVarun Kanade\n\nHarvard University\n\nvkanade@fas.harvard.edu\n\nAbstract\n\nWe prove strong noise-tolerance properties of a potential-based boosting algo-\nrithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost\n(Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire\nand Sellie (1994), giving polynomial-time guarantees in presence of arbitrary\nnoise. A remarkable feature of our algorithm is that it can be implemented with-\nout reweighting examples, by randomly relabeling them instead. 
Our boosting theorem gives, as easy corollaries, alternative derivations of two recent non-trivial results in computational learning theory: agnostically learning decision trees (Gopalan et al., 2008) and agnostically learning halfspaces (Kalai et al., 2005). Experiments suggest that the algorithm performs similarly to MadaBoost.

1 Introduction

Boosting procedures attempt to improve the accuracy of general machine learning algorithms through repeated executions on reweighted data. Aggressive reweighting of data may lead to poor performance in the presence of certain types of noise [1]. This has been addressed by a number of "robust" boosting algorithms, such as SmoothBoost [2, 3] and MadaBoost [4], as well as boosting by branching programs [5, 6]. Some of these algorithms are potential-based boosters, i.e., natural variants on AdaBoost [7], while others are perhaps less practical but have stronger theoretical guarantees in the presence of noise.

The present work gives a simple potential-based boosting algorithm with guarantees in the (arbitrary noise) agnostic learning setting [8, 9]. A unique feature of our algorithm, illustrated in Figure 1, is that it does not alter the distribution on unlabeled examples but rather alters the labels. This enables us to prove a strong boosting theorem in which the weak learner need only succeed for one distribution on unlabeled examples. To the best of our knowledge, earlier weak-to-strong boosting theorems have always relied on the ability of the weak learner to succeed under arbitrary distributions.
The utility of our boosting theorem is demonstrated by re-deriving two non-trivial results in computational learning theory, namely agnostically learning decision trees [10] and agnostically learning halfspaces [11], which were previously solved using very different techniques.

The main contributions of this paper are, first, giving the first provably noise-tolerant analysis of a potential-based boosting algorithm, and, second, giving a distribution-specific boosting theorem that does not require the weak learner to learn over all distributions on x ∈ X. This is in contrast to recent work by Long and Servedio, showing that convex potential boosters cannot work in the presence of random classification noise [12]. The present algorithm circumvents that impossibility result in two ways. First, the algorithm has the possibility of negating the current hypothesis and hence is not technically a standard potential-based boosting algorithm. Second, weak agnostic learning is more challenging than weak learning with random classification noise, in the sense that an algorithm which is a weak learner in the random classification noise setting need not be a weak learner in the agnostic setting.

Simplified Boosting by Relabeling Procedure
Inputs: (x_1, y_1), . . . , (x_m, y_m) ∈ X × {−1, 1}, T ≥ 1, and weak learner W.
Output: classifier h : X → {−1, 1}.

1. Let H^0 = 0
2. For t = 1, . . . , T:
   (a) For i = 1, . . . , m:
       • w_i^t = min{1, exp(−H^{t−1}(x_i) y_i)}
       • With probability w_i^t, set ỹ_i^t = y_i; otherwise pick ỹ_i^t ∈ {−1, 1} randomly
   (b) g^t = W((x_1, ỹ_1^t), . . . , (x_m, ỹ_m^t)).
   (c) h^t = argmax_{g ∈ {g^t, −sign(H^{t−1})}} Σ_i w_i^t y_i g(x_i). /* possibly take negated hypothesis */
   (d) γ^t = (1/m) Σ_{i=1}^m w_i^t y_i h^t(x_i)
   (e) H^t(x) = H^{t−1}(x) + γ^t h^t(x)
3. Output h = sign(H^T) as hypothesis.

Figure 1: Simplified Boosting by Relabeling Procedure. Each epoch, the algorithm runs the weak learner on relabeled data ⟨(x_i, ỹ_i^t)⟩_{i=1}^m. In traditional boosting, on each epoch, H^t is a linear combination of weak hypotheses. For our agnostic analysis, we also need to include the negated current hypothesis, −sign(H^{t−1}) : X → {−1, 1}, as a possible weak classifier. *In practice, to avoid adding noise, each example would be replaced with three weighted examples: (x_i, y_i) with weight w_i^t, and (x_i, ±1) each with weight (1 − w_i^t)/2.

Related work. There is a substantial literature on robust boosting algorithms, including the algorithms already mentioned, MadaBoost and SmoothBoost, as well as LogitBoost [13], BrownBoost [14], NadaBoost [15] and others [16, 17], including extensive experimentation [18, 15, 19]. These are all simple boosting algorithms whose output is a weighted majority of classifiers. Many have been shown to have formal boosting properties (weak to strong PAC-learning) in a noiseless setting, or partial boosting properties in noisy settings. There has also been a line of work on boosting algorithms that provably boost from weak to strong learners either under agnostic or random classification noise, using branching programs [17, 20, 5, 21, 6]. Our results are stronger than those in the recent work of Kalai, Mansour and Verbin [6], for two main reasons. First, we propose a simple potential-based algorithm that can be implemented efficiently. Second, since we don't change the distribution over unlabeled examples, we can boost distribution-specific weak learners.
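The simplified procedure of Figure 1 can be sketched in a few lines of Python (a minimal NumPy sketch, not the authors' code; hypotheses are represented by their prediction vectors on the fixed sample, and `stump` is a hypothetical toy weak learner supplied for illustration):

```python
import numpy as np

def sgn(z):
    # sign with sign(0) = +1, matching the paper's convention
    return np.where(np.asarray(z) >= 0, 1, -1)

def boost_by_relabeling(X, y, weak_learner, T, seed=0):
    """Sketch of Figure 1: boosting by randomly relabeling examples."""
    rng = np.random.default_rng(seed)
    m = len(y)
    H = np.zeros(m)                                  # running scores H^{t-1}(x_i)
    for _ in range(T):
        w = np.minimum(1.0, np.exp(-H * y))          # w_i^t = min{1, exp(-H^{t-1}(x_i) y_i)}
        keep = rng.random(m) < w
        y_rel = np.where(keep, y, rng.choice([-1, 1], size=m))  # random relabeling
        g = weak_learner(X, y_rel)                   # weak hypothesis g^t
        candidates = [g, -sgn(H)]                    # possibly take negated hypothesis
        corrs = [np.mean(w * y * c) for c in candidates]
        k = int(np.argmax(corrs))
        gamma = corrs[k]                             # step size gamma^t
        H = H + gamma * candidates[k]                # H^t = H^{t-1} + gamma^t h^t
    return sgn(H)                                    # h = sign(H^T) on the sample

def stump(X, labels):
    # hypothetical toy weak learner: best single-feature sign stump
    best, best_corr = None, -np.inf
    for j in range(X.shape[1]):
        for s in (1, -1):
            pred = s * sgn(X[:, j])
            corr = np.mean(pred * labels)
            if corr > best_corr:
                best, best_corr = pred, corr
    return best
```

On a noisy synthetic problem this sketch recovers roughly the accuracy of the best stump, which is the qualitative behavior the analysis predicts.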
In recent work, using a similar idea of relabeling, Kalai, Kanade and Mansour [22] proved that the class of DNFs is learnable in a one-sided error agnostic learning model. Their algorithm is essentially a simpler form of boosting.

Experiments. Our boosting procedure is quite similar to MadaBoost. The main differences are: (1) there is the possibility of using the negation of the current hypothesis at each step, (2) examples are relabeled rather than reweighted, and (3) the step size is slightly different. The goal of the experiments was to understand how significant these differences may be in practice. Preliminary experimental results, presented in Section 5, suggest that all of these modifications matter less in practice than in theory. Hence, the present simple analysis can be viewed as a theoretical justification for the noise-tolerance of MadaBoost and SmoothBoost.

1.1 Preliminaries

In the agnostic setting, we consider learning with respect to a distribution over X × Y. For simplicity, we will take X to be finite or countable and Y = {−1, 1}. Formally, learning is with respect to some class of functions, C, where each c ∈ C is a binary classifier c : X → {−1, 1}. There is an arbitrary distribution µ over X and an arbitrary target function f : X → [−1, 1]. Together these determine an arbitrary joint distribution D = ⟨µ, f⟩ over X × {−1, 1}, where D(x, y) = µ(x)(1 + yf(x))/2, i.e., f(x) = E_D[y|x]. The error and correlation¹ of a classifier h : X → {−1, 1} with respect to D are respectively defined as

err(h, D) = Pr_{(x,y)∼D}[h(x) ≠ y];  cor(h, D) = E_{(x,y)∼D}[h(x)y] = E_{x∼µ}[h(x)f(x)] = 1 − 2 err(h, D).

We will omit D when understood from context.

¹This quantity is typically referred to as the edge in the boosting literature. However, cor(h, D) = 2 edge(h, D) according to the standard notation, hence we use the notation cor.
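For concreteness, the joint distribution D = ⟨µ, f⟩ and the identity cor(h, D) = E_{x∼µ}[h(x)f(x)] = 1 − 2 err(h, D) can be checked by simulation (a small illustration; the particular µ, f and h below are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# A small finite X with an arbitrary marginal mu and target f : X -> [-1, 1].
X = np.arange(5)
mu = np.array([0.1, 0.3, 0.2, 0.25, 0.15])   # distribution over X, sums to 1
f = np.array([0.8, -0.5, 0.0, 1.0, -0.9])    # f(x) = E[y | x]

# Sample (x, y) ~ D, where D(x, y) = mu(x) * (1 + y f(x)) / 2.
n = 200_000
xs = rng.choice(X, size=n, p=mu)
ys = np.where(rng.random(n) < (1 + f[xs]) / 2, 1, -1)

h = np.where(X % 2 == 0, 1, -1)              # some classifier h : X -> {-1, 1}
err = np.mean(h[xs] != ys)
cor = np.mean(h[xs] * ys)

# cor(h, D) = 1 - 2 err(h, D) holds exactly per sample...
assert abs(cor - (1 - 2 * err)) < 1e-12
# ...and the empirical correlation matches E_mu[h(x) f(x)] up to sampling error.
assert abs(cor - np.sum(mu * h * f)) < 0.02
```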
The goal of the learning algorithm is to achieve error (equivalently, correlation) arbitrarily close to that of the best classifier in C, namely,

err(C) = err(C, D) = inf_{c∈C} err(c, D);  cor(C) = cor(C, D) = sup_{c∈C} cor(c, D).

A γ-weakly accurate classifier [23] for PAC (noiseless) learning is simply one whose correlation is at least γ (for some γ ∈ (0, 1)). A different definition of weakly accurate classifier is appropriate in the agnostic setting. Namely, for some γ ∈ (0, 1), h : X → {−1, 1} is said to be γ-optimal for C (and D) if,

cor(h, D) ≥ γ cor(C, D).

Hence, if the labels are totally random then a weak hypothesis need not have any correlation over random guessing. On the other hand, in a noiseless setting, where cor(C) = 1, this is equivalent to a γ-weakly accurate hypothesis. The goal is to boost from an algorithm capable of outputting γ-optimal hypotheses to one which outputs a nearly 1-optimal hypothesis, even for small γ.

Let D be a distribution over X × {−1, 1}. Let w : X × {−1, 1} → [0, 1] be a weighting function. We now define the distribution D relabeled by w, written R_{D,w}. Procedurally, one can think of generating a sample from R_{D,w} by drawing an example (x, y) from D, then with probability w(x, y), outputting (x, y) directly, and with probability 1 − w(x, y), outputting (x, y′) where y′ is uniformly random in {−1, 1}. Formally,

R_{D,w}(x, y) = D(x, y) (w(x, y) + (1 − w(x, y))/2) + D(x, −y) (1 − w(x, −y))/2.

Note that D and R_{D,w} have the same marginal distributions over unlabeled examples x ∈ X. Also, observe that, for any D, w, and h : X → R,

E_{(x,y)∼R_{D,w}}[h(x)y] = E_{(x,y)∼D}[h(x)y w(x, y)]   (1)

This can be seen by the procedural interpretation above.
When (x, y) is returned directly, which happens with probability w(x, y), we get a contribution of h(x)y, but E[h(x)y′] = 0 for uniform y′ ∈ {−1, 1}.

It is possible to describe traditional supervised learning and active (query) learning in the same framework. A general (m, q)-learning algorithm is given m unlabeled examples ⟨x_1, . . . , x_m⟩, and may make q label queries to a query oracle L : X → {−1, 1}, and it outputs a classifier h : X → {−1, 1}. The queries may be active, meaning that queries may only be made to training examples x_i, or membership queries, meaning that arbitrary examples x ∈ X may be queried. The active query setting where q = m is the standard supervised learning setting where all m labels may be queried. One can similarly model semi-supervised learning.

Since our boosting procedure does not change the distribution over unlabeled examples, it offers two advantages: (1) agnostic weak learning may be defined with respect to a single distribution µ over unlabeled examples, and (2) the weak learning algorithms may be active (or use membership queries). In particular, the agnostic weak learning hypothesis for C and µ is that for any f : X → [−1, 1], given examples from D = ⟨µ, f⟩, the learner will output a γ-optimal classifier for C. The advantages of this new definition are: (a) it is not with respect to every distribution on unlabeled examples (the algorithm may only have guarantees for certain distributions), and (b) it is more realistic as it does not assume noiseless data. Finding such a weak learner may be quite challenging since it has to succeed in the agnostic model (where no assumption is made on f); however, it may be a bit easier in the sense that the learning algorithm need only handle one particular µ.

Definition 1. 
A learning algorithm is a (γ, ε0, δ) agnostic weak learner for C and µ over X if, for any f : X → [−1, 1], with probability ≥ 1 − δ over its random input, the algorithm outputs h : X → [−1, 1] such that, if D = ⟨µ, f⟩,

cor(h, D) = E_{x∼µ}[h(x)f(x)] ≥ γ (sup_{c∈C} E_{x∼µ}[c(x)f(x)] − ε0).

The ε0 parameter typically decreases quickly with the size of the training data, e.g., O(m^{−1/2}). To see why it is necessary, consider a class C = {c_1, c_2} consisting of only two classifiers, where one of them has correlation 0 and the other has minuscule positive correlation. Then, one cannot even identify which one has better correlation to within O(m^{−1/2}) using m examples. Note that δ can easily be made exponentially small (boosting confidence) using standard techniques.

Lastly, we define sign(z) to be 1 if z ≥ 0 and −1 if z < 0.

2 Formal boosting procedure and main results

The formal boosting procedure we analyze is given in Figure 2.

AGNOSTIC BOOSTER
Inputs: ⟨x_1, . . . , x_{Tm+s}⟩, T, s ≥ 1, label oracle L : X → {−1, 1}, (m, q)-learner W.
Output: classifier h : X → {−1, 1}.
1. Let H^0 = 0
2. Query the labels of the first s examples to get y_1 = L(x_1), . . . , y_s = L(x_s).
3. For t = 1, . . . , T:
   a) Define w^t(x, y) = −φ′(H^{t−1}(x)y) = min{1, exp(−H^{t−1}(x)y)}.
      Define L^t : X → {−1, 1} by:
      i) On input x ∈ X, let y = L(x).
      ii) With probability w^t(x, y), return y.
      iii) Otherwise return −1 or 1 with equal probability.
   b) Let g^t = W(⟨x_{s+(t−1)m+1}, . . . , x_{s+tm}⟩, L^t)
   c) Let
      i) α^t = (1/s) Σ_{i=1}^s g^t(x_i) y_i w^t(x_i, y_i)
      ii) β^t = (1/s) Σ_{i=1}^s −sign(H^{t−1}(x_i)) y_i w^t(x_i, y_i)
   d) If α^t ≥ β^t, set h^t = g^t and γ^t = α^t; else, set h^t = −sign(H^{t−1}) and γ^t = β^t.
   e) H^t(x) = H^{t−1}(x) + γ^t h^t(x)
4. Output h = sign(H^τ) where τ is chosen so as to minimize empirical error on ⟨(x_1, y_1), . . . , (x_s, y_s)⟩.

Figure 2: Formal Boosting by Relabeling Procedure.

Theorem 1. If W is a (γ, ε0, δ) weak learner with respect to C and µ, s = (200/(γ²ε²)) log(1/δ), and T = 29/(γ²ε²), then Algorithm AGNOSTIC BOOSTER (Figure 2) with probability at least 1 − 4δT outputs a hypothesis h satisfying:

cor(h, D) ≥ cor(C, D) − ε0/γ − ε.

Recall that ε0 is intended to be very small, e.g., O(m^{−1/2}). Also note that the number of calls to the query oracle L is s plus T times the number of calls made by the weak learner (if the weak learner is active, then so is the boosting algorithm). We show that two recent non-trivial results, viz. agnostically learning decision trees and agnostically learning halfspaces, follow as corollaries to Theorem 1. The two results are stated below:

Theorem 2 ([10]). Let C be the class of binary decision trees on {−1, 1}^n with at most t leaves, and let U be the uniform distribution on {−1, 1}^n. There exists an algorithm that, when given t, n, ε, δ > 0, and a label oracle for an arbitrary f : {−1, 1}^n → [−1, 1], makes q = poly(nt/(εδ)) membership queries and, with probability ≥ 1 − δ, outputs h : {−1, 1}^n → {−1, 1} such that for U_f = ⟨U, f⟩, err(h, U_f) ≤ err(C, U_f) + ε.

Theorem 3 ([11]). 
For any fixed ε > 0, there exists a univariate polynomial p such that the following holds: Let n ≥ 1, let C be the class of halfspaces in n dimensions, let U be the uniform distribution on {−1, 1}^n, and let f : {−1, 1}^n → [−1, 1] be an arbitrary function. There exists a polynomial-time algorithm that, when given m = p(n log(1/δ)) labeled examples from U_f = ⟨U, f⟩, outputs a classifier h : {−1, 1}^n → {−1, 1} such that err(h, U_f) ≤ err(C, U_f) + ε. (The algorithm makes no queries.)

Note that a related theorem was shown for halfspaces over log-concave distributions over X = R^n. The boosting approach here generalizes to that case in a straightforward manner. This illustrates how, from the point of view of designing provably efficient agnostic learning algorithms, the current boosting procedure may be useful.

3 Analysis of Boosting Algorithm

This section is devoted to the analysis of algorithm AGNOSTIC BOOSTER (see Figure 2). As is standard, the boosting algorithm can be viewed as minimizing a convex potential function. However, the proof is significantly different from the analysis of AdaBoost [7], which simply uses the fact that the potential is an upper bound on the error rate.

Our analysis has two parts. First, we define a conservative relabeling, such as the one we use, to be one which never relabels/downweights examples that the booster currently misclassifies. We show that for a conservative reweighting, either the weak learner will make progress, returning a hypothesis correlated with the relabeled distribution, or −sign(H^{t−1}) will be correlated with the relabeled distribution.

Second, if we find a hypothesis correlated with the relabeled distribution, then the potential on round t will be noticeably lower than that of round t − 1.
This is essentially a simple gradient descent analysis, using a bound on the second derivative of the potential. Since the potential is between 0 and 1, it can only drop for so many rounds. This implies that sign(H^t) must be a near-optimal classifier for some t (though the only sure way we have of knowing which one to pick is by testing accuracy on held-out data).

The potential function we consider, as in MadaBoost, is defined by φ : R → R,

φ(z) = 1 − z if z ≤ 0;  φ(z) = e^{−z} if z > 0.

Define the potential of a (real-valued) hypothesis H with respect to a distribution D over X × {−1, 1} as:

Φ(H, D) = E_{(x,y)∼D}[φ(yH(x))]   (2)

Note that Φ(H^0, D) = Φ(0, D) = 1. We will show that the potential decreases every round of the algorithm. Notice that the weights in the boosting algorithm correspond to the derivative of the potential, because −φ′(z) = min{1, exp(−z)} ∈ [0, 1]. In other words, the weak learning step is essentially a gradient descent step.

We next state a key fact about agnostic learning in Lemma 1.

Definition 2. Let h : X → {−1, 1} be a hypothesis. Then a weighting function w : X × {−1, 1} → [0, 1] is called conservative for h if w(x, −h(x)) = 1 for all x ∈ X.

Note that, if the hypothesis is sign(H^t(x)), then a weighting function defined by −φ′(H^t(x)y) is conservative if and only if φ′(z) = −1 for all z < 0. We first show that relabeling according to a conservative weighting function is good in the sense that, if h is far from optimal according to the original distribution, then after relabeling by w it is even further from optimal.

Lemma 1. 
For any distribution D over X × {−1, 1}, classifiers c, h : X → {−1, 1}, and any weighting function w : X × {−1, 1} → [0, 1] conservative for h,

cor(c, R_{D,w}) − cor(h, R_{D,w}) ≥ cor(c, D) − cor(h, D).

Proof. By the definition of correlation and eq. (1), cor(c, R_{D,w}) = E_D[c(x)y w(x, y)]. Hence,

cor(c, R_{D,w}) − cor(h, R_{D,w}) = cor(c, D) − cor(h, D) − E_{(x,y)∼D}[(c(x) − h(x)) y (1 − w(x, y))].

Finally, consider two cases. In the first case, when 1 − w(x, y) > 0, we have h(x)y = 1 (since w is conservative for h) while c(x)y ≤ 1. The second case is 1 − w(x, y) = 0. In either case, (c(x) − h(x)) y (1 − w(x, y)) ≤ 0. Thus the above equation implies the lemma.

We will use Lemma 1 to show that the weak learner will return a useful hypothesis. The case in which the weak learner may not return a useful hypothesis is when cor(C, R_{D,w}) = 0, i.e., when the optimal classifier on the reweighted distribution has no correlation. This can happen, but in this case it means that either our current hypothesis is close to optimal, or h = sign(H^{t−1}) is even worse than random guessing, and hence we can use its negation as a weak agnostic learner.

We next explain how a γ-optimal classifier on the reweighted distribution decreases the potential. We will use the following linear approximation property of φ.

Lemma 2. For any x, δ ∈ R, |φ(x + δ) − φ(x) − φ′(x)δ| ≤ δ²/2.

Proof. This follows from Taylor's theorem and the facts that the function φ is differentiable everywhere, and that the left and right second derivatives exist everywhere and are bounded by 1.

Let h^t : X → {−1, 1} be the weak hypothesis that the algorithm finds on round t. This may either be the hypothesis returned by the weak learner W or −sign(H^{t−1}).
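Lemma 2's quadratic approximation bound can be verified numerically for this φ (a quick grid check, not part of the paper's proof):

```python
import numpy as np

def phi(z):
    # MadaBoost-style potential: phi(z) = 1 - z for z <= 0, exp(-z) for z > 0
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0, 1.0 - z, np.exp(-np.clip(z, 0.0, None)))

def dphi(z):
    # phi'(z) = -1 for z <= 0 and -exp(-z) for z > 0, so -phi'(z) = min{1, exp(-z)}
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0, -1.0, -np.exp(-np.clip(z, 0.0, None)))

x, d = np.meshgrid(np.linspace(-5, 5, 201), np.linspace(-2, 2, 201))
gap = np.abs(phi(x + d) - phi(x) - dphi(x) * d)   # |phi(x+d) - phi(x) - phi'(x) d|
assert np.all(gap <= d**2 / 2 + 1e-12)            # Lemma 2: gap <= d^2 / 2

# the boosting weights are exactly the negated derivative of the potential
zs = np.linspace(-5, 5, 201)
assert np.allclose(-dphi(zs), np.minimum(1.0, np.exp(-zs)))
```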
The following lemma lower bounds the decrease in potential caused by adding γ^t h^t to H^{t−1}. We will apply it on each round of the algorithm to show that the potential decreases on each round, as long as the weak hypothesis h^t has non-negligible correlation and γ^t is suitably chosen.

Lemma 3. Consider any function H : X → R, hypothesis h : X → [−1, 1], γ ∈ R, and distribution D over X × {−1, 1}. Let D′ = R_{D,w} be the distribution D relabeled by w(x, y) = −φ′(yH(x)). Then,

Φ(H, D) − Φ(H + γh, D) ≥ γ cor(h, D′) − γ²/2.

Proof. For any (x, y) ∈ X × {−1, 1}, using Lemma 2 we know that:

φ(H(x)y) − φ((H(x) + γh(x))y) ≥ γh(x)y(−φ′(H(x)y)) − γ²/2.

In the step above we use the fact that h(x)²y² ≤ 1. Taking expectations over (x, y) from D,

Φ(H, D) − Φ(H + γh, D) ≥ γ E_{(x,y)∼D}[h(x)y(−φ′(H(x)y))] − γ²/2 = γ E_{(x,y)∼D′}[h(x)y] − γ²/2.

In the above we have used Eq. (1). We are done, by the definition of cor(h, D′).

Using all the above lemmas, we will show that the algorithm AGNOSTIC BOOSTER returns a hypothesis with correlation (or error) close to that of the best classifier from C. We are now ready to prove the main theorem.

Proof of Theorem 1. Suppose there exists c ∈ C such that cor(c, D) > cor(sign(H^{t−1}), D) + ε0/γ + ε. Then, applying Lemma 1 to H^{t−1} and setting w^t(x, y) = −φ′(H^{t−1}(x)y), we get that

cor(c, R_{D,w^t}) > cor(sign(H^{t−1}), R_{D,w^t}) + ε0/γ + ε   (3)

In this case we want to show that the algorithm successfully finds h^t with cor(h^t, R_{D,w^t}) ≥ γε/3.

Let g^t be the hypothesis returned by the weak learner W. 
From Step 3c) in the algorithm:

α^t = (1/s) Σ_{i=1}^s g^t(x_i) y_i w^t(x_i, y_i);  β^t = (1/s) Σ_{i=1}^s −sign(H^{t−1}(x_i)) y_i w^t(x_i, y_i).

When s = (200/(γ²ε²)) log(1/δ), by Chernoff-Hoeffding bounds we know that α^t and β^t are within an additive γε/20 of cor(g^t, R_{D,w^t}) and cor(−sign(H^{t−1}), R_{D,w^t}) respectively, with probability at least 1 − 2δ. As defined in Step 3d) of the algorithm, let γ^t = max(α^t, β^t). We allow the algorithm to fail with probability 3δ at this stage, possibly caused by the weak learner and the estimation of α^t, β^t.

Consider two cases. In the first case, cor(c, R_{D,w^t}) ≥ ε0/γ + ε/2, and then by the weak learning assumption, cor(g^t, R_{D,w^t}) ≥ γε/2. In the second case, if this does not hold, then cor(−sign(H^{t−1}), R_{D,w^t}) ≥ ε/2 using (3). Thus, even after taking into account the fact that the empirical estimates may be off from the true correlations by γε/20, we get that cor(h^t, R_{D,w^t}) ≥ γε/3 and that |γ^t − cor(h^t, R_{D,w^t})| ≤ γε/20.

Using this and Lemma 3, we get that by setting H^t = H^{t−1} + γ^t h^t the potential decreases by at least γ²ε²/29.

When t = 0 and H^0 = 0, Φ(H^0, D) = 1. Since Φ(H, D) > 0 for any H : X → R, we can have at most T = 29/(γ²ε²) such rounds. This guarantees that when the algorithm is run for T rounds, on some round t the hypothesis sign(H^t) will have correlation at least sup_{c∈C} cor(c, D) − ε0/γ − 2ε/3. For s = (200/(γ²ε²)) log(1/δ), the empirical estimate of the correlation of the constructed hypothesis on each round is within an additive ε/6 of its true correlation, allowing a further failure probability of δ each round. Thus the final hypothesis H^τ, which has the highest empirical correlation, satisfies

cor(H^τ, D) ≥ sup_{c∈C} cor(c, D) − ε0/γ − ε.

Since there is a failure probability of at most 4δ on each round, the algorithm succeeds with probability at least 1 − 4Tδ.

4 Applications

We show that recent agnostic learning analyses can be dramatically simplified using our boosting algorithm. Both of the agnostic algorithms are distribution-specific, meaning that they only work for one distribution (or a family of distributions) µ over unlabeled examples.

4.1 Agnostically Learning Decision Trees

Recent work has shown how to agnostically learn polynomial-sized decision trees using membership queries, by an L1 gradient-projection algorithm [10]. Here, we show that learning decision trees is quite simple using our distribution-specific boosting theorem and the Kushilevitz-Mansour membership query parity learning algorithm as a weak learner [24].

Lemma 4. Running the KM algorithm, using q = poly(n, t, 1/ε0) queries, and outputting the parity with largest magnitude of estimated Fourier coefficient, is a (γ = 1/t, ε0) agnostic weak learner for size-t decision trees over the uniform distribution.

The proof of this lemma is simple using results in [24] and is given in Appendix A. Theorem 2 now follows easily from Lemma 4 and Theorem 1.

4.2 Agnostically Learning Halfspaces

In the case of learning halfspaces, the weak learner simply finds the degree-d term χ_S(x), with |S| ≤ d, with greatest empirical correlation (1/m) Σ_{i=1}^m χ_S(x_i) y_i on a data set (x_1, y_1), . . .
, (x_m, y_m). The following lemma is useful in analyzing it.

Lemma 5. For any ε0 > 0, there exists d ≥ 1 such that the following holds. Let n ≥ 1, let C be the class of halfspaces in n dimensions, let U be the uniform distribution on {−1, 1}^n, and let f : {−1, 1}^n → [−1, 1] be an arbitrary function. Then there exists a set S ⊆ [n] of size |S| ≤ d = 20/ε0⁴ such that |cor(χ_S, U_f)| ≥ (cor(C, U_f) − ε0)/n^d.

Using results from [25], the proofs of Lemma 5 and Theorem 3 are straightforward and are given in Appendix B.

5 Experiments

We performed preliminary experiments with the new boosting algorithm presented here on 8 datasets from the UCI repository [26]. We converted multi-class problems into binary classification problems by arbitrarily grouping classes, and ran AdaBoost, MadaBoost and Agnostic Boost on these datasets, using stumps as weak learners. Since stumps can accept weighted examples, we passed the exact weighted distribution to the weak learner.

Our experiments were performed with fractional relabeling, which means the following. Rather than keeping the label with probability w^t(x, y) and making it completely random with the remaining probability, we added both (x, y) and (x, −y) with weights (1 + w^t(x, y))/2 and (1 − w^t(x, y))/2 respectively. Experiments with random relabeling showed that random relabeling performs much worse than fractional relabeling.

Table 1 summarizes the final test error on the datasets. In the case of the pima and german datasets, we observed overfitting and the reported test errors are the minimum test errors observed for all the algorithms. In all other cases the test error rate at the end of round 500 is reported. Only pendigits had a test dataset; for the rest of the datasets we performed 10-fold cross validation. We also added
We also added\nrandom classi\ufb01cation noise of 5%, 10% and 20% to the datasets and ran the boosting algorithms on\nthe modi\ufb01ed dataset.\n\n5% noise\n\nDataset\n\nsonar\n\n10% Noise\n\n20% Noise\n\nionosphere 8.6\n\n9.1\n\nNo Added Noise\nAda Mada Agn Ada Mada Agn Ada Mada Agn Ada Mada Agn\n12.4 14.8 15.3 23.9 20.6 24.0 26.5 26.3 25.1 34.2 32.7 34.5\n28.2 27.8\n23.7 23.0 23.6 26.1 24.9 25.7 27.6 26.4 26.7 34.3 34.5\n34\npima\ngerman\n23.1 23.6 23.1 28.5 27.7 27.5 29.0 29.5 30.0 35.0 34.5 35.1\nwaveform 10.4 10.2 10.3 14.9 15.0 13.9 20.1 19.2 19.1 27.9 27.3 27.1\n14.7 14.9 14.5 18.2 18.3 18.1 21.9 22.0 21.5 29.4 29.1 28.7\n17.4 18.2 18.3 20.9 21.4 21.5 24.6 24.9 25.2 31.4 31.8 31.6\n8.2 12.1 12.0 13.0 16.8 16.3 16.9 25.5 25.2 25.3\n7.4\n\n8.1 15.8 17.2 14.4 24.2 23.8 21.8 32\n\nmagic\nletter\n\npendigits\n\n7.3\n\nTable 1: Final test error rates of Adaboost, Madaboost and Agnostic Boosting on 8 datasets. The\n\ufb01rst column reports error rates on the original datasets, and the next three report errors on datasets\nwith 5%, 10% and 20% classi\ufb01cation noise added.\n\n6 Conclusion\n\nWe show that potential-based agnostic boosting is possible in theory, and also that this may be\ndone without changing the distribution over unlabeled examples. We show that non-trivial agnostic\nlearning results, for learning decision trees and halfspaces, can be viewed as simple applications of\nour boosting theorem combined with well-known weak learners. Our analysis can be viewed as a\ntheoretical justi\ufb01cation of noise tolerance properties of algorithms like Madaboost and Smoothboost.\nPreliminary experiments show that the performance of our boosting algorithm is comparable to that\nof Madaboost and Adaboost. A more thorough empirical evaluation of our boosting procedure using\ndifferent weak learners is part of future research.\n\nReferences\n[1] T. G. Dietterich. 
An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2):139–158, 2000.

[2] R. Servedio. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 4:633–648, 2003.

[3] D. Gavinsky. Optimally-smooth adaptive boosting and application to agnostic learning. Journal of Machine Learning Research, 4:101–117, 2003.

[4] C. Domingo and O. Watanabe. MadaBoost: A modification of AdaBoost. In Proceedings of the Thirteenth Annual Conference on Learning Theory, pages 180–189, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[5] A. Kalai and R. Servedio. Boosting in the presence of noise. In Proceedings of the 35th Annual Symposium on Theory of Computing (STOC), pages 196–205, 2003.

[6] A. T. Kalai, Y. Mansour, and E. Verbin. On agnostic boosting and parity learning. In STOC '08: Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 629–638, New York, NY, USA, 2008. ACM.

[7] Y. Freund and R. Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325–332, 1996.

[8] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inf. Comput., 100(1):78–150, 1992.

[9] M. Kearns, R. Schapire, and L. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2):115–141, 1994.

[10] P. Gopalan, A. T. Kalai, and A. R. Klivans. Agnostically learning decision trees. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pages 527–536, New York, NY, USA, 2008. ACM.

[11] A. T. Kalai, A. R. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. In Proc. 46th IEEE Symp. 
on Foundations of Computer Science (FOCS '05), 2005.

[12] P. M. Long and R. A. Servedio. Random classification noise defeats all convex potential boosters. In ICML, pages 608–615, 2008.

[13] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337–407, 2000.

[14] Y. Freund. An adaptive version of the boost-by-majority algorithm. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 102–113, 1999.

[15] M. Nakamura, H. Nomiya, and K. Uehara. Improvement of boosting algorithm by modifying the weighting rule. Annals of Mathematics and Artificial Intelligence, 41(1):95–109, 2004.

[16] T. Bylander and L. Tate. Using validation sets to avoid overfitting in AdaBoost. In G. Sutcliffe and R. Goebel, editors, FLAIRS Conference, pages 544–549. AAAI Press, 2006.

[17] S. Ben-David, P. M. Long, and Y. Mansour. Agnostic boosting. In Proceedings of the 14th Annual Conference on Computational Learning Theory, COLT 2001, volume 2111 of Lecture Notes in Artificial Intelligence, pages 507–516. Springer, 2001.

[18] R. A. McDonald, D. J. Hand, and I. A. Eckley. An empirical comparison of three boosting algorithms on real data sets with artificial class noise. In T. Windeatt and F. Roli, editors, Multiple Classifier Systems, volume 2709 of Lecture Notes in Computer Science, pages 35–44. Springer, 2003.

[19] J. K. Bradley and R. Schapire. FilterBoost: Regression and classification on large datasets. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 185–192. MIT Press, Cambridge, MA, 2008.

[20] Y. Mansour and D. McAllester. Boosting using branching programs. Journal of Computer and System Sciences, 64(1):103–112, 2002.

[21] P. M. Long and R. A. Servedio. Adaptive martingale boosting.
In NIPS, pages 977–984, 2008.

[22] A. T. Kalai, V. Kanade, and Y. Mansour. Reliable agnostic learning. In COLT '09: Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

[23] M. Kearns and L. Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM, 41(1):67–95, 1994.

[24] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. SIAM J. on Computing, 22(6):1331–1348, 1993.

[25] A. Klivans, R. O'Donnell, and R. Servedio. Learning intersections and thresholds of halfspaces. Journal of Computer & System Sciences, 68(4):808–840, 2004.

[26] A. Asuncion and D. J. Newman. UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science, 2007.