{"title": "On the Dynamics of Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 1101, "page_last": 1108, "abstract": "", "full_text": "On the Dynamics of Boosting\u2044\n\nCynthia Rudin\n\nIngrid Daubechies\n\nPrinceton University\n\nProgr. Appl. & Comp. Math.\n\nFine Hall\n\nWashington Road\n\nPrinceton, NJ 08544-1000\n\nDepartment of Computer Science\n\nRobert E. Schapire\nPrinceton University\n\n35 Olden St.\n\nPrinceton, NJ 08544\n\nschapire@cs.princeton.edu\n\nfcrudin,ingridg@math.princeton.edu\n\nAbstract\n\nIn order to understand AdaBoost\u2019s dynamics, especially its ability to\nmaximize margins, we derive an associated simpli\ufb01ed nonlinear iterated\nmap and analyze its behavior in low-dimensional cases. We \ufb01nd stable\ncycles for these cases, which can explicitly be used to solve for Ada-\nBoost\u2019s output. By considering AdaBoost as a dynamical system, we are\nable to prove R\u00a8atsch and Warmuth\u2019s conjecture that AdaBoost may fail\nto converge to a maximal-margin combined classi\ufb01er when given a \u2018non-\noptimal\u2019 weak learning algorithm. AdaBoost is known to be a coordinate\ndescent method, but other known algorithms that explicitly aim to max-\nimize the margin (such as AdaBoost\u2044 and arc-gv) are not. We consider\na differentiable function for which coordinate ascent will yield a maxi-\nmum margin solution. We then make a simple approximation to derive a\nnew boosting algorithm whose updates are slightly more aggressive than\nthose of arc-gv.\n\n1\n\nIntroduction\n\nAdaBoost is an algorithm for constructing a \u201cstrong\u201d classi\ufb01er using only a training set and\na \u201cweak\u201d learning algorithm. A \u201cweak\u201d classi\ufb01er produced by the weak learning algorithm\nhas a probability of misclassi\ufb01cation that is slightly below 50%. A \u201cstrong\u201d classi\ufb01er\nhas a much smaller probability of error on test data. Hence, AdaBoost \u201cboosts\u201d the weak\nlearning algorithm to achieve a stronger classi\ufb01er. AdaBoost was the \ufb01rst practical boosting\nalgorithm, and due to its success, a number of similar boosting algorithms have since been\nintroduced (see [1] for an introduction). AdaBoost maintains a distribution (set of weights)\nover the training examples, and requests a weak classi\ufb01er from the weak learning algorithm\nat each iteration. Training examples that were misclassi\ufb01ed by the weak classi\ufb01er at the\ncurrent iteration then receive higher weights at the following iteration. The end result is a\n\ufb01nal combined classi\ufb01er, given by a thresholded linear combination of the weak classi\ufb01ers.\n\nOften, AdaBoost does not empirically seem to suffer badly from over\ufb01tting, even after\na large number of iterations. This lack of over\ufb01tting has been attributed to AdaBoost\u2019s\n\n\u2044This research was partially supported by NSF Grants IIS-0325500, CCR-0325463, ANI-\n\n0085984 and AFOSR Grant F49620-01-1-0099.\n\n\fability to generate a large margin, leading to a better guarantee on the generalization per-\nformance. When it is possible to achieve a positive margin, AdaBoost has been shown to\napproximately maximize the margin [2]. In particular, it is known that AdaBoost achieves a\nmargin of at least 1\n2 \u2030, where \u2030 is the largest margin that can possibly be attained by a com-\nbined classi\ufb01er (other bounds appear in [3]). Many of the subsequent boosting algorithms\nthat have emerged (such as AdaBoost\u2044 [4], and arc-gv [5]) have the same main outline as\nAdaBoost but attempt more explicitly to maximize the margin at the expense of lowering\nthe convergence rate; the trick seems to be to design an update for the combined classi\ufb01er\nthat maximizes the margin, has a fast rate of convergence, and is robust.\n\nFor all the extensive theoretical and empirical study of AdaBoost, it is still unknown if\nAdaBoost achieves a maximal margin solution, and thus the best upper bound on the prob-\nability of error (for margin-based bounds). While the limiting dynamics of the linearly\ninseparable case (i.e., \u2030 = 0) are fully understood [6], other basic questions about the dy-\nnamics of AdaBoost in the more common case \u2030 > 0 are unknown. For instance, we do\nnot know, in the limit of a large number of rounds, if AdaBoost eventually cycles among\nthe base classi\ufb01ers, or if its behavior is more chaotic.\n\nIn this paper, we study the dynamics of AdaBoost. First we simplify the algorithm to re-\nveal a nonlinear iterated map for AdaBoost\u2019s weight vector. This iterated map gives a direct\nrelation between the weights at time t and the weights at time t + 1, including renormal-\nization, thus providing a much more concise mapping than the original algorithm. We then\nprovide a speci\ufb01c set of examples in which trajectories of this iterated map converge to a\nlimit cycle, allowing us to calculate AdaBoost\u2019s output vector directly.\n\nThere are two interesting cases governing the dynamics: the case where the optimal weak\nclassi\ufb01ers are chosen at each iteration (the \u2018optimal\u2019 case), and the case where permissible\nnon-optimal weak classi\ufb01ers may be chosen (the \u2018non-optimal\u2019 case). In the optimal case,\nthe weak learning algorithm is required to choose a weak classi\ufb01er which has the largest\nedge at every iteration. In the non-optimal case, the weak learning algorithm may choose\nany weak classi\ufb01er as long as its edge exceeds \u2030, the maximum margin achievable by a\ncombined classi\ufb01er. This is a natural notion of non-optimality for boosting; thus it provides\na natural sense in which to measure robustness.\n\nBased on large scale experiments and a gap in theoretical bounds, R\u00a8atsch and Warmuth [3]\nconjectured that AdaBoost does not necessarily converge to a maximum margin classi\ufb01er\nin the non-optimal case, i.e., that AdaBoost is not robust in this sense. In practice, the weak\nclassi\ufb01ers are generated by CART or another heuristic weak learning algorithm, implying\nthat the choice need not always be optimal. In Section 3, we show this conjecture to be true\nusing a low-dimensional example. Thus, our low-dimensional study provides insight into\nAdaBoost\u2019s large scale dynamical behavior.\n\nAdaBoost, as shown by Breiman [5] and others, is actually a coordinate descent algo-\nrithm on a particular exponential loss function. However, minimizing this function in other\nways does not necessarily achieve large margins; the process of coordinate descent must be\nsomehow responsible. In Section 4, we introduce a differentiable function that can be max-\nimized to achieve maximal margins; performing coordinate ascent on this function yields\na new boosting algorithm that directly maximizes margins. This new algorithm and Ada-\nBoost use the same formula to choose a direction of ascent/descent at each iteration; thus\nAdaBoost chooses the optimal direction for this new setting. We approximate the update\nrule for coordinate ascent on this function and derive an algorithm with updates that are\nslightly more aggressive than those of arc-gv.\n\nWe proceed as follows: in Section 2 we introduce some notation and state the AdaBoost\nalgorithm. Then we decouple the dynamics for AdaBoost in the binary case to reveal a\nnonlinear iterated map. In Section 3, we analyze these dynamics for a simple case: the case\nwhere each hypothesis has one misclassi\ufb01ed point. In a 3 \u00a3 3 example, we \ufb01nd 2 stable\n\n\fcycles. We use these cycles to show that AdaBoost produces a maximal margin solution in\nthe optimal case; this result generalizes to m\u00a3 m. Then, we produce the example promised\nabove to show that AdaBoost does not necessarily converge to a maximal margin solution\nin the non-optimal case. In Section 4 we introduce a differentiable function that can be\nused to maximize the margin via coordinate ascent, and then approximate the coordinate\nascent update step to derive a new algorithm.\n\n2 Simpli\ufb01ed Dynamics of AdaBoost\n\nThe training set consists of f(xi; yi)gi=1::m, where each example (xi; yi) 2 X \u00a3 f\u00a11; 1g.\nDenote by dt 2 Rm the distribution (weights) over the training examples at iteration t,\nexpressed as a column vector. (Denote dT\nt as its transpose.) Denote by n the total number\nof classi\ufb01ers that can be produced by the weak learning algorithm. Since our classi\ufb01ers are\nbinary, n is \ufb01nite (at most 2m), but may be very large. The weak classi\ufb01ers are denoted\nh1; :::; hn, with hj : X ! f1;\u00a11g; we assume that for every hj on this list, \u00a1hj also\nappears. We construct a matrix M so that Mij = yihj(xi), i.e., Mij = +1 if training\nexample i is classi\ufb01ed correctly by hypothesis hj, and \u00a11 otherwise. The (unnormalized)\ncoef\ufb01cient of classi\ufb01er hj for the \ufb01nal combined hypothesis is denoted \u201aj, so that the \ufb01nal\ncombined hypothesis is fAda(x) = Pn\nj=1 \u201aj. (In\nthis paper, either hj or -hj remains unused.) The simplex of n-dimensional vectors with\npositive entries that sum to 1 will be denoted \u00a2n. The margin of training example i is\nde\ufb01ned by yifAda(xi), or equivalently (M\u201a)i=k\u201ak1, and the edge of hypothesis j with\nrespect to the training data (weighted by d) is (dT M)j, or 1 \u00a1 2\u00a3(probability of error of\nhj on the training set weighted by d). Our goal is to \ufb01nd a normalized vector ~\u201a 2 \u00a2n that\nmaximizes mini(M~\u201a)i. We call this minimum margin over training examples the margin\nof classi\ufb01er \u201a. Here is the AdaBoost algorithm and our reduction to an iterated map.\nAdaBoost (\u2018optimal\u2019 case):\n\nj=1(\u201aj=k\u201ak1)hj(x) where k\u201ak1 =Pn\n\n1. Input: Matrix M, Number of iterations tmax\n2. Initialize: \u201a1;j = 0 for j = 1; :::; n\n3. Loop for t = 1; :::; tmax\n\nt M)j\n\ni=1 e(\u00a1M\u201at)i for i = 1; :::; m\n\n(a) dt;i = e(\u00a1M\u201at)i =Pm\n2 ln\u2021 1+rt\n1\u00a1rt\u00b7\n\n(b) jt = argmaxj(dT\n(c) rt = (dT\n(d) \ufb01t = 1\n(e) \u201at+1 = \u201at + \ufb01tejt, where ejt is 1 in position jt and 0 elsewhere.\n\nt M)jt\n\n4. Output: \u201acombined;j = \u201atmax+1;j=k\u201atmax+1k1\n\nThus at each iteration, the distribution dt is computed (Step 3a), classi\ufb01er jt with maximum\nedge is selected (Step 3b), and the weight of that classi\ufb01er is updated (Step 3c, 3d, 3e).\n(Note that wlog one can omit from M all the unused columns.)\nAdaBoost can be reduced to the following iterated map for the dt\u2019s. This map gives a\ndirect relationship between dt and dt+1, taking the normalization of Step 3a into account\nautomatically. Initialize d1;i = 1=m for i = 1; :::; m as in the \ufb01rst iteration of AdaBoost.\nReduced Iterated Map:\n\nt M)j\n\n1. jt = argmaxj(dT\nt M)jt\n2. rt = (dT\n3. dt+1;i = dt;i\n\n1+Mijt rt\n\nfor i = 1; :::; m\n\n\fTo derive this map, consider the iteration de\ufb01ned by AdaBoost and reduce as follows.\n\n1\n2\n\nln(cid:181) 1 + rt\n1 \u00a1 rt\u00b6 , so\n\ne\u00a1(M\u201at)i e\u00a1(Mijt \ufb01t)\ni=1 e\u00a1(M\u201at)i e\u00a1(Mijt\n\nwhere \ufb01t =\n\ndt+1;i =\n\n\ufb01t)\n\n2 Mijt\n\nPm\ne\u00a1(Mijt \ufb01t) = (cid:181) 1 \u00a1 rt\n1 + rt\u00b6 1\ni=1 dt;i\u2021 1\u00a1Mijt\nPm\ndt;i and d\u00a1 = 1 \u00a1 d+. Thus, d+ = 1+rt\n\n=(cid:181) 1 \u00a1 Mijt rt\n1 + Mijt rt\u00b6 1\n2\u2021 1+Mijt rt\nrt\u00b7 1\n1\u00a1Mijt rt\u00b7 1\n\ndt+1;i =\n\n1+Mijt\n\n, thus\n\n=1g\n\ndt;i\n\n.\n\n2\n\nrt\n\n2\n\nDe\ufb01ne d+ =Pfi:Mijt\n\neach i such that Mijt = 1, we \ufb01nd:\n\n2\n\nand d\u00a1 = 1\u00a1rt\n\n2 . For\n\ndt;i\n\ndt;i\n\n1 + rt\n\ndt+1;i =\n\nLikewise, for each i such that Mijt = \u00a11, we \ufb01nd dt+1;i = dt;i\n1\u00a1rt\ncomplete. To check thatPm\ni=1 dt+1;i = 1\n1+rt\n\n+ d\u00a1\n2d\u00a1\n\nd+ + d\u00a1\u2021 1+rt\n\n1\u00a1rt\u00b7 =\ni=1 dt+1;i = 1, we seePm\n\n= 1.\n\nd+\n2d+\n\n. Our reduction is\nd\u00a1 =\n\nd+ + 1\n1\u00a1rt\n\n3 The Dynamics of Low-Dimensional AdaBoost\n\nFirst we will introduce a simple 3\u00a33 input matrix and analyze the convergence of AdaBoost\nin the optimal case. Then we will consider a larger matrix and show that AdaBoost fails to\nconverge to a maximum margin solution in the non-optimal case.\nConsider the following input matrix M =(cid:181) \u00a11\n\u00a11 \u00b6 corresponding to the case of\n\nthree training examples, where each weak classi\ufb01er misclassi\ufb01es one example. (We could\nadd additional hypotheses to M, but these would never be chosen by AdaBoost.) The max-\nimum value of the margin for M is 1=3. How will AdaBoost achieve this result? We are in\nt M)j. Consider the dynamical system on the sim-\nthe optimal case, where jt = argmaxj(dT\ni=1 dt;i = 1, dt;i > 0 8i de\ufb01ned by our reduced map above. In the triangular region\nwith vertices (0; 0; 1); (1=3; 1=3; 1=3); (0; 1; 0), jt will be 1. Similarly, we have regions for\njt = 2 and jt = 3 (see Figure 1(a)). Since dt+1 will always satisfy (dT\nt+1M )jt = 0, the\ndynamics are restricted to the edges of a triangle with vertices (0; 1\n2 ; 0; 1\n2 ); ( 1\n2 ; 0)\nafter the \ufb01rst iteration (see Figure 1(b)).\n\nplexP3\n\n2 ); ( 1\n\n2 ; 1\n\n2 ; 1\n\n1\n\u00a11\n1\n\n1\n1\n\n1\n1\n\n1\n\n1/3\n\nt\n_\nd\n\n \nf\no\n\n \nt\nn\ne\nn\no\np\nm\no\nc\n \n\nd\nn\no\nc\ne\ns\n\nj =1\nt\n\nj =3\nt\n\nj =2\nt\n\n1/3\n\nfirst component of d_t\n\n1\n\nt\n_\nd\n\n \nf\no\n\n \nt\nn\ne\nn\no\np\nm\no\nc\n \n\n1/2\n\nd\nn\no\nc\ne\ns\n\nT\n\n(d M) =0\n\n3\n\nT\n\n(d M) =0\n\n2\n\nT\n\n(d M) =0\n\n1\n\n1\n\n1/2\n\n1\n\nfirst component of d_t\n\nFigure 1: (a-Left) Regions of dt-space where classi\ufb01ers jt = 1; 2; 3 will respectively be selected.\n(b-Right) All weight vectors d2; :::; dtmax are restricted to lie on the edges of the inner triangle.\n\n\f(0,.5,.5)\n\n(0,.5,.5)\n\nl\n\ne\ng\nn\na\ni\nr\nt\n \n\nl\n\ng\nn\no\na\nn\no\n\n \n\n(.5,.0,.5)\n\n(.5,.5,0)\n\n(.5,.5,0)\n\n(.5,.0,.5)\n\nposition along triangle\n\ni\nt\ni\ns\no\np\n\n(0,.5,.5)\n\n(0,.5,.5)\n\n0.6\n\nr\no\n\nt\nc\ne\nv\n \nt\n\ni\n\nh\ng\ne\nw\n\n \nf\n\no\n\n \nt\n\nn\ne\nn\no\np\nm\no\nc\n \n\nd\nn\no\nc\ne\ns\n\nl\n\ne\ng\nn\na\ni\nr\nt\n \n\nl\n\ng\nn\no\na\nn\no\n\n \n\n(.5,0,.5)\n\n(.5,.5,0)\n\n(.5,.5,0)\n\n(.5,.0,.5)\n\n(0,.5,.5)\n\nposition along triangle\n\ni\nt\ni\ns\no\np\n\n(0,.5,.5)\n\n(0,.5,.5) \n\n(0,.5,.5)\n\n0.5\n\nr\no\nt\nc\ne\nv\n \nt\n\ni\n\nh\ng\ne\nw\n\n \nf\n\no\n\n \nt\n\nn\ne\nn\no\np\nm\no\nc\n \n\nd\nn\no\nc\ne\ns\n\n0.15\n\n0.15 \n\nfirst component of weight vector\n\n0.55\n\n0\n0\n\nfirst component of weight vector\n\n0.5\n\nFigure 2: (a-Upper Left) The iterated map on the unfolded triangle. Both axes give coordinates on\nthe edges of the inner triangle in Figure 1(b). The plot shows where dt+1 will be, given dt. (b-Upper\nRight) The map from (a) iterated twice, showing where dt+3 will be, given dt. There are 6 stable\n\ufb01xed points, 3 for each cycle. (c-Lower Left) 50 timesteps of AdaBoost showing convergence of\ndt\u2019s to a cycle. Small rings indicate earlier timesteps of AdaBoost, while larger rings indicate later\ncyc, and d(3)\ntimesteps. There are many concentric rings at positions d(1)\ncyc. (d-Lower Right) 500\ntimesteps of AdaBoost on a random 11x21 matrix. The axes are dt;1 vs dt;2.\n\ncyc, d(2)\n\n4\n\n4\n\n4\n\n(1)T\n\n(1)T\n\n;\n\n4\n\np5\u00a11\n\n4\n\n;\n\np5\u00a11\n\n; 1\n2 ), d\n\nOn this reduced 1-dimensional phase space, the iterated map has no stable \ufb01xed points or\norbits of length 2. However, consider the following periodic orbit of length 3:\ncyc = ( 3\u00a1p5\n2 ; 3\u00a1p5\n(2)T\ncyc = ( 1\n). This\nd\n(1)\ncyc, AdaBoost will choose jt = 1. Then r1 =\nis clearly a cycle, since starting from d\ncyc M)1 = (p5 \u00a1 1)=2. Now, computing d\n(2)\ncyc.\n(d\nIn this way, AdaBoost will cycle between hypotheses j = 1; 2; 3; 1; 2; 3; etc. There is\np5\u00a11\n(3)T\nin fact another 3-cycle, d\ncyc0 =\np5\u00a11\n; 1\n2 ). To \ufb01nd these cycles, we hypothesized only that a cycle of length 3 exists,\n(\nvisiting each hypothesis in turn, and used the reduced equations from Section 2 to solve for\nthe cycle coordinates.\n\n(1)\ncyc;i=(1 + Mi;1r1) for each i yields d\n\ncyc0 = ( 3\u00a1p5\n\n(2)T\ncyc0 = ( 1\n2 ;\n\n2 ; 3\u00a1p5\n; 1\n\n4\n\n(3)T\ncyc = (\n\n; 3\u00a1p5\n\n; 3\u00a1p5\n\np5\u00a11\n\np5\u00a11\n\n; 1\n2 ;\n\n), d\n\n), d\n\n), d\n\n(1)T\n\n4\n\n4\n\n4\n\n4\n\n4\n\n4\n\nWe give the following outline of the proof for global stability: This map is a contrac-\ntion, so any small perturbation from the cycle will diminish, yielding local stability of\nthe cycles. One only needs to consider the one-dimensional map de\ufb01ned on the un-\nfolded triangle, since within one iteration every trajectory lands on the triangle. This\nmap and its iterates are piecewise continuous and monotonic in each piece, so one can\n\ufb01nd exactly where each interval will be mapped (see Figure 2(a)). Consider the sec-\nond iteration of this map (Figure 2(b)). One can break the unfolded triangle into in-\ntervals and \ufb01nd the region of attraction of each \ufb01xed cycle;\nin fact the whole trian-\ngle is the union of both regions of attraction. The convergence to one of these two 3-\ncycles is very fast; Figure 2(b) shows that the absolute slope of the second iterated map\nat the \ufb01xed points is much less than 1. The combined classi\ufb01er AdaBoost will output\n(3)T\ncyc M)3)=normaliz. = (1=3; 1=3=1=3), and\nis: \u201acombined = ((d\nsince mini(M\u201acombined)i = 1=3 AdaBoost produces a maximal margin solution.\n\n(1)T\ncyc M)1; (d\n\n(2)T\ncyc M)2; (d\n\n\fThis 3 \u00a3 3 case can be generalized to m classi\ufb01ers, each having one misclassi\ufb01ed training\nexample; in this case there will be periodic cycles of length m, and the contraction will\nalso persist (the cycles will be stable). We note that for every low-dimensional case we\ntried, periodic cycles of larger lengths seem to exist (such as in Figure 2(d)), but that the\ncontraction at each iteration does not, so it is harder to show stability.\n\n1\n\u00a11\n1\n1\n\n1\n1\n\u00a11\n1\n\n1\n1\n1\n\u00a11\n\n\u00a11\n\u00a11\n1\n\n1\n1\n1\n\nNow, we give an example to show that non-optimal AdaBoost does not necessarily con-\nverge to a maximal margin solution. Consider the following input matrix (again, omitting\n\nunused columns): M = \u02c6 \u00a11\n1 !. For this matrix, the maximal mar-\ngin \u2030 is 1/2. In the optimal case, AdaBoost will produce this value by cycling among the\n\ufb01rst four columns of M. Recall that in the non-optimal case jt 2 fj : (dT\nt M)j \u201a \u2030g.\n; 3\u00a1p5\np5\u00a11\n; 1\nConsider the following initial condition for the dynamics: dT\n).\n2 ;\n1 M)5 > \u2030, we are justi\ufb01ed in choosing j1 = 5, although here it is not the optimal\nSince (dT\n1 M)4 > \u2030 for\nchoice. Another iteration yields dT\nwhich we choose j2 = 4. At the third iteration, we choose j3 = 3, and at the fourth iteration\nwe \ufb01nd d4 = d1. This cycle is the same cycle as in our previous example (although there\nis one extra dimension). There is actually a whole manifold of 3-cycles in this non-optimal\ncase, since ~d1\n. In any\ncase, the value of the margin produced by this cycle is 1/3, not 1/2.\n\n) lies on a cycle for any \u2020, 0 \u2022 \u2020 \u2022 3\u00a1p5\n\n:= (\u2020; 3\u00a1p5\n\n1 = ( 3\u00a1p5\n\n8\n\np5\u00a11\n\n; 3\u00a1p5\n\n4 \u00a1 \u2020; 1\n2 ;\n\n), satisfying (dT\n\n2 = ( 1\n\n4 ; 1\n4 ;\n\np5\u00a11\n\n4\n\n8\n\n4\n\n4\n\n4\n\nT\n\n4\n\nWe have thus established that AdaBoost is not robust in the sense we described; if the\nweak learner is not required to choose the optimal hypothesis at each iteration, but is only\nt M)j \u201a \u2030g, then a\nrequired to choose a suf\ufb01ciently good weak classi\ufb01er jt 2 fj : (dT\nmaximum margin solution will not necessarily be attained. In practice, it may be possible\nfor AdaBoost to converge to a maximum margin solution when hypotheses are chosen to be\nonly slightly non-optimal; however the notion of non-optimal we are using is a very natural\nnotion, and we have shown that AdaBoost may not converge to \u2030 here. Note that for some\nmatrices M, a maximum margin solution may still be attained in the non-optimal case (for\nexample the simple 3\u00a33 matrix we analyzed above), but it is not attained in general as\nshown by our example. We are not saying that the only way for AdaBoost to converge to\na non-optimal solution is to fall into the wrong cycle; there may be many other non-cyclic\nways for the algorithm to fail to converge to a maximum margin solution. Also note that\nfor the other algorithms mentioned in Section 1 and for the new algorithms in Section 4,\nthere are \ufb01xed points rather than periodic orbits.\n\n4 Coordinate Ascent for Maximum Margins\n\nAdaBoost can be interpreted as an algorithm based on coordinate descent. There are other\nalgorithms such as AdaBoost\u2044 and arc-gv that attempt to maximize the margin explicitly,\nbut these are not based on coordinate descent. We now suggest a boosting algorithm that\naims to maximize the margin explicitly (like arc-gv and AdaBoost\u2044) yet is based on co-\nordinate ascent. An important note is that AdaBoost and our new algorithm choose the\nt M)j.\ndirection of descent/ascent (value of jt) using the same formula, jt = argmaxj(dT\nThis lends further credence to the conjecture that AdaBoost maximizes the margin in the\noptimal case, since the direction AdaBoost chooses is the same direction one would choose\nto maximize the margin directly via coordinate ascent.\n\nThe function that AdaBoost minimizes via coordinate descent is F (\u201a) = Pm\ni=1 e\u00a1(M\u201a)i.\nConsider any \u201a such that (M\u201a)i > 0 8i. Then lima!1 a\u201a will minimize F , yet the origi-\nnal normalized \u201a might not yield a maximum margin. So it must be the process of coordi-\nnate descent which awards AdaBoost its ability to increase margins, not simply AdaBoost\u2019s\nability to minimize F . Now consider a different function (which bears a resemblance to an\n\n\f\u2020-Boosting objective in [7]):\n\nG(\u201a) = \u00a1\n\n1\nk\u201ak1\n\nln F (\u201a) = \u00a1\n\n1\nk\u201ak1\n\nln\u02c6 mXi=1\n\ne\u00a1(M\u201a)i! where k\u201ak1 :=\n\n\u201aj :\n\nnXj=1\n\nIt can be veri\ufb01ed that G has many nice properties, e.g., G is a concave function for each\n\ufb01xed value of k\u201ak1, whose maximum only occurs in the limit as k\u201ak1 ! 1, and more\nimportantly, as k\u201ak1 ! 1 we have G(\u201a) ! \u201e(\u201a), where \u201e(\u201a) = (mini(M\u201a)i)=k\u201ak1,\nthe margin of \u201a. That is,\n\nme\u00a1\u201e(\u201a)k\u201ak1 \u201a Pm\n\n\u00a1(ln m)=k\u201ak1 + \u201e(\u201a) \u2022\n\ni=1 e\u00a1(M\u201a)i > e\u00a1\u201e(\u201a)k\u201ak1\nG(\u201a)\n\n< \u201e(\u201a)\n\n(1)\n(2)\n\nFor (1), the \ufb01rst inequality becomes equality only when all m examples achieve the same\nminimal margin, and the second inequality holds since we took only one term. Rather than\nperforming coordinate descent on F as in AdaBoost, let us perform coordinate ascent on\nG. The choice of direction jt at iteration t is:\n\n\ufb02\ufb02\ufb02\ufb01=0\n\nargmax\n\nj\n\ndG(\u201at + \ufb01ej)\n\nd\ufb01\n\n= argmax\n\nj\n\n\u2022Pm\n\ni=1 e\u00a1(M\u201at)i Mij\nF (\u201at)k\u201atk1\n\n\u201a +\n\n1\n\nk\u201atk2\n\n1\n\nln(F (\u201at)):\n\nt M)j. Thus the same direction will be chosen here as for AdaBoost.\n\nOf these two terms, the second term does not depend on j, and the \ufb01rst term is proportional\nto (dT\nNow consider the distance to travel along this direction. Ideally, we would like to maximize\nG(\u201at + \ufb01ejt ) with respect to \ufb01, i.e., we would like:\n\n0 =\n\ndG(\u201at + \ufb01ejt )\n\nd\ufb01\n\nk\u201at+1k1 = Pm\n\ni=1 e\u00a1(M\u201at)i e\u00a1Mijt \ufb01Mijt\n\nF (\u201at + \ufb01ejt )\n\n\u00a1 G(\u201at + \ufb01ejt )\n\nThere is not an analytical solution for \ufb01, but maximization of G(\u201at+\ufb01ejt ) is 1-dimensional\nso it can be performed quickly. An approximate coordinate ascent algorithm which avoids\nthis line search is the following approximation to this maximization problem:\n\ni=1 e\u00a1(M\u201at)i e\u00a1Mijt \ufb01Mijt\n\nF (\u201at + \ufb01ejt )\n\n\u00a1 G(\u201at):\n\nWe can solve for \ufb01t analytically:\n\n0 \u2026 Pm\nln(cid:181) 1 + rt\n1 \u00a1 rt\u00b6 \u00a1\n\n\ufb01t =\n\n1\n2\n\n1\n2\n\nln(cid:181) 1 + gt\n1 \u00a1 gt\u00b6 , where gt = maxf0; G(\u201at)g:\n\n(3)\n\nConsider some properties of this iteration scheme. The update for \ufb01t is strictly positive (in\nthe case of positive margins) due to the Von Neumann min-max theorem and equation (2),\n\nmini (Me\u201a)i \u201a mini (M\u201at)i=k\u201atk1\nthat is: rt \u201a \u2030 = mind2\u00a2m maxj (dT M)j = maxe\u201a2\u00a2n\n> G(\u201at), and thus \ufb01t > 0 8t. We have preliminary proofs that the value of G increases\nat each iteration of our approximate coordinate ascent algorithm, and that our algorithms\nconverge to a maximum margin solution, even in the non-optimal case.\n\nOur new update (3) is less aggressive than AdaBoost\u2019s, but slightly more aggressive than\narc-gv\u2019s. The other algorithm we mention, AdaBoost\u2044, has a different sort of update. It\nconverges to a combined classi\ufb01er attaining a margin inside the interval [\u2030 \u00a1 \u201d; \u2030] within\n2(log2 m)=\u201d2 steps, but does not guarantee asymptotic convergence to \u2030 for a \ufb01xed \u201d.\nThere are many other boosting algorithms, but some of them require minimization over\nnon-convex functions; here, we choose to compare with the simple updates of AdaBoost\n(due to its fast convergence rate), AdaBoost\u2044, and arc-gv. AdaBoost, arc-gv, and our\nalgorithm have initially large updates, based on a conservative estimate of the margin.\nAdaBoost\u2044\u2019s updates are initially small based on an estimate of the edge.\n\n\f0.65\n\ni\n\nn\ng\nr\na\nM\n\n0.5\n\narc\u2212gv, approximate \ncoord ascent, \nand coord ascent \n\nAdaBoost \n\narc\u2212gv \n\napproximate \ncoord. ascent \nand coord. \nascent \n\nAdaBoost* \n\nAdaBoost \n\n0.16\n\ni\n\nn\ng\nr\na\nM\n\n0.13\n\n0.4\n0\n\nIterations\n\n20\n\n150\n\n1100\n\n0.1 \n\n90\n\nAdaBoost \n\napproximate\ncoordinate \nascent \n\narc\u2212gv\n\n400\n\nIterations\n\n1800\n\n10000\n\nFigure 3: (a-Left) Performance of all algorithms in the optimal case on a random 11 \u00a3 21 input\nmatrix (b-Right) AdaBoost, arc-gv, and approximate coordinate ascent on synthetic data.\nFigure 3(a) shows the performance of AdaBoost, arc-gv, AdaBoost\u2044 (parameter \u201d set to\n:001), approximate coordinate ascent, and coordinate ascent on G (with a line search for\n\ufb01t at every iteration) on a reduced randomly generated 11\u00a3 21 matrix, in the optimal case.\nAdaBoost settles into a cycle (as shown in Figure2(d)), so its updates remain consistently\nlarge, causing k\u201atk1 to grow faster, thus converge faster with respect to G. The values\nof rt in the cycle happen to produce an optimal margin solution, so AdaBoost quickly\nconverges to this solution. The approximate coordinate ascent algorithm has slightly less\naggressive updates than AdaBoost, and is very closely aligned with coordinate ascent; arc-\ngv is slower. AdaBoost\u2044 has a more methodical convergence rate; convergence is initially\nslower but speeds up later. Arti\ufb01cial test data for Figure 3(b) was designed as follows:\n50 example points were constructed randomly such that each xi lies on a corner of the\nxi(k)), where xi(k) indicates the kth\ncomponent of xi. The jth weak learner is hj(x) = x(j), thus Mij = yixi(j). As expected,\nthe convergence rate of approximate coordinate ascent falls between AdaBoost and arc-gv.\n\nhypercube f\u00a11; 1g100. We set yi = sign(P11\n\nk=1\n\n5 Conclusions\n\nWe have used the nonlinear iterated map de\ufb01ned by AdaBoost to understand its update rule\nin low-dimensional cases and uncover cyclic dynamics. We produced an example to show\nthat AdaBoost does not necessarily maximize the margin in the non-optimal case. Then, we\nintroduced a coordinate ascent algorithm and an approximate coordinate ascent algorithm\nthat aim to maximize the margin directly. Here, the direction of ascent agrees with the\ndirection chosen by AdaBoost and other algorithms. It is an open problem to understand\nthese dynamics in other cases.\nReferences\n[1] Robert E. Schapire. A brief introduction to boosting. In Proceedings of the Sixteenth Interna-\n\ntional Joint Conference on Arti\ufb01cial Intelligence, 1999.\n\n[2] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new\nexplanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651\u20131686,\nOctober 1998.\n\n[3] Gunnar R\u00a8atsch and Manfred Warmuth. Maximizing the margin with boosting. In Proceedings\n\nof the 15th Annual Conference on Computational Learning Theory, pages 334\u2013350, 2002.\n\n[4] Gunnar R\u00a8atsch and Manfred Warmuth. Ef\ufb01cient margin maximizing with boosting. Journal of\n\nMachine Learning Research, submitted 2002.\n\n[5] Leo Breiman. Prediction games and arcing classi\ufb01ers. Neural Computation, 11(7):1493\u20131517,\n\n1999.\n\n[6] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and\n\nBregman distances. Machine Learning, 48(1/2/3), 2002.\n\n[7] Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin\n\nclassi\ufb01er. Technical report, Department of Statistics, Stanford University, 2003.\n\n\f", "award": [], "sourceid": 2535, "authors": [{"given_name": "Cynthia", "family_name": "Rudin", "institution": null}, {"given_name": "Ingrid", "family_name": "Daubechies", "institution": null}, {"given_name": "Robert", "family_name": "Schapire", "institution": null}]}