{"title": "Online multiclass boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 919, "page_last": 928, "abstract": "Recent work has extended the theoretical analysis of boosting algorithms to multiclass problems and to online settings. However, the multiclass extension is in the batch setting and the online extensions only consider binary classification. We fill this gap in the literature by defining, and justifying, a weak learning condition for online multiclass boosting. This condition leads to an optimal boosting algorithm that requires the minimal number of weak learners to achieve a certain accuracy. Additionally, we propose an adaptive algorithm which is near optimal and enjoys an excellent performance on real data due to its adaptive property.", "full_text": "Online Multiclass Boosting\n\nYoung Hun Jung\n\nJack Goetz\n\nDepartment of Statistics\nUniversity of Michigan\nAnn Arbor, MI 48109\n\n{yhjung, jrgoetz, tewaria}@umich.edu\n\nAmbuj Tewari\n\nAbstract\n\nRecent work has extended the theoretical analysis of boosting algorithms to multi-\nclass problems and to online settings. However, the multiclass extension is in the\nbatch setting and the online extensions only consider binary classi\ufb01cation. We \ufb01ll\nthis gap in the literature by de\ufb01ning, and justifying, a weak learning condition for\nonline multiclass boosting. This condition leads to an optimal boosting algorithm\nthat requires the minimal number of weak learners to achieve a certain accuracy.\nAdditionally, we propose an adaptive algorithm which is near optimal and enjoys\nan excellent performance on real data due to its adaptive property.\n\n1\n\nIntroduction\n\nBoosting methods are a ensemble learning methods that aggregate several (not necessarily) weak\nlearners to build a stronger learner. 
When used to aggregate reasonably strong learners, boosting has been shown to produce results competitive with other state-of-the-art methods (e.g., Korytkowski et al. [1], Zhang and Wang [2]). Until recently, theoretical development in this area was focused on the batch binary setting, where the learner can observe the entire training set at once and the labels are restricted to be binary (cf. Schapire and Freund [3]). In the past few years, progress has been made in extending the theory and algorithms to more general settings.

Dealing with multiclass classification turned out to be more subtle than initially expected. Mukherjee and Schapire [4] unify several different proposals made earlier in the literature and provide a general framework for multiclass boosting. They state their weak learning conditions in terms of cost matrices that have to satisfy certain restrictions: for example, labeling with the ground truth should incur less cost than labeling with any other label. A weak learning condition, just like the binary condition, states that the performance of a learner, now judged using a cost matrix, should be better than a random-guessing baseline. One particular condition, which they call the edge-over-random condition, proves to be sufficient for boostability. The edge-over-random condition will also figure prominently in this paper. They also consider a condition that is both necessary and sufficient for boostability, but it turns out to be computationally intractable to use in practice.

A recent trend in modern machine learning is to train learners in an online setting, where instances arrive sequentially and the learner has to make predictions instantly. Oza [5] initially proposed an online boosting algorithm whose accuracy is comparable with the batch version, but it took several years to design an algorithm with theoretical justification (Chen et al. [6]). Beygelzimer et al.
[7] achieved a breakthrough by proposing an optimal algorithm in the online binary setting along with an adaptive algorithm that works quite well in practice. These advances in online binary boosting have led to several extensions. For example, Chen et al. [8] combine the one-vs-all method with binary boosting algorithms to tackle online multiclass problems with bandit feedback, and Hu et al. [9] build a theory of boosting in the regression setting.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper, we combine the insights and techniques of Mukherjee and Schapire [4] and Beygelzimer et al. [7] to provide a framework for online multiclass boosting. The cost matrix framework from the former work is adopted to propose an online weak learning condition that defines how well a learner can perform relative to a random guess (Definition 1). We show this condition is naturally derived from its batch-setting counterpart. From this weak learning condition, a boosting algorithm (Algorithm 1) is proposed which is theoretically optimal in that it requires the minimal number of learners and sample complexity to attain a specified level of accuracy. We also develop an adaptive algorithm (Algorithm 2) which allows learners to have variable strengths. This algorithm is theoretically less efficient than the optimal one, but the experimental results show that it is quite comparable and sometimes even better due to its adaptive property. Both algorithms not only possess theoretical proofs of mistake bounds, but also demonstrate superior performance over preexisting methods.

2 Preliminaries

We first describe the basic setup for online boosting. While in the batch setting an additional weak learner is trained at every iteration, in the online setting the algorithm starts with a fixed count of N weak learners and a booster which manages the weak learners. There are k possible labels [k] := {1, ..., k}, and k is known to the learners. At each iteration t = 1, ..., T, an adversary picks a labeled example (x_t, y_t) ∈ X × [k], where X is some domain, and reveals x_t to the booster. Once the booster observes the unlabeled data x_t, it gathers the weak learners' predictions and makes a final prediction. Throughout this paper, the index i takes values from 1 to N; t from 1 to T; and l from 1 to k.

We utilize the cost matrix framework, first proposed by Mukherjee and Schapire [4], to develop multiclass boosting algorithms. This is a key ingredient in the multiclass extension as it enables different penalization for each pair of correct label and prediction, and we further develop this framework to suit the online setting. The booster sequentially computes cost matrices {C^i_t ∈ R^{k×k} | i = 1, ..., N}, sends (x_t, C^i_t) to the i-th weak learner WL^i, and gets its prediction l^i_t ∈ [k]. Here the cost matrix C^i_t plays the role of a loss function in that WL^i tries to minimize the cumulative cost Σ_t C^i_t[y_t, l^i_t]. As the booster wants each learner to predict the correct label, it wants to set the diagonal entries of C^i_t to be minimal within their rows. At this stage, the true label y_t is not revealed yet, but the previous weak learners' predictions can affect the computation of the cost matrix for the next learner. Given a matrix C, the (i, j)-th entry will be denoted by C[i, j], and the i-th row vector by C[i].

Once all the learners make predictions, the booster makes the final prediction ŷ_t by majority votes. The booster can either take simple majority votes or weighted ones.
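As a concrete illustration, one round of the protocol above can be sketched as follows. The booster and weak-learner objects and their methods are hypothetical interfaces, not part of the paper; the sketch only fixes the order of play (cost matrices are computed sequentially, and the final prediction here uses simple majority votes):

```python
import numpy as np

def boosting_round(booster, weak_learners, x_t, k):
    """One round of the online multiclass boosting protocol (sketch).

    `booster.cost_matrix(i, preds)` is a hypothetical method returning the
    k x k cost matrix C^i_t, which may depend on the predictions of the
    earlier weak learners; `wl.predict` returns a label in {0, ..., k-1}.
    """
    preds = []
    for i, wl in enumerate(weak_learners):
        C_i = booster.cost_matrix(i, preds)   # cost matrix for learner i
        preds.append(wl.predict(x_t, C_i))    # weak prediction l^i_t
    votes = np.bincount(preds, minlength=k)   # simple (unweighted) votes
    return int(np.argmax(votes)), preds
```

The true label y_t would be revealed only after this function returns, at which point the booster computes weights and passes the labeled example back to the learners.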
In fact, for the adaptive algorithm, we will allow weighted votes so that the booster can assign more weight to well-performing learners. The weight for WL^i at iteration t will be denoted by α^i_t. After observing the booster's final decision, the adversary reveals the true label y_t, and the booster suffers the 0-1 loss 1(ŷ_t ≠ y_t). The booster also shares the true label with the weak learners so that they can train on this data point.

Two main issues have to be resolved to design a good boosting algorithm. First, we need to design the booster's strategy for producing cost matrices. Second, we need to quantify a weak learner's ability to reduce the cumulative cost Σ_{t=1}^T C^i_t[y_t, l^i_t]. The first issue will be resolved by introducing potential functions, which will be thoroughly discussed in Section 3.1. For the second issue, we introduce our online weak learning condition, a generalization of the weak learning assumption in Beygelzimer et al. [7], stating that for any adaptively given sequence of cost matrices, weak learners can produce predictions whose cumulative cost is less than that incurred by random guessing. The online weak learning condition will be discussed in the following section. For the analysis of the adaptive algorithm, we use empirical edges instead of the online weak learning condition.

2.1 Online weak learning condition

In this section, we propose an online weak learning condition stating that the weak learners are better than a random guess. We first define a baseline that improves upon random guessing. Let Δ[k] denote the family of distributions over [k], and let u^l_γ ∈ Δ[k] be the uniform distribution with γ extra weight on the label l. For example, u^1_γ = ((1−γ)/k + γ, (1−γ)/k, ..., (1−γ)/k).
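This edge-over-random distribution is simple to write down explicitly; a small helper of our own (0-indexed labels) makes the construction concrete:

```python
def edge_over_random(l, k, gamma):
    """u^l_gamma: weight (1 - gamma)/k on every label, plus an extra
    gamma on label l (0-indexed), so the entries sum to 1."""
    u = [(1.0 - gamma) / k] * k
    u[l] += gamma
    return u
```

For instance, with k = 4 and γ = 0.2, label l receives 0.4 while every other label receives 0.2.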
For a given sequence of examples {(x_t, y_t) | t = 1, ..., T}, let U_γ ∈ R^{T×k} be the matrix whose t-th row is u^{y_t}_γ. Then we restrict the booster's choice of cost matrices to

C^{eor}_1 := {C ∈ R^{k×k} | ∀ l, r ∈ [k]: C[l, l] = 0, C[l, r] ≥ 0, and ‖C[l]‖₁ = 1}.

Note that the diagonal entries are minimal within their rows, and C^{eor}_1 also has a normalization constraint. A broader choice of cost matrices is allowed if one can assign importance weights to observations, which is possible for various learners. Even if the learner does not take an importance weight as an input, we can achieve a similar effect by sending the learner an instance with probability proportional to its weight. Interested readers can refer to Beygelzimer et al. [7, Lemma 1]. From now on, we will assume that our weak learners can take a weight w_t as an input.

We are ready to present our online weak learning condition. This condition is in fact naturally derived from its batch-setting counterpart, which is well studied by Mukherjee and Schapire [4]. The link is thoroughly discussed in Appendix A. To avoid scaling issues, we assume the weights w_t lie in [0, 1].

Definition 1. (Online multiclass weak learning condition) For parameters γ, δ ∈ (0, 1) and S > 0, a pair of an online learner and an adversary is said to satisfy the online weak learning condition with parameters δ, γ, and S if for any sample length T, any adaptive sequence of labeled examples, and any adaptively chosen series of pairs of weight and cost matrix {(w_t, C_t) ∈ [0, 1] × C^{eor}_1 | t = 1, ..., T}, the learner can generate predictions ŷ_t such that with probability at least 1 − δ,

Σ_{t=1}^T w_t C_t[y_t, ŷ_t] ≤ C • U′_γ + S = ((1−γ)/k) ‖w‖₁ + S,   (1)

where C ∈ R^{T×k} consists of rows w_t C_t[y_t], A • B′ denotes the Frobenius inner product Tr(AB′), and w = (w_1, ..., w_T). The last equality holds due to the normalization constraint on C^{eor}_1. γ is called an edge, and S an excess loss.

Remark. Notice that this condition is imposed on a pair of a learner and an adversary instead of solely on a learner. This is because no learner can satisfy this condition if the adversary draws samples in a completely adaptive manner. The probabilistic statement is necessary because many online algorithms' predictions are not deterministic. The excess loss requirement is needed since an online learner cannot produce meaningful predictions before observing a sufficient number of examples.

3 An optimal algorithm

In this section, we describe the booster's optimal strategy for designing cost matrices. We first introduce a general theory without specifying the loss, and later investigate the asymptotic behavior of the cumulative loss suffered by our algorithm under the specific 0-1 loss. We adopt the potential function framework from Mukherjee and Schapire [4] and extend it to the online setting.
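As an aside, inequality (1) can be checked empirically on one logged run; the sketch below is our own helper (a single run can only falsify, never certify, the condition, which quantifies over all adaptive sequences):

```python
import numpy as np

def satisfies_condition_on_run(costs, weights, labels, preds, gamma, S, k):
    """Check inequality (1) for one observed sequence.

    costs[t] is C_t in C^eor_1 (zero diagonal, non-negative entries, rows of
    l1 norm 1), weights[t] = w_t in [0, 1], labels[t] = y_t, preds[t] = yhat_t.
    By the normalization of C^eor_1, the right-hand side C . U'_gamma + S
    collapses to (1 - gamma)/k * ||w||_1 + S.
    """
    lhs = sum(w * C[y, p] for w, C, y, p in zip(weights, costs, labels, preds))
    rhs = (1.0 - gamma) / k * float(np.sum(weights)) + S
    return lhs <= rhs
```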
Potential functions help both in designing cost matrices and in proving the mistake bound of the algorithm.

3.1 A general online multiclass boost-by-majority (OnlineMBBM) algorithm

We will keep track of the weighted cumulative votes of the first i weak learners for the sample x_t by s^i_t := Σ_{j=1}^i α^j_t e_{l^j_t}, where α^i_t is the weight of WL^i, l^i_t is its prediction, and e_j is the j-th standard basis vector. For the optimal algorithm, we assume that α^i_t = 1 for all i and t. In other words, the booster makes the final decision by simple majority votes. Given a cumulative vote s ∈ R^k, suppose we have a loss function L^r(s), where r denotes the correct label. We call a loss function proper if it is a decreasing function of s[r] and an increasing function of the other coordinates (we alert the reader that "proper loss" has at least one other meaning in the literature). From now on, we will assume that our loss function is proper. A good example of a proper loss is the multiclass 0-1 loss:

L^r(s) := 1(max_{l ≠ r} s[l] ≥ s[r]).   (2)

The purpose of the potential function φ^r_i(s) is to estimate the booster's loss when there remain i learners until the final decision and the current cumulative vote is s. More precisely, we want potential functions to satisfy the following conditions:

φ^r_0(s) = L^r(s),
φ^r_{i+1}(s) = E_{l ∼ u^r_γ} φ^r_i(s + e_l).   (3)

Algorithm 1 Online Multiclass Boost-by-Majority (OnlineMBBM)
1: for t = 1, ..., T do
2:   Receive example x_t
3:   Set s^0_t = 0 ∈ R^k
4:   for i = 1, ..., N do
5:     Set the normalized cost matrix D^i_t according to (5) and pass it to WL^i
6:     Get the weak prediction l^i_t = WL^i(x_t) and update s^i_t = s^{i−1}_t + e_{l^i_t}
7:   end for
8:   Predict ŷ_t := argmax_l s^N_t[l] and receive the true label y_t
9:   for i = 1, ..., N do
10:     Set w^i[t] = Σ_{l=1}^k [φ^{y_t}_{N−i}(s^{i−1}_t + e_l) − φ^{y_t}_{N−i}(s^{i−1}_t + e_{y_t})]
11:     Pass the training example with weight (x_t, y_t, w^i[t]) to WL^i
12:   end for
13: end for

Readers should note that φ^r_i(s) also inherits the proper property of the loss function, which can be shown by induction. The conditions (3) can be loosened by replacing both equalities with inequalities "≥", but in practice we usually use equalities.

Now we describe the booster's strategy for designing cost matrices. After observing x_t, the booster sequentially sets a cost matrix C^i_t for WL^i, gets the weak learner's prediction l^i_t, and uses this in the computation of the next cost matrix C^{i+1}_t. Ultimately, the booster wants to set

C^i_t[r, l] = φ^r_{N−i}(s^{i−1}_t + e_l).   (4)

However, this cost matrix does not satisfy the conditions of C^{eor}_1, and thus should be modified in order to utilize the weak learning condition. First, to make the cost for the true label equal to 0, we subtract C^i_t[r, r] from every element of C^i_t[r]. Since the potential function is proper, our new cost matrix still has non-negative elements after the subtraction.
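For intuition, the recursion (3) under the 0-1 loss (2) can be evaluated exactly by dynamic programming; the helper below is our own sketch (feasible for small i and k, with γ passed in explicitly):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def potential(i, s, r, gamma, k):
    """phi^r_i(s) from recursion (3) under the 0-1 loss (2).

    s is a tuple of k cumulative vote counts.  The expectation is over one
    more vote drawn from u^r_gamma, which puts (1 - gamma)/k mass on every
    label plus an extra gamma on the correct label r (0-indexed)."""
    if i == 0:  # base case: phi^r_0(s) = L^r(s); ties count as mistakes
        return float(max(s[l] for l in range(k) if l != r) >= s[r])
    base = (1.0 - gamma) / k
    total = 0.0
    for l in range(k):
        p = base + (gamma if l == r else 0.0)
        bumped = s[:l] + (s[l] + 1,) + s[l + 1:]
        total += p * potential(i - 1, bumped, r, gamma, k)
    return total
```

For example, with k = 2 and γ = 0.5, one remaining learner starting from a tied vote gives φ⁰₁((0,0)) = 0.25: the random vote lands on the wrong label with probability 0.25.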
We then normalize each row so that it has ℓ₁ norm equal to 1. In other words, we get the new normalized cost matrix

D^i_t[r, l] = (φ^r_{N−i}(s^{i−1}_t + e_l) − φ^r_{N−i}(s^{i−1}_t + e_r)) / w^i[t],   (5)

where w^i[t] := Σ_{l=1}^k [φ^r_{N−i}(s^{i−1}_t + e_l) − φ^r_{N−i}(s^{i−1}_t + e_r)] plays the role of a weight. It is still possible that a row vector C^i_t[r] is a zero vector, so that normalization is impossible. In this case, we just leave it as a zero vector. Our weak learning condition (1) still works with cost matrices some of whose row vectors are zeros because, however the learner predicts, it incurs no cost.

After defining the cost matrices, the rest of the algorithm is straightforward except that we have to estimate ‖w^i‖_∞ to normalize the weights. This is necessary because the weak learning condition assumes weights lying in [0, 1]. We cannot compute the exact value of ‖w^i‖_∞ until the last instance is revealed, which is fine as we need this value only in proving the mistake bound. The estimate w^{i*} for ‖w^i‖_∞ requires specifying the loss, and we postpone the technical parts to Appendix B.2. Interested readers may directly refer to Lemma 10 before proceeding. Once the learners generate predictions after observing the cost matrices, the final decision is made by simple majority votes. After the true label is revealed, the booster updates the weights and sends the labeled instance with its weight to the weak learners. The pseudocode for the entire algorithm is given in Algorithm 1. The algorithm is named after Beygelzimer et al. [7, OnlineBBM], which is in fact OnlineMBBM with binary labels.

We present our first main result regarding the mistake bound of general OnlineMBBM. The proof appears in Appendix B.1, where the main idea is adopted from Beygelzimer et al. [7, Lemma 3].

Theorem 2. (Cumulative loss bound for OnlineMBBM) Suppose the weak learners and an adversary satisfy the online weak learning condition (1) with parameters δ, γ, and S. For any T and N satisfying δ ≪ 1/N, and any adaptive sequence of labeled examples generated by the adversary, the final loss suffered by OnlineMBBM satisfies the following inequality with probability 1 − Nδ:

Σ_{t=1}^T L^{y_t}(s^N_t) ≤ φ^1_N(0) T + S Σ_{i=1}^N w^{i*}.   (6)

Here φ^1_N(0) plays the role of an asymptotic error rate and the second term determines the sample complexity. We will investigate the behavior of these terms under the 0-1 loss in the following section.

3.2 Mistake bound under 0-1 loss and its optimality

From now on, we will specify the loss to be the multiclass 0-1 loss defined in (2), which might be the most relevant measure in multiclass problems. To present a specific mistake bound, the two terms on the RHS of (6) should be bounded. This requires an approximation of the potentials, which is technical and postponed to Appendix B.2. Lemmas 9 and 10 provide the bounds for those terms. We also mention another bound for the weight in the remark after Lemma 10 so that one can use whichever is tighter. Combining the above lemmas with Theorem 2 gives the following corollary. The additional constraint on γ comes from Lemma 10.

Corollary 3. (0-1 loss bound of OnlineMBBM) Suppose the weak learners and an adversary satisfy the online weak learning condition (1) with parameters δ, γ, and S, where γ < 1/2.
For any T and N satisfying δ ≪ 1/N, and any adaptive sequence of labeled examples generated by the adversary, OnlineMBBM can generate predictions ŷ_t that satisfy the following inequality with probability 1 − Nδ:

Σ_{t=1}^T 1(y_t ≠ ŷ_t) ≤ (k − 1) e^{−γ²N/2} T + Õ(k^{5/2} √N S).   (7)

Therefore, in order to achieve error rate ε, it suffices to use N = Θ((1/γ²) ln(k/ε)) weak learners, which gives an excess loss bound of Θ̃((k^{5/2}/γ) S).

Remark. Note that the above excess loss bound gives a sample complexity bound of Θ̃((k^{5/2}/(εγ)) S). If we use the alternative weight bound to get kNS as an upper bound for the second term in (6), we end up having Õ(kNS). This will give an excess loss bound of Θ̃((k/γ²) S).

We now provide lower bounds on the number of learners and the sample complexity for arbitrary online boosting algorithms to evaluate the optimality of OnlineMBBM under the 0-1 loss. In particular, we construct weak learners that satisfy the online weak learning condition (1) and have almost matching asymptotic error rate and excess loss compared to those of OnlineMBBM as in (7). Indeed, we can prove that the number of learners and the sample complexity of OnlineMBBM are optimal up to logarithmic factors, ignoring the influence of the number of classes k. Our bounds are possibly suboptimal up to polynomial factors in k, and the problem of filling this gap remains open. The detailed proof and a discussion of the gap can be found in Appendix B.3. Our lower bound is a multiclass version of Beygelzimer et al. [7, Theorem 3].

Theorem 4. (Lower bounds for N and T) For any γ ∈ (0, 1/4), δ, ε ∈ (0, 1), and S ≥ (k ln(1/δ))/γ, there exists an adversary with a family of learners satisfying the online weak learning condition (1) with parameters δ, γ, and S, such that to achieve asymptotic error rate ε, an online boosting algorithm requires at least Ω((1/(k²γ²)) ln(1/ε)) learners and a sample complexity of Ω((k/(εγ)) S).

4 An adaptive algorithm

The online weak learning condition imposes minimal assumptions on the asymptotic accuracy of learners, and it obviously leads to a solid theory of online boosting. However, it has two main practical limitations. The first is the difficulty of estimating the edge γ. Given a learner and an adversary, it is by no means a simple task to find the maximum edge that satisfies (1). The second issue is that different learners may have different edges. Some learners may in fact be quite strong, with significant edges, while others are just slightly better than a random guess. In this case, OnlineMBBM has to pick the minimum edge, as it assumes a common γ for all weak learners. This is obviously inefficient in that the booster underestimates the strong learners' accuracy.

Our adaptive algorithm discards the online weak learning condition to provide a more practical method. Empirical edges γ_1, ..., γ_N (see Section 4.2 for the definition) are measured for the weak learners and are used to bound the number of mistakes made by the boosting algorithm.

4.1 Choice of loss function

Adaboost, proposed by Freund et al. [10], is arguably the most popular boosting algorithm in practice. It aims to minimize the exponential loss, and has many variants which use some other surrogate loss.
The main reason for using a surrogate loss is ease of optimization; while the 0-1 loss is not even continuous, most surrogate losses are convex. We adopt a surrogate loss for the same reason, and in this section we discuss our choice of surrogate loss for the adaptive algorithm.

The exponential loss is a very strong candidate in that it provides a closed form for computing the potential functions, which are used to design cost matrices (cf. Mukherjee and Schapire [4, Theorem 13]). One property of the online setting, however, makes it unfavorable. As in OnlineMBBM, each data point will have a different weight depending on the weak learners' performance, and if the algorithm uses the exponential loss, this weight will be an exponential function of the difference in weighted cumulative votes. With such exponentially varying weights among samples, the algorithm might end up depending on a very small portion of the observed samples. This is undesirable because it becomes easier for the adversary to manipulate the sample sequence to perturb the learner.

To overcome exponentially varying weights, Beygelzimer et al. [7] use the logistic loss in their adaptive algorithm. The logistic loss is more desirable in that its derivative is bounded, and thus the weights will be relatively smooth. For this reason, we will also use a multiclass version of the logistic loss:

L^r(s) := Σ_{l ≠ r} log(1 + exp(s[l] − s[r])).   (8)

We still need to compute potential functions from the logistic loss in order to calculate cost matrices. Unfortunately, Mukherjee and Schapire [4] use a unique property of the exponential loss to get a closed form for the potential functions, which cannot be adapted to the logistic loss. However, the optimal cost matrix induced from the exponential loss has a very close connection with the gradient of the loss (cf. Mukherjee and Schapire [4, Lemma 22]).
From this, we will design our cost matrices as follows:

C^i_t[r, l] := 1 / (1 + exp(s^{i−1}_t[r] − s^{i−1}_t[l])), if l ≠ r;
C^i_t[r, r] := −Σ_{j ≠ r} 1 / (1 + exp(s^{i−1}_t[r] − s^{i−1}_t[j])).   (9)

Readers should note that the row vector C^i_t[r] is simply the gradient of L^r(s^{i−1}_t). Also note that this matrix does not belong to C^{eor}_1, but it does guarantee that the correct prediction gets the minimal cost.

The choice of the logistic loss over the exponential loss is somewhat subjective. The undesirable property of the exponential loss does not necessarily mean that we cannot build an adaptive algorithm using this loss. In fact, we can slightly modify Algorithm 2 to develop algorithms using different surrogates (the exponential loss and the squared hinge loss). However, their theoretical bounds are inferior to the one obtained with the logistic loss. Interested readers can refer to Appendix D, which assumes an understanding of Algorithm 2.

4.2 Adaboost.OLM

Our work is a generalization of Adaboost.OL by Beygelzimer et al. [7], from which the name Adaboost.OLM comes, with M standing for multiclass. We introduce the new concept of an expert. From N weak learners, we can produce N experts, where expert i makes its prediction by weighted majority votes among the first i learners. Unlike OnlineMBBM, we allow varying weights α^i_t over the learners. As we are working with the logistic loss, we want to minimize Σ_t L^{y_t}(s^i_t) for each i, where the loss is given in (8). We want to alert the reader that even though the algorithm tries to minimize the cumulative surrogate loss, its performance is still evaluated by the 0-1 loss. The surrogate loss only plays the role of a bridge that makes the algorithm adaptive.

We do not impose the online weak learning condition on the weak learners, but instead just measure the performance of WL^i by its empirical edge γ_i := (Σ_t C^i_t[y_t, l^i_t]) / (Σ_t C^i_t[y_t, y_t]). This empirical edge will be used to bound the number of mistakes made by Adaboost.OLM. By the definition of the cost matrix, we can check

C^i_t[y_t, y_t] ≤ C^i_t[y_t, l] ≤ −C^i_t[y_t, y_t], ∀ l ∈ [k],

from which we can prove −1 ≤ γ_i ≤ 1 for all i. If the online weak learning condition is met with edge γ, then one can show that γ_i ≥ γ with high probability when the sample size is sufficiently large.

Algorithm 2 Adaboost.OLM
1: Initialize: ∀ i, v^i_1 = 1, α^i_1 = 0
2: for t = 1, ..., T do
3:   Receive example x_t
4:   Set s^0_t = 0 ∈ R^k
5:   for i = 1, ..., N do
6:     Compute C^i_t according to (9) and pass it to WL^i
7:     Set l^i_t = WL^i(x_t) and s^i_t = s^{i−1}_t + α^i_t e_{l^i_t}
8:     Set ŷ^i_t = argmax_l s^i_t[l], the prediction of expert i
9:   end for
10:   Randomly draw i_t with P(i_t = i) ∝ v^i_t
11:   Predict ŷ_t = ŷ^{i_t}_t and receive the true label y_t
12:   for i = 1, ..., N do
13:     Set α^i_{t+1} = Π(α^i_t − η_t f^{i′}_t(α^i_t)) using (10) and η_t = 2√2/((k−1)√t)
14:     Set w^i[t] = −C^i_t[y_t, y_t]/(k−1) and pass (x_t, y_t, w^i[t]) to WL^i
15:     Set v^i_{t+1} = v^i_t · exp(−1(y_t ≠ ŷ^i_t))
16:   end for
17: end for

Unlike the optimal algorithm, we cannot show that the last expert, which utilizes all the learners, has the best accuracy. However, we can show that at least one expert has good predictive power. Therefore we will use the classical Hedge algorithm (Littlestone and Warmuth [11] and Freund and Schapire [12]) to randomly choose an expert at each iteration, with adaptive probability weights depending on each expert's prediction history.

Finally, we need to address how to set the weight α^i_t for each weak learner.
As our algorithm tries to minimize the cumulative logistic loss, we want to set α^i_t to minimize Σ_t L^{y_t}(s^{i−1}_t + α^i_t e_{l^i_t}). This is again a classical topic in online learning, and we will use online gradient descent, proposed by Zinkevich [13]. Letting f^i_t(α) := L^{y_t}(s^{i−1}_t + α e_{l^i_t}), we need an online algorithm ensuring Σ_t f^i_t(α^i_t) ≤ min_{α ∈ F} Σ_t f^i_t(α) + R^i(T), where F is a feasible set to be specified later and R^i(T) is a regret that is sublinear in T. To apply Zinkevich [13, Theorem 1], we need f^i_t to be convex and F to be compact. The first assumption is met by our choice of the logistic loss, and for the second assumption we will set F = [−2, 2]. There is no harm in restricting the choice of α^i_t to F because we can always scale the weights without affecting the result of the weighted majority votes.

By taking derivatives, we get

f^{i′}_t(α) = 1 / (1 + exp(s^{i−1}_t[y_t] − s^{i−1}_t[l^i_t] − α)), if l^i_t ≠ y_t;
f^{i′}_t(α) = −Σ_{j ≠ y_t} 1 / (1 + exp(s^{i−1}_t[y_t] + α − s^{i−1}_t[j])), if l^i_t = y_t.   (10)

This provides |f^{i′}_t(α)| ≤ k − 1. Now let Π(·) denote the projection onto F: Π(·) := max{−2, min{2, ·}}. By setting α^i_{t+1} = Π(α^i_t − η_t f^{i′}_t(α^i_t)) with η_t = 2√2/((k−1)√t), we get R^i(T) ≤ 4√2(k−1)√T. Readers should note that any learning rate of the form η_t = c/√t would work, but our choice is optimized to ensure the minimal regret.

The pseudocode for Adaboost.OLM is presented in Algorithm 2. In fact, if we put k = 2, Adaboost.OLM has the same structure as Adaboost.OL. As in OnlineMBBM, the booster also needs to pass the weight along with the labeled instance.
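Line 13 of Algorithm 2, the projected online-gradient-descent step with gradient (10), can be sketched as follows; this is our own illustration, with s_prev holding the cumulative votes s^{i−1}_t (0-indexed labels):

```python
import math

def update_alpha(alpha, s_prev, l_pred, y_true, t, k):
    """One step of projected online gradient descent for alpha^i_t.

    The gradient follows (10): positive when the weak prediction l_pred is
    wrong (pushing alpha down), negative when it is correct (pushing alpha
    up).  Step size eta_t = 2*sqrt(2)/((k-1)*sqrt(t)); Pi clips onto
    F = [-2, 2]."""
    if l_pred != y_true:
        grad = 1.0 / (1.0 + math.exp(s_prev[y_true] - s_prev[l_pred] - alpha))
    else:
        grad = -sum(1.0 / (1.0 + math.exp(s_prev[y_true] + alpha - s_prev[j]))
                    for j in range(k) if j != y_true)
    eta = 2.0 * math.sqrt(2.0) / ((k - 1) * math.sqrt(t))
    return max(-2.0, min(2.0, alpha - eta * grad))
```

For example, with k = 2, a tied vote, and α = 0 at t = 1, a wrong prediction moves α to −√2, while a correct one moves it to +√2; further steps shrink as 1/√t.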
According to (9), it can be inferred that the weight is proportional to −C^i_t[y_t, y_t].

4.3 Mistake bound and comparison to the optimal algorithm

Now we present our second main result, which provides a mistake bound for Adaboost.OLM. The main structure of the proof is adopted from Beygelzimer et al. [7, Theorem 4], but in a generalized cost matrix framework. The proof appears in Appendix C.

Theorem 5. (Mistake bound of Adaboost.OLM) For any T and N, with probability 1 − δ, the number of mistakes made by Adaboost.OLM satisfies the following inequality:

Σ_{t=1}^T 1(y_t ≠ ŷ_t) ≤ (8(k−1) / Σ_{i=1}^N γ_i²) T + Õ(kN² / Σ_{i=1}^N γ_i²),

where the Õ notation suppresses dependence on log(1/δ).

Remark. Note that this theorem naturally implies Beygelzimer et al. [7, Theorem 4]. The difference in coefficients is due to the different scaling of γ_i. In fact, their γ_i ranges over [−1/2, 1/2].

Now that we have established a mistake bound, it is worthwhile to compare the bound with that of the optimal boosting algorithm. Suppose the weak learners satisfy the weak learning condition (1) with edge γ. For simplicity, we will ignore the excess loss S. As we have γ_i = (Σ_t C^i_t[y_t, l^i_t]) / (Σ_t C^i_t[y_t, y_t]) ≥ γ with high probability, the mistake bound becomes (8(k−1)/(γ²N)) T + Õ(kN/γ²). In order to achieve error rate ε, Adaboost.OLM requires N ≥ 8(k−1)/(εγ²) learners and a sample complexity of T = Ω̃(k²/(ε²γ⁴)). Note that OnlineMBBM requires N = Ω((1/γ²) ln(k/ε)) learners and T = min{Ω̃(k^{5/2}/(εγ)), Ω̃(k/(εγ²))}.
Note that OnlineMBBM requires N = Ω((1/γ²) ln(k/ε)) weak learners and T = min{Ω̃(k^{5/2}/(εγ)), Ω̃(k/(εγ²))} samples. Adaboost.OLM is obviously suboptimal, but due to its adaptive feature, its performance on real data is quite comparable to that of OnlineMBBM.

5 Experiments

We compare the new algorithms to existing ones for online boosting on several UCI data sets, each with k classes¹. Table 1 contains some highlights, with additional results and experimental details in Appendix E. Here we show both the average accuracy on the final 20% of each data set and the average run time for each algorithm. Best decision tree gives the performance of the best of 100 online decision trees fit using the VFDT algorithm of Domingos and Hulten [14], which were used as the weak learners in all other algorithms, and Online Boosting is the algorithm of Oza [5]. Both provide a baseline for comparison with the new Adaboost.OLM and OnlineMBBM algorithms. Best MBBM takes the best result from running OnlineMBBM with five different values of the edge parameter γ.

Despite being theoretically weaker, Adaboost.OLM often demonstrates similar accuracy to, and sometimes outperforms, Best MBBM, which exemplifies the power of adaptivity in practice. This power comes from the ability to use diverse learners efficiently, instead of being limited by the strength of the weakest learner. OnlineMBBM suffers from high computational cost, as well as the difficulty of choosing the correct value of γ, which in general is unknown, but when the correct value of γ is used it performs very well.
Finally, in all cases the Adaboost.OLM and OnlineMBBM algorithms outperform both the best tree and the preexisting Online Boosting algorithm, while also enjoying theoretical accuracy bounds.

Table 1: Comparison of algorithm accuracy on the final 20% of each data set and run time in seconds. The best accuracy on each data set is marked with an asterisk (*).

Data sets   k  | Best decision tree | Online Boosting | Adaboost.OLM  | Best MBBM
               |   acc      time    |   acc     time  |   acc    time |   acc      time
Balance     3  |  0.772      19     |  0.754      8   |  0.768     20 |  0.821*      42
Mice        8  |  0.399     263     |  0.561    105   |  0.695*   416 |  0.608     2173
Cars        4  |  0.914      27     |  0.930*    39   |  0.914     59 |  0.924       56
Mushroom    2  |  1.000*    169     |  1.000*   241   |  0.999    355 |  1.000*     325
Nursery     4  |  0.941     302     |  0.966    526   |  0.953    735 |  0.969*    1510
ISOLET     26  |  0.149    1497     |  0.521    470   |  0.515   2422 |  0.635*   64707
Movement    5  |  0.870    3437     |  0.962   1960   |  0.915   5072 |  0.988*   18676

¹ Code is available at https://github.com/yhjung88/OnlineBoostingWithVFDT

Acknowledgments

We acknowledge the support of NSF under grants CAREER IIS-1452099 and CIF-1422157.

References

[1] Marcin Korytkowski, Leszek Rutkowski, and Rafał Scherer. Fast image classification by boosting fuzzy classifiers. Information Sciences, 327:175–182, 2016.

[2] Xiao-Lei Zhang and DeLiang Wang. Boosted deep neural networks and multi-resolution cochleagram features for voice activity detection. In INTERSPEECH, pages 1534–1538, 2014.

[3] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.

[4] Indraneel Mukherjee and Robert E. Schapire. A theory of multiclass boosting. Journal of Machine Learning Research, 14(Feb):437–497, 2013.

[5] Nikunj C. Oza. Online bagging and boosting. In 2005 IEEE International Conference on Systems, Man and Cybernetics, volume 3, pages 2340–2345. IEEE, 2005.

[6] Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. An online boosting algorithm with theoretical justifications.
ICML, 2012.

[7] Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and adaptive algorithms for online boosting. In ICML, 2015.

[8] Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. Boosting with online binary learners for the multiclass bandit problem. In Proceedings of the 31st ICML, pages 342–350, 2014.

[9] Hanzhang Hu, Wen Sun, Arun Venkatraman, Martial Hebert, and Andrew Bagnell. Gradient boosting on stochastic data streams. In Artificial Intelligence and Statistics, pages 595–603, 2017.

[10] Yoav Freund, Robert Schapire, and N. Abe. A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14(771-780):1612, 1999.

[11] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. In 30th Annual Symposium on Foundations of Computer Science, pages 256–261. IEEE, 1989.

[12] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.

[13] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th ICML, 2003.

[14] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80. ACM, 2000.

[15] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. In COLT, pages 207–232, 2011.

[16] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.

[17] Volodimir G. Vovk. Aggregating strategies. In Proceedings of the Third Workshop on Computational Learning Theory, pages 371–383. Morgan Kaufmann, 1990.

[18] Nicolò Cesa-Bianchi and Gábor Lugosi.
Prediction, Learning, and Games. Cambridge University Press, 2006.

[19] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[20] Robert E. Schapire. Drifting games. Machine Learning, 43(3):265–291, 2001.

[21] Eric V. Slud. Distribution inequalities for the binomial law. The Annals of Probability, pages 404–412, 1977.

[22] C. L. Blake and C. J. Merz. UCI machine learning repository, 1998. URL http://archive.ics.uci.edu/ml.

[23] C. Higuera, K. J. Gardiner, and K. J. Cios. Self-organizing feature maps identify proteins critical to learning in a mouse model of Down syndrome. PLoS ONE, 2015.

[24] Wallace Ugulino, Débora Cardador, Katia Vega, Eduardo Velloso, Ruy Milidiú, and Hugo Fuks. Wearable computing: Accelerometers' data classification of body postures and movements. In Advances in Artificial Intelligence - SBIA 2012, pages 52–61. Springer, 2012.