{"title": "Reservoir Boosting : Between Online and Offline Ensemble Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1412, "page_last": 1420, "abstract": "We propose to train an ensemble with the help of a reservoir in which the learning algorithm can store a limited number of samples. This novel approach lies in the area between offline and online ensemble approaches and can be seen either as a restriction of the former or an enhancement of the latter. We identify some basic strategies that can be used to populate this reservoir and present our main contribution, dubbed Greedy Edge Expectation Maximization (GEEM), that maintains the reservoir content in the case of Boosting by viewing the samples through their projections into the weak classifier response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several compute-vision data-sets, on which it outperforms both online and offline methods in a memory constrained setting.", "full_text": "Reservoir Boosting : Between Online and Of\ufb02ine\n\nEnsemble Learning\n\nLeonidas Lefakis\n\nIdiap Research Institute\nMartigny, Switzerland\n\nleonidas.lefakis@idiap.ch\n\nFranc\u00b8ois Fleuret\n\nIdiap Research Institute\nMartigny, Switzerland\n\nfrancois.fleuret@idiap.ch\n\nAbstract\n\nWe propose to train an ensemble with the help of a reservoir in which the learning\nalgorithm can store a limited number of samples.\nThis novel approach lies in the area between of\ufb02ine and online ensemble approaches\nand can be seen either as a restriction of the former or an enhancement of the latter.\nWe identify some basic strategies that can be used to populate this reservoir and\npresent our main contribution, dubbed Greedy Edge Expectation Maximization\n(GEEM), that maintains the reservoir content in the case of Boosting by viewing\nthe samples through their projections into the weak classi\ufb01er 
response space. We propose an efficient algorithmic implementation which makes it tractable in practice, and demonstrate its efficiency experimentally on several computer-vision datasets, on which it outperforms both online and offline methods in a memory-constrained setting.

1 Introduction

Learning a boosted classifier from a set of samples S = {X, Y}^N ∈ R^D × {−1, 1} is usually addressed in the context of two main frameworks. In offline Boosting settings [10] it is assumed that the learner has full access to the entire dataset S at any given time. At each iteration t, the learning algorithm calculates a weight w_i for each sample i (the derivative of the loss with respect to the classifier response on that sample) and feeds these weights together with the entire dataset to a weak learning algorithm, which learns a predictor h_t. The coefficient a_t of the chosen weak learner h_t is then calculated based on its weighted error. There are many variations of this basic model, too many to mention here, but a common aspect is that they do not explicitly address the issue of limited resources. It is assumed that the dataset can be efficiently processed in its entirety at each iteration. In practice however, memory and computational limitations may make such learning approaches prohibitive or at least inefficient.
A common approach used in practice to deal with such limitations is that of sub-sampling the dataset using strategies based on the sample weights W [9, 13]. Though these methods address the limits of the weak learning algorithm's resources, they nonetheless assume a) access to the entire dataset at all times, and b) the ability to calculate the weights W of the N samples and to sub-sample K of these, all in an efficient manner. 
The issues with such an approach can be seen in tasks such as computer vision, where samples must not only be loaded sequentially into memory when they do not all fit (which may in itself be computationally prohibitive), but furthermore, once loaded, they must be pre-processed, for example by extracting descriptors, making the calculation of the weights themselves a computationally expensive process.
For large datasets, in order to address such issues, the framework of online learning is frequently employed. Online Boosting algorithms [15] typically assume access solely to a Filter() function, through which they mine samples from the dataset, typically one at a time. Due to their online nature, such approaches typically treat the weak learning algorithm as a black box, assume that it can be trained in an online manner, and concentrate on different approaches to calculating the weak learner coefficients [15, 4]. A notable exception are the works of [11] and [14], where weak learner selectors are introduced, one for each weak learner in the ensemble, which are capable of picking a weak learner from a predetermined pool. All these approaches however are similar in that they are forced to predetermine the number of weak learners in the boosted strong classifier.
We propose here a middle ground between these two extremes, in which the boosted classifier can store some of the already processed samples in a reservoir, possibly keeping them through multiple rounds of training. As in online learning we assume access only to a Filter(), through which we can draw a batch Q_t of samples at each Boosting iteration. This setting is related to the framework proposed in [2] for dealing with large datasets; the method proposed there however uses the filter to obtain a sample and stochastically accepts or rejects the sample based on its weight. 
The drawback of this approach is a) that after each iteration all old samples are discarded, and b) that the algorithm must process an increasing number of samples at each iteration as the weights become increasingly smaller. We propose instead to acquire a fixed number of samples at each iteration and to add these to a persistent reservoir, discarding only a subset. The only other work we know of which trains a Boosting classifier in a similar manner is [12], where the authors are solely concerned with learning in the presence of concept drift and do not propose a strategy for filling this reservoir. Rather, they use a simple sliding window approach and concentrate on the removal and addition of weak learners to tackle this drift.
A related concept to the work presented here is that of learning on a budget [6], where, as in the online learning setting, samples are presented one at a time to the learner, a perceptron, which builds a classification model by retaining an active subset of these samples. The main concern in this context is the complexity of the model itself and its effect, via the Gram matrix computation, on both training and test time. Subsequent work on budget perceptrons has led to tighter budgets [16] (at higher computational costs), while [3] proved that such approaches are mistake-bound.
Similar work on Support Vector Machines [1] proposed LaSVM, an SVM solver which was shown to converge to the SVM QP solution by adopting a scheme composed of two alternating steps, which consider respectively the expansion and contraction of the support vector set via the SMO algorithm. 
SVM budgeted learning was also considered in [8] via an L1-SVM formulation which allows users to specifically set a budget parameter B, and subsequently minimizes the loss on the B worst-classified examples.
As noted, these approaches are concerned with the complexity of the classification model, that is, the budget refers to the number of samples which have non-zero coefficients in the dual representation of the classifier. In this respect our work is only loosely related to what is often referred to as budget learning, in that we solve a qualitatively different task, namely addressing the complexity of parsing and processing the data during training.

Table 1: Notation

R_t        the contents of the reservoir at iteration t
|R_t|      the size of the reservoir
Q_t        the fresh batch of samples at iteration t
Σ_AA       the covariance matrix of the edges h ∘ y
μ_A        the expectation of the edges of samples in set A
y_A        the vector of labels {−1, 1}^|A| of samples in A
w_t        the vector of Boosting weights at iteration t
F_t(x)     the constructed strong classifier at iteration t
Filter()   a filter returning samples from S
h_t        the weak learner chosen at iteration t
H          the family of weak learners
∘          component-wise (Hadamard) product
T          number of weak learners in the strong classifier

Table 2: Boosting with a Reservoir

Construct R_0 and Q_0 with r and q calls to Filter().
for t = 1, . . . , T do
    Discard q samples from R_{t−1} ∪ Q_{t−1} to obtain R_t
    Select h_t using the samples in R_t
    Compute a_t using R_t
    Construct Q_t with q calls to Filter()
end for
Return F_T = ∑_{t=1}^{T} a_t h_t

2 Reservoir of samples

In this section we present in more detail the framework of learning a boosted classifier with the help of a reservoir. 
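The loop of Table 2 can be sketched in a few lines. This is a minimal sketch, not the paper's implementation: `filter_fn` stands in for Filter(), and `select_reservoir` and `train_weak` are hypothetical placeholders for the discard strategy and the weak learning step.

```python
def reservoir_boosting(filter_fn, select_reservoir, train_weak, T, r, q):
    """Sketch of the reservoir Boosting loop of Table 2 (hypothetical helpers)."""
    R = filter_fn(r)                    # reservoir R_0, r calls to Filter()
    Q = filter_fn(q)                    # fresh batch Q_0
    ensemble = []                       # pairs (a_t, h_t)
    for t in range(T):
        R = select_reservoir(R + Q, r)  # discard q of the r + q samples -> R_t
        h, a = train_weak(R)            # choose h_t and its coefficient a_t on R_t
        ensemble.append((a, h))
        Q = filter_fn(q)                # fresh batch Q_{t+1}
    return lambda x: sum(a * h(x) for a, h in ensemble)  # F_T = sum_t a_t h_t
```

Any of the strategies of Section 3, including GEEM, plugs in as `select_reservoir`.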
As mentioned, the batch version of Boosting consists of iteratively selecting a weak learner h_t at each iteration t, based on the loss reduction it induces on the full training set S. In the reservoir setting, weak learners are selected solely from the information provided by the samples contained in the reservoir R_t.
Let N be the number of training samples, and S = {1, . . . , N} the set of their indexes. We consider here one iteration of a Boosting procedure, where each sample is weighted according to its contribution to the overall loss. Let y ∈ {−1, 1}^N be the sample labels, and H ⊂ {−1, 1}^N the set of weak-learners, each identified with its vector of responses over the samples. Let w ∈ R^N_+ be the sample weights at that Boosting iteration.
For any subset of sample indexes B ⊂ {1, . . . , N} let y_B ∈ {−1, 1}^|B| be the "extracted" vector. We define w_B similarly, and for any weak learner h ∈ H we let h_B ∈ {−1, 1}^|B| stand for the vector of the |B| responses over the samples in B.
At each iteration t, the learning algorithm is presented with a batch of fresh samples Q_t ⊂ S, |Q_t| = q, and must choose r samples from the full set of samples R_t ∪ Q_t at its disposal, in order to build R_{t+1} with |R_{t+1}| = r, which it subsequently uses for training.
Using the samples from R_t, the learner chooses a weak learner h_t ∈ H to maximize ⟨h_{t,R_t} ∘ y_{R_t}, w_{t,R_t}⟩, where ∘ stands for the Hadamard component-wise vector product. Maximizing this latter quantity corresponds to minimizing the weighted error estimated on the samples currently in R_t. The weight a_t of the selected weak learner can also be estimated with R_t.
The learner then receives a fresh batch of samples Q_{t+1} and the process continues iteratively. See the algorithm in Table 2. In the following we address the issue of which strategy to employ to discard the q samples at each time step t. To our knowledge, no previous work has been published in this or a similar framework.

3 Reservoir Strategies

In the following we present a number of strategies for populating the reservoir, i.e. for choosing which q samples from R_t ∪ Q_t to discard. We begin by identifying three basic and rather straightforward approaches. Max Weights (Max): at each iteration t the weight vector w_{t,R_t∪Q_t} is computed for the r + q samples, and the r samples with the largest weights are kept. Weighted Sampling (WSam): as above, w_{t,R_t∪Q_t} is computed, then normalized to sum to 1, and used as a distribution from which to sample r samples to keep, without replacement. Random Sampling (Rand): the reservoir is constructed by sampling uniformly r samples from the r + q available, without replacement.
These will serve mainly as benchmark baselines against which we will compare our proposed method, presented below, which is more sophisticated and, as we show empirically, more efficient. These baselines are presented to highlight that a more sophisticated reservoir strategy is needed to ensure competitive performance, rather than to serve as examples of state-of-the-art baselines.
Our objective will be to populate the reservoir with samples that will allow for an optimal selection of weak learners, as close as possible to the choice we would make if we could keep all samples.
The issue at hand is similar to that of feature selection: the selected samples should be jointly informative for choosing the good weak learners. This forces us to find a proper balance between the individual importance of the kept samples (i.e.
choosing those with large weights) and maximizing the heterogeneity of the weak learners' responses on them.

3.1 Greedy Edge Expectation Maximization

In the reservoir setting, it makes sense that, given a set of samples A from which we must discard samples and retain only a subset B, what we would like is to retain a training set that is as representative as possible of the entire set A. Ideally, we would like B to be such that if we pick the optimal weak-learner according to the samples it contains,

h* = argmax_{h ∈ H} ⟨h_B ∘ y_B, w_B⟩    (1)

it maximizes the same quantity estimated on all the samples in A, i.e. we want ⟨h*_A ∘ y_A, w_A⟩ to be large.
There may be many weak-learners in H that have the exact same responses as h* on the samples in B, and since we consider a situation where we will not have access to the samples from A \ B anymore, we model the choice among these weak-learners as a random choice. In that case, a good h* is one maximizing

E_{H∼U(H)} (⟨H_A ∘ y_A, w_A⟩ | H_B = h*_B),    (2)

that is, the average of the scores on the full set A of the weak-learners which coincide with h* on the retained set B.
We propose to model the distribution U(H) with a normal law. If H is picked uniformly in H, under a reasonable assumption of symmetry, we propose

H ∘ y ∼ N(μ, Σ)    (3)

where μ is the vector of dimension N of the expectations of the weak learner edges, and Σ is a covariance matrix of size N × N. 
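Conditioning the Gaussian model (3) on the responses retained on B gives the expected score over A in closed form. The following numpy sketch (hypothetical function name, with B given as an index array into A) evaluates it:

```python
import numpy as np

def geem_score(mu, Sigma, w, y, B, h_star_B):
    """Expected edge over the full set A, conditioned on the weak learner's
    responses h_star_B on the retained subset B, under H o y ~ N(mu, Sigma)."""
    n = len(mu)
    Bbar = np.setdiff1d(np.arange(n), B)       # indexes of A \ B
    e_B = h_star_B * y[B]                      # observed edges h*_B o y_B
    # conditional mean of the edges on A \ B, given the edges observed on B
    cond = mu[Bbar] + Sigma[np.ix_(Bbar, B)] @ np.linalg.solve(
        Sigma[np.ix_(B, B)], e_B - mu[B])
    return w[B] @ e_B + w[Bbar] @ cond
```

The greedy procedure of Section 3.4 amounts to maximizing this score over subsets B obtained by discarding one sample at a time.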
Under this model, if B̄ = A \ B, and with Σ_{A,B} denoting an extracted sub-matrix, we have

E_{H∼U(H)} (⟨H_A ∘ y_A, w_A⟩ | H_B = h*_B)    (4)
= E_{H∘y∼N(μ,Σ)} (⟨H_A ∘ y_A, w_A⟩ | H_B = h*_B)    (5)
= ⟨h*_B ∘ y_B, w_B⟩ + E_{H∘y∼N(μ,Σ)} (⟨H_B̄ ∘ y_B̄, w_B̄⟩ | H_B = h*_B)    (6)
= ⟨(h*_B ∘ y_B), w_B⟩ + ⟨μ_B̄ + Σ_{B̄B} Σ_{BB}^{−1} (h*_B ∘ y_B − μ_B), w_B̄⟩    (7)

Though the modeling of the discrete variables H ∘ y by a continuous distribution may seem awkward, we point out two important aspects. Firstly, the parametric modeling allows for an analytical expression for the calculation of (2). Given that we seek to maximize this value over the possible subsets B of A, an analytic approach is necessary for the algorithm to retain tractability. Secondly, for a given vector of edges h*_B ∘ y_B on B, the vector μ_B̄ + Σ_{B̄B} Σ_{BB}^{−1} (h*_B ∘ y_B − μ_B) is not only the conditional expectation of h*_B̄ ∘ y_B̄, but also its optimal linear predictor in a least squares error sense.
We note that choosing B based on (7) requires estimates of three quantities: the expected weak-learner edges μ_A, the covariance matrix Σ_AA, and the weak learner h* trained on B. Given these quantities, we must also develop a tractable optimization scheme to find the B maximizing it.

3.2 Computing Σ and μ

As mentioned, the proposed method requires in particular an estimate of the vector of expected edges μ_A of the samples in A, as well as the corresponding covariance matrix Σ_AA. In practice, the estimation of the above depends on the nature of the weak learner family H. 
In the case of classification stumps, which we use in the experiments below, both these values can be calculated at a small computational cost.
A classification stump is a simple classifier h_{θ,α,d} which, for a given threshold θ ∈ R, polarity α ∈ {−1, 1}, and feature index d ∈ {1, . . . , D}, has the following form:

∀x ∈ R^D, h_{θ,α,d}(x) = 1 if α x_d ≥ α θ, and −1 otherwise    (8)

where x_d refers to the value of the dth component of x.
In practice, when choosing the optimal stump for a given set of samples A, a learner would sort all the samples according to each of the D dimensions, and for each dimension d it would consider stumps with thresholds θ between two consecutive samples in that sorted list.
For this family of stumps H, and given that we consider both polarities, E_h(h_A ∘ y_A) = 0.
The covariance of the edges of two samples can also be calculated efficiently, with O(|A|²D) complexity. For two given samples i, j we have

∀h ∈ H, y_i h_i y_j h_j ∈ {−1, 1}.    (9)

Having sorted the samples along a specific dimension d, we have that for α = 1, y_i h_i y_j h_j ≠ y_i y_j for those weak learners which disagree on these samples, i.e. those with min(x_i^d, x_j^d) < θ < max(x_i^d, x_j^d). If I_i^d, I_j^d are the indexes of the samples in the sorted list, then there are |I_j^d − I_i^d| such disagreeing weak learners for α = 1 (plus the same quantity for α = −1). Given that for each dimension d there correspond 2(|A| − 1) weak-learners in H, we reach the following update rule, ∀d, ∀{i, j}:

Σ_AA(i, j) += y_i y_j (2(|A| − 1) − 4 |I_j^d − I_i^d|)    (10)

where Σ_AA(i, j) refers to the (i, j) element of Σ. As can be seen, this leads to a cost of O(|A|²D). Given that commonly D ≫ |A|, this cost should not be much higher than O(D|A| log |A|), the cost of sorting along the D dimensions.

3.3 Choice of h*

As stated, the estimation of h* for a given B must be computationally efficient. We could further commit to the Gaussian assumption by defining p(h* = h), ∀h ∈ H, i.e. the probability that a weak learner h will be the chosen one given that it will be trained on B, and integrating over H; this however, though consistent with the Gaussian assumption, is computationally prohibitive. Rather, we present here two cheap alternatives, both of which perform well in practice.
The first and simplest strategy is to use ∀B, h*_B ∘ y_B = (1, . . . , 1), which is equivalent to making the assumption that the training process will result in a weak learner which performs perfectly on the training data B. This is exactly what the process will strive to achieve, however unlikely it may be.
The second is to generate a number |H_Lattice| of weak learner edges by sampling on the {−1, 1}^|B| lattice, using the Gaussian H ∘ y ∼ N(μ_B, Σ_BB) restricted to this lattice, and to keep the optimal h* = argmax_{h ∈ H_Lattice} ⟨h_B ∘ y_B, w_B⟩. We can further simplify this process by considering the whole set A and the lattice {−1, 1}^|A|, and simply extracting the values h*_B for the different subsets B. Though much more complex, this approach can be implemented extremely efficiently; experiments showed however that the simple rule ∀B, h*_B ∘ y_B = (1, . . . , 1) works just as well in practice and is considerably cheaper. 
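For stumps, rule (10) only needs, for every dimension, the positions of the samples in the sorted order. A sketch of this O(|A|²D) computation (hypothetical function name; unnormalized by the size of H, which does not affect the comparison of subsets):

```python
import numpy as np

def stump_edge_covariance(X, y):
    """Accumulate rule (10): Sigma_AA(i, j) += y_i y_j (2(|A|-1) - 4 |I_j^d - I_i^d|),
    over all dimensions d, for the stump family with both polarities."""
    n, D = X.shape
    Sigma = np.zeros((n, n))
    yy = np.outer(y, y)
    for d in range(D):
        ranks = np.empty(n, dtype=int)
        ranks[np.argsort(X[:, d])] = np.arange(n)       # I^d: rank in sorted order
        diff = np.abs(ranks[:, None] - ranks[None, :])  # |I_j^d - I_i^d|
        Sigma += yy * (2 * (n - 1) - 4 * diff)
    return Sigma
```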
In the following experiments we present results solely for this first rule.

3.4 Greedy Calculation of argmax_B

Despite the analytical formulation offered by our Gaussian assumption, an exact maximization over all possible subsets remains computationally intractable. For this reason we propose a greedy, computationally bounded approach to building the reservoir population.
We initialize the set B = A, i.e. initially we assume we are keeping all the samples, and calculate Σ_BB^{−1}. The greedy process then iteratively goes through the |B| samples in B and finds the sample j such that, for B′ = B \ {j}, the value

⟨Σ_{B̄′B′} Σ_{B′B′}^{−1} (h*_B′ ∘ y_B′), w_B̄′⟩ + ⟨h*_B′ ∘ y_B′, w_B′⟩    (11)

is maximized, where, in this context, h* refers to the weak learner chosen by training on B′. This process is repeated q times, i.e. until |B̄| = q, discarding one sample at each iteration.
In the experiments presented here, we stop the greedy subset selection after these q steps. In practice however the subset selection can continue by choosing pairs k, j to swap between the two sets. In our experiments we did not notice any gain from further optimization of the subset B.

3.5 Evaluation of E(⟨h*_A ∘ y_A, w_A⟩ | B)

Each step in the above greedy process requires going through all the samples j in the current B and calculating E(⟨h*_A ∘ y_A, w_A⟩ | B′) for B′ = B \ {j}.
In order for our method to be computationally tractable, we must be able to compute this value at a limited computational cost. The naive approach of calculating the value from scratch for each j would cost O(|B′|³ + |B̄′||B|). 
The main computational cost here is the first term, incurred in calculating the inverse of the covariance matrix Σ_{B′B′}, which results from the matrix Σ_BB by removing a single row and column. It is thus important to be able to perform this calculation at a low computational cost.

3.5.1 Updating Σ_{B′B′}^{−1}

For a given matrix M and its inverse M^{−1}, we would like to efficiently calculate the inverse of M_{−j}, which results from M by the deletion of row and column j.
It can be shown that the inverse of the matrix M_{e_j}, which results from M by the substitution of row and column j by the basis vector e_j, is given by the following formula:

M_{e_j}^{−1} = M^{−1} − (1 / M^{−1}_{jj}) M^{−1}_{*j} M^{−1}_{j*} + e_j^T e_j    (12)

where M^{−1}_{*j} stands for the vector of elements of the jth column of M^{−1} and M^{−1}_{j*} for the vector of elements of its jth row. We omit the proof (a relatively straightforward manipulation of the Sherman-Morrison formula) due to space constraints. The inverse M_{−j}^{−1} can be recovered by simply removing the jth row and column of M_{e_j}^{−1}.
Based on this we can compute Σ_{B′B′}^{−1} in O(|B|²). We further exploit the fact that the matrices Σ_{B̄′B′} and Σ_{B′B′}^{−1} enter into the calculations only through the products Σ_{B′B′}^{−1} h*_B′ and w_B̄′^T Σ_{B̄′B′}. Thus, by pre-calculating the products Σ_BB^{−1} h*_B and w_B̄^T Σ_{B̄B} once at the beginning of each greedy optimization step, we can incur a cost of O(|B|) for each sample j and an O(|B|²) cost overall.

3.6 Weights w̃_B

GEEM provides a method for selecting which samples to keep and which to discard. 
However, in doing so it creates a biased sample B of the set A, and consequently the weights w_B are not representative of the weight distribution w_A. It is thus necessary to alter the weights w_B to obtain a new weight vector w̃_B which takes this bias into account. Based on the assumption (3), on (7), and on the fact that μ_A = 0, we set

w̃_B = w_B + (w_B̄^T Σ_{B̄B} Σ_{BB}^{−1})^T    (13)

The resulting weight vector w̃_B used to pick the weak-learner h* correctly reflects the entire set A = R_t ∪ Q_t (under the Gaussian assumption).

3.7 Overall Complexity

The proposed method GEEM comprises, at each boosting iteration, three main steps: (1) the calculation of Σ_AA, (2) the optimization of B, and (3) the training of the weak learner h_t.
The third step is common to all the reservoir strategies presented here. In the case of classification stumps, by presorting the samples along each dimension and exploiting the structure of the hypothesis space H, we can incur a cost of O(D|B| log |B|), where D is the dimensionality of the input space. The first step, as mentioned, incurs a cost of O(|A|²D) if we go through all dimensions of the data; however the minimum objective of acquiring an invertible matrix Σ_AA can be met by only looking at |A| dimensions and incurring a cost of O(|A|³). 
Finally, the second step, as analyzed in the previous section, incurs a cost of O(q|A|²).
Thus the overall complexity of the proposed method is O(|A|³ + D|A| log |A|), which in practice should not be significantly larger than O(D|B| log |B|), the cost of the remaining reservoir strategies. We note that this analysis ignores the cost of processing the incoming samples Q_t, which is also common to all strategies; depending on the task, this cost may handily dominate all others.

4 Experiments

In order to experimentally validate both the framework of reservoir boosting and the proposed method GEEM, we conducted experiments on four popular computer vision datasets.
In all our experiments we use LogitBoost for training. It attempts to minimize the logistic loss, which is less aggressive than the exponential loss. Initial experiments with the exponential loss in a reservoir setting showed it to be unstable and to lead to degraded performance for all the reservoir strategies presented here. In [14] the authors performed an extensive comparison in an online setting and also found LogitBoost to yield the best results. We set the number of weak learners T in the boosted classifier to T = 250, common to all methods. In the case of the online boosting algorithms this translates to fixing the number of weak learners.
Finally, for the methods that use a reservoir, that is GEEM and the baselines outlined in Section 3, we set r = q. Thus at every iteration the reservoir is populated with |R_t| = r samples and the algorithm receives a further |Q_t| = r samples from the filter. The reservoir strategy is then used to discard r of these samples to build R_{t+1}.

4.1 Data-sets

We used four standard datasets: CIFAR-10 is a recognition dataset consisting of 32 × 32 images of 10 distinct classes depicting vehicles and animals. The training data consists of 5000 images of each class. 
We pre-process the data as in [5], using code provided by the authors. MNIST is a well-known optical digit recognition dataset comprising 60000 images of size 28 × 28 of the digits 0–9. We do not preprocess the data in any way, using the raw pixels as features. INRIA is a pedestrian detection dataset; the training set consists of 12180 images of size 64 × 128 of both pedestrians and background images, from which we extract HoG features [7]. STL-10 is an image recognition dataset consisting of images of size 96 × 96 belonging to 10 classes, each represented by 500 images in the training set. We pre-process the data as for CIFAR.

4.2 Baselines

The baselines for the reservoir strategy have already been outlined in Section 3; we also benchmarked three online Boosting algorithms: Oza [15], Chen [4], and Bisc [11]. The first two algorithms treat weak learners as a black box but predefine their number. We initialize the weak learners of these approaches by running LogitBoost offline using a subset of the training set, as we found that randomly sampling the weak learners led to very poor performance; thus, though they are online algorithms, in the experiments presented here they are afforded an offline initialization step. Note that these approaches are not mutually exclusive with the proposed method, as the weak learners picked by GEEM can be combined with an online boosting algorithm optimizing their coefficients. For the final method [11], we set the number of selectors to K = 250, resulting in the same number of weak learners as for the other methods. We also conducted experiments with [14], which is closely related to [11]; however, as it performed consistently worse than [11], we do not show those results here.
Finally, we compared our method against two sub-sampling methods that have access to the full dataset and subsample r samples using a weighted sampling routine. 
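The weighted sampling step shared by these methods and by the WSam baseline of Section 3 reduces to sampling r indexes without replacement from the normalized weights; a minimal sketch (hypothetical function name):

```python
import numpy as np

def weighted_subsample(w, r, rng=None):
    """Keep r of the candidates, sampled without replacement with probability
    proportional to their Boosting weights w."""
    rng = rng or np.random.default_rng(0)
    p = np.asarray(w, dtype=float)
    p = p / p.sum()                 # normalize the weights into a distribution
    return rng.choice(len(p), size=r, replace=False, p=p)
```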
At each iteration, these methods compute the boosting weights of all the samples in the dataset and use weighted sampling to obtain a subset R_t. The first method is a simple weighted sampling method (WSS), while the second is Madaboost (Mada), which combines weighted sampling with a weight adjustment for the sub-sampled samples. We furthermore show a comparison with a fixed reservoir baseline (Fix); this baseline subsamples the dataset once prior to learning and then trains the ensemble using offline Adaboost, so the contents of the reservoir do not change from iteration to iteration.

5 Results and Discussion

Tables 3, 4, and 5 list respectively the performance of the reservoir baselines, the online Boosting techniques, and the sub-sampling methods. Each table also presents the performance of our GEEM approach in the same settings.

Table 3: Test accuracy on the four datasets for the different reservoir strategies
(for each method, left column: r = 100, right column: r = 250)

Dataset   Max                          Rand                         WSam                         GEEM
CIFAR     29.59 (0.59)  29.16 (0.71)   46.02 (0.35)  45.88 (0.24)   48.92 (0.34)  50.09 (0.24)   50.96 (0.36)  54.87 (0.28)
STL       30.20 (0.75)  30.72 (0.82)   39.25 (0.32)  39.40 (0.25)   41.60 (0.39)  42.93 (0.30)   42.40 (0.65)  45.70 (0.38)
INRIA     95.57 (0.49)  96.31 (0.37)   91.54 (0.49)  91.72 (0.35)   94.29 (0.23)  94.63 (0.30)   97.21 (0.21)  97.52 (0.13)
MNIST     66.74 (1.45)  68.25 (0.81)   79.97 (0.24)  79.59 (0.22)   83.96 (0.29)  84.07 (0.23)   84.66 (0.30)  84.33 (0.33)

Table 4: Comparison of GEEM with online boosting algorithms

Dataset   Oza           Chen          Bisc          GEEM (r=250)
CIFAR     39.40 (1.91)  45.03 (0.93)  49.16 (0.40)  54.87 (0.28)
STL       33.09 (1.49)  36.35 (0.49)  39.98 (0.56)  45.70 (0.38)
INRIA     94.23 (0.97)  95.65 (0.38)  95.50 (0.49)  97.53 (0.13)
MNIST     80.99 (1.11)  85.25 (0.82)  84.85 (0.54)  84.33 (0.33)

Table 5: Comparison of GEEM with subsampling algorithms
(WSS, Mada, and GEEM: left column r = 100, right column r = 250; Fix: left column r = 1,000, right column r = 2,500)

Dataset   WSS                          Mada                         Fix                          GEEM
CIFAR     50.38 (0.38)  51.66 (0.30)   48.87 (0.26)  49.44 (0.33)   48.41 (0.88)  52.40 (0.77)   50.96 (0.36)  54.87 (0.28)
STL       42.54 (0.35)  44.07 (0.31)   41.36 (0.32)  42.34 (0.24)   42.04 (0.19)  46.07 (0.41)   42.40 (0.65)  45.70 (0.38)
INRIA     94.24 (0.30)  94.65 (0.16)   94.26 (0.27)  94.65 (0.10)   92.46 (0.67)  93.82 (0.74)   97.21 (0.21)  97.53 (0.13)
MNIST     84.21 (0.27)  84.51 (0.16)   79.00 (0.33)  78.99 (0.31)   85.37 (0.33)  88.02 (0.15)   84.66 (0.30)  84.33 (0.33)

As can be seen, GEEM outperforms the other reservoir strategies on three of the four datasets and performs on par with the best on the fourth (MNIST). It also outperforms the online Boosting techniques on three datasets and is on par with the best baseline on MNIST. Finally, GEEM performs better than all the sub-sampling algorithms. Note that the Fix baseline was provided with ten times the number of samples to reach a similar level of performance.
These results demonstrate that both the reservoir framework we propose for Boosting and the specific GEEM algorithm provide performance greater than or on par with existing state-of-the-art methods. When compared with the other reservoir strategies, GEEM suffers from a larger complexity, which translates to a longer training time. For the INRIA dataset and r = 100, GEEM requires circa 70 seconds for training as opposed to 50 for the WSam strategy, while for r = 250 GEEM takes approximately 320 seconds to train compared to 70 for WSam. We note however that even when equating training time, which translates to using r = 100 for GEEM and r = 250 for WSam, GEEM still outperforms the simpler reservoir strategies. The timing results on the other three datasets were similar in this respect.
Many points can still be improved. 
In our ongoing research we are investigating different approaches to modeling the process of evaluating h∗; it is of course of particular importance that this model be both reasonable and fast to compute. One approach is to consider the maximum a posteriori value of h∗ by drawing on elements of extreme value theory.
We have further plans to adapt this framework, and the proposed method, to a series of other settings. It could be applied in the context of parallel processing, where a dataset can be split among CPUs, each training a classifier on a different portion of the data.
Finally, we are also investigating the method's suitability for active learning tasks and dataset creation. We note that the proposed method GEEM is not given information concerning the labels of the samples, but simply the expectation and covariance matrix of the edges.

Acknowledgments

This work was supported by the European Community's Seventh Framework Programme FP7 - Challenge 2 - Cognitive Systems, Interaction, Robotics - under grant agreement No 247022 - MASH.

References
[1] Antoine Bordes, Seyda Ertekin, Jason Weston, and Léon Bottou. Fast kernel classifiers with online and active learning. J. Mach. Learn. Res., 6:1579–1619, December 2005.
[2] Joseph K. Bradley and Robert E. Schapire. Filterboost: Regression and classification on large datasets. In NIPS, 2007.
[3] Nicolò Cesa-Bianchi and Claudio Gentile. Tracking the best hyperplane with a simple budget perceptron. In Proc. of the Nineteenth Annual Conference on Computational Learning Theory, pages 483–498. Springer-Verlag, 2006.
[4] Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. An online boosting algorithm with theoretical justifications. In John Langford and Joelle Pineau, editors, ICML '12, pages 1007–1014, New York, NY, USA, July 2012. Omnipress.
[5] Adam Coates and Andrew Ng.
The importance of encoding versus training with sparse coding and vector quantization. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 921–928, New York, NY, USA, June 2011. ACM.
[6] Koby Crammer, Jaz S. Kandola, and Yoram Singer. Online classification on a budget. In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors, NIPS. MIT Press, 2003.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 886–893, 2005.
[8] Ofer Dekel and Yoram Singer. Support vector machines on a budget. In NIPS, pages 345–352, 2006.
[9] Carlos Domingo and Osamu Watanabe. MadaBoost: A modification of AdaBoost. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, COLT '00, pages 180–189, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.
[10] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119–139, August 1997.
[11] Helmut Grabner and Horst Bischof. On-line boosting and vision. In CVPR (1), pages 260–267, 2006.
[12] Mihajlo Grbovic and Slobodan Vucetic. Tracking concept change with incremental boosting by minimization of the evolving exponential loss. In ECML PKDD '11, pages 516–532, Berlin, Heidelberg, 2011. Springer-Verlag.
[13] Zdenek Kalal, Jiri Matas, and Krystian Mikolajczyk. Weighted sampling for large-scale boosting. In BMVC, 2008.
[14] C. Leistner, A. Saffari, P.M. Roth, and H. Bischof. On robustness of on-line boosting – a competitive study. In IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pages 1362–1369, 2009.
[15] Nikunj C. Oza and Stuart Russell. Online bagging and boosting. In Artificial Intelligence and Statistics 2001, pages 105–112. Morgan Kaufmann, 2001.
[16] Antoine Bordes, Jason Weston, and Léon Bottou. Online (and offline) on an even tighter budget. In Artificial Intelligence and Statistics 2005, 2005.