{"title": "A Bayesian Approach to Concept Drift", "book": "Advances in Neural Information Processing Systems", "page_first": 127, "page_last": 135, "abstract": "To cope with concept drift, we placed a probability distribution over the location of the most-recent drift point. We used Bayesian model comparison to update this distribution from the predictions of models trained on blocks of consecutive observations and pruned potential drift points with low probability. We compare our approach to a non-probabilistic method for drift and a probabilistic method for change-point detection. In our experiments, our approach generally yielded improved accuracy and/or speed over these other methods.", "full_text": "A Bayesian Approach to Concept Drift\n\nStephen H. Bach Marcus A. Maloof\n\nDepartment of Computer Science\n\nGeorgetown University\n\nWashington, DC 20007, USA\n\n{bach, maloof}@cs.georgetown.edu\n\nAbstract\n\nTo cope with concept drift, we placed a probability distribution over the location\nof the most-recent drift point. We used Bayesian model comparison to update\nthis distribution from the predictions of models trained on blocks of consecutive\nobservations and pruned potential drift points with low probability. We compare\nour approach to a non-probabilistic method for drift and a probabilistic method\nfor change-point detection. In our experiments, our approach generally yielded\nimproved accuracy and/or speed over these other methods.\n\n1\n\nIntroduction\n\nConsider a classi\ufb01cation task, in which the objective is to assign labels Y to vectors of one or more\nattribute values X. To learn to perform this task, we use training data to model f : X \u2192 Y ,\nthe unknown mapping from attribute values to labels, or target concept, in hopes of maximizing\nclassi\ufb01cation accuracy. A common problem in online classi\ufb01cation tasks is concept drift, which is\nwhen the target concept changes over time. Identifying concept drift is often dif\ufb01cult. 
If the correct label for some x is y1 at time step t1 and y2 at time step t2, does this indicate concept drift or that the training examples are noisy?\nResearchers have approached drift in a number of ways. Schlimmer and Granger [1] searched for candidate models by reweighting training examples according to how well they fit future examples. Some have maintained and modified partially learned models, e.g., [2, 3]. Many have maintained and compared “base” models trained on blocks of consecutive examples to identify those that are the best predictors of new examples, e.g., [4, 5, 6, 7, 8]. We focus on this approach. Such methods directly address the uncertainty about the existence and location of drift.\nWe propose using probability theory to reason about this uncertainty. A probabilistic model of drift offers three main benefits to the research community. First, our experimental results show that a probabilistic model can achieve new combinations of accuracy and speed on classification tasks. Second, probability theory is a well-developed framework that could offer new insights into the problem of concept drift. Third, probabilistic models can easily be combined in a principled way, and their use in the machine-learning field continues to grow [9]. Therefore, our model could readily and correctly share information with other probabilistic models or be incorporated into broader ones.\nIn this paper we present a probabilistic model of the number of most-recent training examples that the active concept describes. Maximum-likelihood estimation would overfit the model by concluding that each training example was generated by a different target concept. This is unhelpful, since it eliminates all generalization from past examples to future predictions. 
Instead, we use Bayesian model comparison [9], or BMC, to reason about the trade-offs between model complexity (i.e., the number of target concepts) and goodness of fit. We first describe BMC and its application to detecting change points. We then describe a Bayesian approach to concept drift. Finally, we show the results of an empirical comparison among our method (pruned and unpruned), BMC for change points, and Dynamic Weighted Majority [5], an ensemble method for concept drift.\n\n2 Bayesian model comparison\n\nBMC uses probability theory to assign degrees of belief to candidate models given observations and prior beliefs [9]. By Bayes’ Theorem, p(M|D) = p(D|M)p(M)/p(D), where M is the set of models under consideration and D is the set of observations. Researchers in Bayesian statistics have used BMC to look for change points in time-series data. The goal of change-point detection is to segment sequences of observations into blocks that are identically distributed and usually assumed to be independent.\n\n2.1 Previous work on Bayesian change-point detection\n\nBarry and Hartigan [10, 11] used product partition models as distributions over possible segmentations of time-series data. Exact inference requires O(n^3) time in the number of observations and may be accurately approximated in O(n) time using Markov sampling [10]. In an online task, approximate training and testing on n observations would require O(n^2) time, since the model must be updated after new training data. These updates would require resampling and testing for convergence.\nFearnhead [12] showed how to perform direct simulation from the posterior distribution of a class of multiple-change-point models. This method requires O(n^2) time and avoids the need to use Markov sampling and to test for convergence. 
Again, an approximate method can be performed in approximately linear time, but the model must be regularly rebuilt in online tasks.\nThe computational costs associated with offline methods make it difficult to apply them to online tasks. Researchers have also looked for online methods for change-point detection. Fearnhead and Liu [13] introduced an online version of Fearnhead’s simulation method [12], which uses particle filtering to quickly update the distribution over change points. Adams and MacKay [14] proposed an alternative method for online Bayesian change-point detection. We now describe it in more detail, since it will be the starting point for our own model.\n\n2.2 A method for online Bayesian change-point detection\n\nAdams and MacKay [14] proposed maintaining a discrete distribution over l_t, the length in time steps of the longest substring of observations that are identically distributed, ending at time step t. This method therefore models the location of only the most recent change point, a cost-saving measure useful for many online problems.\nA conditional prior distribution p(l_t | l_{t-1}) is used, such that\n\n  p(l_t | l_{t-1}) = λ^{-1}      if l_t = 0;\n                     1 - λ^{-1}  if l_t = l_{t-1} + 1;\n                     0           otherwise.  (1)\n\nIn principle, a more sophisticated prior could be used. The crucial aspect is that, given that a substring is identically distributed, it assigns mass to only two outcomes: the next observation is distributed identically to the observations of the substring, or it is the first of a new substring.\nThe algorithm is initialized at time step 0 with a single base model that is the prior distribution over observations. Initially, p(l_0 = 0) = 1. Let D_t be the observation(s) made at time step t. 
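As a concrete illustration, the prior in Equation 1 can be written as a small function (our own sketch, not the authors' code; the name run_length_prior and its arguments are assumptions):

```python
def run_length_prior(l_t, l_prev, lam):
    """Conditional prior p(l_t | l_{t-1}) from Equation 1 (our sketch).

    `lam` plays the role of lambda: the expected run length between
    change points, so 1/lam is the probability a new substring starts.
    """
    if l_t == 0:               # a change point: a new substring starts
        return 1.0 / lam
    if l_t == l_prev + 1:      # the current substring continues
        return 1.0 - 1.0 / lam
    return 0.0                 # every other run length is impossible
```

For λ = 20, for instance, each step starts a new substring with probability 0.05 and continues the current one with probability 0.95.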
At each time step the algorithm computes a new posterior distribution p(l_t | D_{1:t}) by marginalizing out l_{t-1} from\n\n  p(l_t, l_{t-1} | D_{1:t}) = p(D_t | l_t, D_{1:t-1}) p(l_t | l_{t-1}) p(l_{t-1} | D_{1:t-1}) / p(D_t | D_{1:t-1}).  (2)\n\nThis is a straightforward summation over a discrete variable.\nTo find p(l_t, l_{t-1} | D_{1:t}), consider the three components in the numerator. First, p(l_{t-1} | D_{1:t-1}) is the distribution that was calculated at the previous time step. Next, p(l_t | l_{t-1}) is the prior distribution. Since only two outcomes are assigned any mass, each element in p(l_{t-1} | D_{1:t-1}) contributes mass to only two points in the posterior distribution. This keeps the algorithm linear in the size of the ensemble. Finally, p(D_t | l_t, D_{1:t-1}) = p(D_t | D_{t-l_t:t-1}). In other words, it is the predictive probability of a model trained on the observations received from time steps t - l_t to t - 1. The denominator then normalizes the distribution.\nOnce this posterior distribution p(l_t | D_{1:t}) is calculated, each model in the ensemble is trained on the new observation. Then, a new model is initialized with the prior distribution over observations, corresponding to l_{t+1} = 0.\n\n3 Comparing conditional distributions for concept drift\n\nWe propose a new approach to coping with concept drift. Since the objective is to maximize classification accuracy, we want to model the conditional distribution p(Y|X) as accurately as possible. Using [14] as a starting point, we place a distribution over l_t, which now refers to the length in time steps that the currently active concept has been active.\nThere is now an important distinction between BMC for concept drift and BMC for change points: BMC for concept drift models changes in p(Y|X), whereas BMC for change points models changes in the joint distribution p(Y, X). 
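One step of the recursion in Equation 2 can be sketched as follows (a simplified illustration, not the authors' implementation; the function and argument names are our assumptions, and the base models' predictive probabilities are passed in precomputed):

```python
def update_run_length_posterior(post, pred, pred_new, lam):
    """One step of the recursive update in Equation 2 (our sketch).

    post[i]  : p(l_{t-1} = i | D_{1:t-1}), the posterior from the last step
    pred[i]  : p(D_t | l_t = i + 1, D_{1:t-1}), the predictive probability of
               the base model trained on the run of length i ending at t - 1
    pred_new : the predictive probability of a brand-new model (the prior
               over observations), used for the change-point case l_t = 0
    lam      : the lambda of Equation 1; 1/lam is the change probability
    Returns p(l_t | D_{1:t}) as a list indexed by l_t.
    """
    h = 1.0 / lam
    # Change-point case l_t = 0: any previous run length may end here,
    # so its mass sums over all entries of the old posterior.
    new_post = [pred_new * h * sum(post)]
    # Growth case l_t = i + 1: only run length i contributes, surviving
    # with prior mass (1 - 1/lam) and reweighted by its predictive score.
    new_post += [pred[i] * (1.0 - h) * post[i] for i in range(len(post))]
    z = sum(new_post)          # the normalizer p(D_t | D_{1:t-1})
    return [p / z for p in new_post]
```

Each old hypothesis feeds mass to exactly two new ones, which is what keeps the update linear in the ensemble size.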
We use the conditional distribution to look for drift points because we do not wish to react to changes in the marginal distribution p(X). A change point in the joint distribution p(Y, X) could correspond to a change point in p(X), a drift point in p(Y|X), or both. Reacting only to changes in p(Y|X) means that we compare models on their ability to classify unlabeled attribute values, not to generate those values.\nIn other words, we assume that neither the sequence of attribute values X_{1:t} nor the sequence of class labels Y_{1:t} alone provides information about l_t. Therefore p(l_t | l_{t-1}, X_t) = p(l_t | l_{t-1}) and p(l_{t-1} | Y_{1:t-1}, X_{1:t}) = p(l_{t-1} | Y_{1:t-1}, X_{1:t-1}). We also assume that examples from different concepts are independent. We use Equation 1 as the prior distribution p(l_t | l_{t-1}) [14]. Equation 2 is replaced with\n\n  p(l_t, l_{t-1} | Y_{1:t}, X_{1:t}) = p(Y_t | l_t, Y_{1:t-1}, X_{1:t}) p(l_t | l_{t-1}) p(l_{t-1} | Y_{1:t-1}, X_{1:t-1}) / p(Y_t | Y_{1:t-1}, X_{1:t}).  (3)\n\nTo classify unlabeled attribute values X with class label Y, the predictive distribution is\n\n  p(Y|X) = Σ_{i=1}^{t} p(Y | X, Y_{1:t}, X_{1:t}, l_t = i) p(l_t = i).  (4)\n\nWe call this method Bayesian Conditional Model Comparison (BCMC). If left unchecked, the size of its ensemble will grow linearly with the number of observations. In practice, this is far too computationally expensive for many online-learning tasks. We therefore prune the set of models during learning. Let φ be a user-specified threshold for the minimum posterior probability a model must have to remain in the ensemble. Then, if there exists some i such that p(l_t = i | D_{1:t}) < φ < p(l_t = 0 | l_{t-1}), we simply set p(l_t = i | D_{1:t}) = 0 and discard the model p(D | D_{t-i:t}). We call this modified method Pruned Bayesian Conditional Model Comparison (PBCMC).\n\n4 Experiments\n\nWe conducted an empirical comparison using our implementations of PBCMC and BCMC. 
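For concreteness, the pruning rule that turns BCMC into PBCMC can be sketched as follows (our own simplified illustration; the rule's guard φ < p(l_t = 0 | l_{t-1}) is omitted here, and all names are assumptions):

```python
def prune_models(post, models, phi):
    """Simplified sketch of PBCMC's pruning step: drop any run-length
    hypothesis whose posterior mass is below the threshold phi, discard
    its base model, and renormalize what remains.  (The paper's rule also
    requires phi < p(l_t = 0 | l_{t-1}); that guard is omitted here.)
    """
    kept = [(p, m) for p, m in zip(post, models) if p >= phi]
    z = sum(p for p, _ in kept)
    return [p / z for p, _ in kept], [m for _, m in kept]
```

Pruning bounds the ensemble size in practice, at the cost of occasionally discarding a model that would later have gained weight.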
We hypothesized that looking for drift points in the conditional distribution p(Y|X) instead of change points in the joint distribution p(Y, X) would lead to higher accuracy on classification tasks. To test this, we included our implementation of the method of Adams and MacKay [14], which we refer to simply as BMC. It is identical to BCMC, except that it uses Equation 2 to compute the posterior over l_t, where D ≡ (Y, X).\nWe also hypothesized that PBCMC could achieve improved combinations of accuracy and speed compared to Dynamic Weighted Majority (DWM) [5], an ensemble method for concept drift that uses a heuristic weighting scheme and pruning. DWM is a top performer on the problems we considered [5]. Like the other learners, DWM maintains a dynamically sized, weighted ensemble of models trained on blocks of examples. It predicts by taking a weighted-majority vote of the models’ predictions and multiplies the weights of those models that predict incorrectly by a constant β. It then rescales the weights so that the maximum weight is 1. If the algorithm’s global prediction was incorrect, it adds a new model to the ensemble with a weight of 1, and it removes any models with weights below a threshold θ. In the case of models that output probabilities, DWM considers a prediction incorrect if the model did not assign the most probability to the correct label.\n\n4.1 Test problems\n\nWe conducted our experiments using four problems previously used in the literature to evaluate methods for concept drift. The STAGGER concepts [1, 3] are three target concepts in a binary classification task presented over 120 time steps. Attributes and their possible values are shape ∈ {triangle, circle, rectangle}, color ∈ {red, green, blue}, and size ∈ {small, medium, large}. For the first 40 time steps, the target concept is color = red ∧ size = small. 
For the next 40 time steps, the target concept is color = green ∨ shape = circle. Finally, for the last 40 time steps, the target concept is size = medium ∨ size = large. A number of researchers have used this problem to evaluate methods for concept drift [1, 3, 4, 5]. Per the problem’s usual formulation, we evaluated each learner by presenting it with a single, random example at each time step and then testing it on a set of 100 random examples, resampled after each time step. We conducted 50 trials.\nThe SEA concepts [8] are four target concepts in a binary classification task, presented over 50,000 time steps. The target concept changes every 12,500 time steps, and associated with each concept is a single, randomly generated test set of 2,500 examples. At each time step, a learner is presented with a randomly generated example, which has a 10% chance of being labeled as the wrong class. Every 100 time steps, the learner is tested on the active concept’s test set. Each example consists of numeric attributes x_i ∈ [0, 10], for i = 1, 2, 3. The target concepts are hyperplanes, such that y = + if x_1 + x_2 ≤ θ, where θ ∈ {7, 8, 9, 9.5} for each of the four target concepts, respectively; otherwise, y = −. Note that x_3 is an irrelevant attribute. Several researchers have used a shifting hyperplane to evaluate learners for concept drift [2, 5, 6, 7, 8]. We conducted 10 trials. In this experiment, µ_0 = 5.\nThe calendar-apprentice (CAP) data sets [15, 16] come from a personal-scheduling task. Using a subset of 34 symbolic attributes, the task is to predict a user’s preference for a meeting’s location, duration, start time, and day of week. There are 12 attributes for location, 11 for duration, 15 for start time, and 16 for day of week. Each learner was tested on the 1,685 examples for User 1. At each time step, the learner was presented with the next example without its label. 
After classifying it, it was then told the correct label so it could learn.\nThe electricity-prediction data set consists of 45,312 examples collected at 30-minute intervals between 7 May 1996 and 5 December 1998 [17]. The task is to predict whether the price of electricity will go up or down based on five numeric attributes: the day of the week, the 30-minute period of the day, the demand for electricity in New South Wales, the demand in Victoria, and the amount of electricity to be transferred between the two. About 39% of the examples have unknown values for either demand in Victoria or the transfer amount. At each time step, the learner classified the next example in temporal order before being given the correct label and using it to learn. In this experiment, µ_0 = 0.\n\n4.2 Experimental design\n\nWe tested the learning methods on the four problems described. For STAGGER and SEA, we measured accuracy on the test set, then computed average accuracy and 95% confidence intervals at each time step. We also computed the average normalized area under the performance curves (AUC) with 95% confidence intervals. We used the trapezoid rule on adjacent pairs of accuracies and normalized by dividing by the total area of the region. We present both AUC under the entire curve and after the first drift point to show both a learner’s overall performance and its performance after drift occurs. For CAP and electricity prediction, we measured accuracy on the unlabeled observations.\nAll the learning methods used a model we call Bayesian Naive Bayes, or BNB, as their base models. BNB makes the conditionally independent factor assumption (a.k.a. the “naive Bayes” assumption) that the joint distribution p(Y, X) factors into p(Y) Π_{i=1}^{n} p(X_i|Y) [9]. It calculates values for p(Y|X) as needed using Bayes’ Theorem. It takes the Bayesian approach to probabilities (hence the additional “Bayes” in the name), meaning that it places distributions over the parameters that govern the distributions p(Y) and p(X|Y) into which p(Y, X) factors. In our experiments, BNB predicted by marginalizing out the latent parameter variables to compute marginal likelihoods. Note that we use BNB, a generative model over p(Y, X), even though we said that we wish to model p(Y|X) as accurately as possible. This is to ensure a fair comparison with BMC, which needs p(Y, X). We are more interested in the effects of looking for changes in each distribution, not which is a better model for the active concept.\n\nTable 1: Results for (a) the STAGGER concepts and (b) the SEA concepts.\n\n(a) STAGGER concepts\n\nLearner and Parameters        AUC (overall)  AUC (after drift)\nBNB, on each concept          0.912±0.005    0.914±0.007\nPBCMC, λ = 20, φ = 10^-4      0.891±0.005    0.885±0.007\nBCMC, λ = 20                  0.891±0.005    0.885±0.007\nBMC, λ = 50                   0.884±0.005    0.876±0.008\nDWM, β = 0.5, θ = 10^-4       0.878±0.005    0.868±0.007\nBNB, on all examples          0.647±0.008    0.516±0.011\n\n(b) SEA concepts\n\nLearner and Parameters            AUC (overall)  AUC (after drift)\nBNB, on each concept              0.974±0.002    0.974±0.002\nDWM, β = 0.9, θ = 10^-3           0.974±0.001    0.974±0.001\nBCMC, λ = 10,000                  0.970±0.002    0.969±0.002\nPBCMC, λ = 10,000, φ = 10^-4      0.964±0.002    0.961±0.003\nBMC, λ = 200                      0.955±0.003    0.948±0.003\nBNB, on all examples              0.910±0.003    0.889±0.002\n\nIn our experiments, BNB placed Dirichlet distributions [9] over the parameters θ of the multinomial distributions p(Y) and p(X_i|Y) when X_i was a discrete attribute. 
All Dirichlet priors assigned equal density to all valid values of θ. BNB placed Normal-Gamma distributions [9] over the parameters µ and λ of normal distributions p(X_i|Y) when X_i was a continuous attribute: p(µ, λ) = N(µ | µ_0, (βλ)^{-1}) Gam(λ | a, b). The predictive distribution is then a Student’s t-distribution with mean µ and precision λ. In all of our experiments, β = 2 and a = b = 1. The value of µ_0 is specified for each experiment with continuous attributes.\nWe also tested BNB as a control to show the effects of not attempting to cope with drift, and BNB trained using only examples from the active concept (when such information was available) to show possible accuracy given perfect information about drift.\nParameter selection is difficult when evaluating methods for concept drift. Train-test-and-validate methods such as k-fold cross validation are not appropriate because the observations are ordered and not assumed to be identically distributed. We therefore tested each learner on each problem using each of a set of values for each parameter. Due to limited space, we present results for each learning method using the best parameter settings we found. We make no claim that these parameters are optimal, but they are representative of the overall trends we observed. We performed this parameter search for all the learning methods. 
The parameters we tested were λ ∈ {10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000}, φ ∈ {10^-2, 10^-3, 10^-4}, β ∈ {0.25, 0.5, 0.75, 0.9}, and θ ∈ {10^-2, 10^-3, 10^-4, 0}.\n\nTable 2: Accuracy on the CAP and electricity data sets.\n\n               PBCMC    BCMC     BMC      DWM      BNB\nLocation       63.74    63.92    63.15    65.76    62.14\nDuration       63.15    63.03    64.10    66.35    62.37\nStart Time     38.40    39.17    35.19    37.98    32.40\nDay of Week    51.81    51.81    51.22    51.28    51.22\nAverage        54.27    54.48    53.41    55.34    52.03\nElectricity    85.32    85.33    65.37    82.31    62.44\n\nCAP parameters: PBCMC, λ = 10,000, φ = 10^-4; BCMC, λ = 5,000; BMC, λ = 10; DWM, β = 0.75, θ = 10^-4. Electricity parameters: PBCMC, λ = 10, φ = 10^-2; BCMC, λ = 10; BMC, λ = 10; DWM, β = 0.25, θ = 10^-3.\n\n4.3 Results and analysis\n\nTable 1 shows the top results for the STAGGER and SEA concepts. On the STAGGER concepts, PBCMC and BCMC performed almost identically and have a higher mean AUC than BMC, but their 95% confidence intervals overlap. PBCMC and BCMC outperformed DWM. On the SEA concepts, DWM was the top performer, matching the accuracy of BNB trained on each concept and outperforming all the other learning methods. BCMC was next, followed by PBCMC, then BMC, and then BNB.\nTable 2 shows the top results for the CAP and electricity data sets. DWM performed the best on the location and duration data sets, while BCMC performed best on the start-time and day-of-week data sets. PBCMC matched the accuracy of BCMC on the day-of-week and duration data sets and came close to it on the others. DWM had the highest mean accuracy over all four tasks, followed by PBCMC and BCMC, then BMC, and finally BNB. 
BCMC performed the best on the electricity data set, closely followed by PBCMC.\nThe first conclusion is clear: looking for changes in the conditional distribution p(Y|X) led to better accuracy than looking for changes in the joint distribution p(Y, X). With the close exception of the duration problem in the CAP data sets, PBCMC and BCMC outperformed BMC, sometimes dramatically so. What is less clear are the relative merits of PBCMC and DWM. We now analyze these learners to better address this question.\n\n4.3.1 Reactivity versus stability\n\nThe four test problems can be partitioned into two subsets: those on which PBCMC was generally more accurate (STAGGER and electricity) and those on which DWM was (SEA and CAP). We can obtain further insight into what separates these two subsets by noting that both PBCMC and DWM can be said to have “strategies,” which are determined by their parameters. For PBCMC, higher values of λ mean that it will assign less probability initially to new models. For DWM, higher values of β mean that it will penalize models less for making mistakes. For both, lower values of φ and θ respectively mean that they are slower to completely remove poorly performing models from consideration. We can thus interpret these parameters to describe how “reactive” or “stable” the learners are, i.e., the degree to which new observations can alter their hypotheses [4].\nThe two subsets are also partitioned by the strategy that was superior for the problems in each. For both PBCMC and DWM, some of the most reactive parameterizations we tested were optimal on STAGGER and electricity, but some of the most stable were optimal on SEA and CAP. Further, we observed generally stratified results across parameterizations. For each problem, almost all of the parameterizations of the top learner were more accurate than almost all of the parameterizations of the other. 
This indicates that PBCMC was generally better for the concepts that favor reactivity, whereas DWM was generally better for the concepts that favor stability.\n\n4.3.2 Closing the performance gaps\n\nWe now consider why these gaps in performance exist and how they might be closed. Figure 1 shows the average accuracies of PBCMC and DWM at each time step on the STAGGER and SEA concepts. These are for the experiments reported in Table 1, so the parameters, numbers of trials, etc., are the same. We present 95% confidence intervals at selected time steps for both.\n\nFigure 1: Average accuracy on (a) the STAGGER concepts and (b) the SEA concepts. See text for details.\n\nFigure 1 shows that the better performing learners in each problem were faster to react to concept drift. This shows that DWM did not perform better on SEA simply by being more stable whether or not the concept was stable. On the SEA concepts, PBCMC did perform best with the most stable parameterization we tried, but its main problem was that it was not reactive enough when drift occurred.\nWe first consider whether the problem is one of parameter selection. Perhaps we can achieve better performance by using a more reactive parameterization of DWM on certain problems and/or a more stable parameterization of PBCMC on other problems. Our experimental results cast doubt on this proposition. For the problems on which PBCMC was superior, DWM’s best results were not obtained using the most reactive parameterization. In other words, simply using an even more reactive parameterization of DWM did not improve performance on these problems. Further, on the duration problem in the CAP data sets, PBCMC also achieved the reported accuracy using λ = 5000 and φ = 10^-2, and on the location problem it achieved negligibly better accuracy using λ = 5000 and φ = 10^-3 or φ = 10^-4. 
Therefore, simply using an even more stable parameterization of PBCMC\ndid not improve performance on these problems either. BCMC, which is just PBCMC with \u03c6 = 0, did\noutperform PBCMC on SEA. It reacted more quickly than PBCMC did, but not as quickly as DWM\ndid, and at a much greater computational cost, since it had to maintain every model in order to have\nthe one(s) which would eventually gain weight relative to the other models. BCMC also was not a\nsigni\ufb01cant improvement over PBCMC on the location and duration problems.\nWe therefore theorize that the primary reason for the differences in performance between PBCMC\nand DWM is their approaches to updating their ensembles, which determines how they react to drift.\nPBCMC favors reactivity by adding a new model at every time step and decaying the weights of all\nmodels by the degree to which they are incorrect. DWM favors stability by only adding a new model\nafter incorrect overall predictions and only decaying weights of incorrect models, and then only by\na constant factor. This is supported by the results on problems favoring reactive parameterizations\ncompared with the results on problems favoring stable parameterizations. Further, that it is dif\ufb01cult\nto close the performance gaps with better parameter selection suggests that there is a range of re-\nactivity or stability each favors. When parameterized beyond this range, the performance of each\nlearner degrades, or at least plateaus.\nTo further support this theory, we consider trends in ensemble sizes. Figure 2 shows the average\nnumber of models in the ensembles of PBCMC and DWM at each time step on the STAGGER and\nSEA concepts. These are again for the experiments reported in Table 1, and again we present 95%\ncon\ufb01dence intervals at selected time steps for both. The \ufb01gure shows that the trends in ensemble\nsizes were roughly interchanged between the two learners on the two problems. 
On both problems, one learner stayed within a relatively small range of ensemble sizes, whereas the other continued to expand the ensemble when the concept was stable, only significantly pruning soon after drift. On STAGGER, PBCMC expanded its ensemble size far more, whereas DWM did on SEA. This agrees with our expectations for the synthetic concepts. STAGGER contains no noise, whereas SEA does, which complements the designs of the two learners.\n\nFigure 2: Average numbers of models on (a) the STAGGER concepts and (b) the SEA concepts. See text for details.\n\nWhen noise is more likely, DWM will update its ensemble more than when it is not as likely. However, when noise is more likely, PBCMC will usually have difficulty preserving high weights for models which are actually useful. Conversely, PBCMC regularly updates its ensemble, and DWM will have less difficulty maintaining high weights on good models because it only decays weights by a constant factor.\nTherefore, it seems that each learner reaches the boundary of its favored range of reactivity or stability when further changes in that direction cause it either to be so reactive that it often assigns relatively high probability of drift to many time steps for which there was no drift, or so stable that it cannot react to actual drift. On STAGGER, PBCMC matched the performance of BNB on the first target concept (not shown), whereas DWM made more mistakes as it reacted to erroneously inferred drift. 
On SEA, PBCMC needs to be parameterized to be so stable that it cannot react quickly to drift.\n\n5 Conclusion and Future Work\n\nIn this paper we presented a Bayesian approach to coping with concept drift. Empirical evaluations supported our method. We showed that looking for changes in the conditional distribution p(Y|X) led to better accuracy than looking for changes in the joint distribution p(Y, X). We also showed that our Bayesian approach is competitive with one of the top ensemble methods for concept drift, DWM, sometimes beating and sometimes losing to it. Finally, we explored why each method sometimes outperforms the other, showing that PBCMC and DWM each appear to favor a different range of reactivity or stability.\nDirections for future work include integrating the advantages of both PBCMC and DWM into a single learner. Related to this task is a better characterization of their relative advantages and the relationships among them, their favored ranges of reactivity or stability, and the problems to which they are applied. It is also important to note that the more constrained ensemble sizes discussed above correspond to faster classification speeds. Future work could explore how to balance this desideratum with the desire for better accuracy. Finally, another direction is to integrate a Bayesian approach with other probabilistic models. With a useful probabilistic model for concept drift, such as ours, one could potentially incorporate existing probabilistic domain knowledge to guide the search for drift points or build broader models that use beliefs about drift to guide decision making.\n\nAcknowledgments\n\nThe authors wish to thank the anonymous reviewers for their constructive feedback. The authors also wish to thank Lise Getoor and the Department of Computer Science at the University of Maryland, College Park. 
This work was supported by the Georgetown University Undergraduate Research Opportunities Program.\n\nReferences\n\n[1] J. C. Schlimmer and R. H. Granger. Beyond incremental processing: Tracking concept drift. In Proceedings of the Fifth National Conference on Artificial Intelligence, pages 502–507, Menlo Park, CA, 1986. AAAI Press.\n\n[2] G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97–106, New York, NY, 2001. ACM Press.\n\n[3] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23:69–101, 1996.\n\n[4] S. H. Bach and M. A. Maloof. Paired learners for concept drift. In Proceedings of the Eighth IEEE International Conference on Data Mining, pages 23–32, Los Alamitos, CA, 2008. IEEE Press.\n\n[5] J. Z. Kolter and M. A. Maloof. Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research, 8:2755–2790, December 2007.\n\n[6] J. Z. Kolter and M. A. Maloof. Using additive expert ensembles to cope with concept drift. In Proceedings of the Twenty-second International Conference on Machine Learning, pages 449–456, New York, NY, 2005. ACM Press.\n\n[7] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235, New York, NY, 2003. ACM Press.\n\n[8] W. N. Street and Y. Kim. 
A streaming ensemble algorithm (SEA) for large-scale classi\ufb01ca-\ntion. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge\nDiscovery and Data Mining, pages 377\u2013382, New York, NY, 2001. ACM Press.\n\n[9] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, Berlin-Heidelberg, 2006.\n\n[10] D. Barry and J. A. Hartigan. A Bayesian analysis for change point problems. Journal of the\n\nAmerican Statistical Association, 88(421):309\u2013319, 1993.\n\n[11] D. Barry and J. A. Hartigan. Product partition models for change point problems. The Annals\n\nof Statistics, 20(1):260\u2013279, 1992.\n\n[12] Paul Fearnhead. Exact and ef\ufb01cient Bayesian inference for multiple changepoint problems.\n\nStatistics and Computing, 16(2):203\u2013213, 2006.\n\n[13] P. Fearnhead and Z. Liu. On-line inference for multiple changepoint problems. Journal of the\nRoyal Statistical Society: Series B (Statistical Methodology), 69(4):589\u2013605, September 2007.\n\n[14] R.P. Adams and D.J.C. MacKay. Bayesian online changepoint detection. Technical re-\nport, University of Cambridge, 2007. http://www.inference.phy.cam.ac.uk/rpa23/papers/rpa-\nchangepoint.pdf.\n\n[15] A. Blum. Empirical support for winnow and weighted-majority algorithms: Results on a\n\ncalendar scheduling domain. Machine Learning, 26:5\u201323, 1997.\n\n[16] T. M. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski. Experience with a\n\nlearning personal assistant. Communications of the ACM, 37(7):80\u201391, July 1994.\n\n[17] M. Harries, C. Sammut, and K. Horn. Extracting hidden context. Machine Learning,\n\n32(2):101\u2013126, 1998.\n\n9\n\n\f", "award": [], "sourceid": 1334, "authors": [{"given_name": "Stephen", "family_name": "Bach", "institution": null}, {"given_name": "Mark", "family_name": "Maloof", "institution": null}]}