{"title": "Incrementally Learning Time-varying Half-planes", "book": "Advances in Neural Information Processing Systems", "page_first": 920, "page_last": 927, "abstract": null, "full_text": "Incrementally Learning Time-varying Half-planes \n\nAnthony Kuh * \n\nDept. of Electrical Engineering \nUniversity of Hawaii at Manoa \n\nHonolulu, ill 96822 \n\nThomas Petsche t \n\nSiemens Corporate Research \n\n755 College Road East \nPrinceton, NJ 08540 \n\nRonald L. Rivest+ \n\nLaboratory for Computer Science \n\nMIT \n\nCambridge, MA 02139 \n\nAbstract \n\nWe present a distribution-free model for incremental learning when concepts vary \nwith time. Concepts are caused to change by an adversary while an incremental \nlearning algorithm attempts to track the changing concepts by minimizing the \nerror between the current target concept and the hypothesis. For a single half(cid:173)\nplane and the intersection of two half-planes, we show that the average mistake \nrate depends on the maximum rate at which an adversary can modify the concept. \nThese theoretical predictions are verified with simulations of several learning \nalgorithms including back propagation. \n\n1 \n\nINTRODUCTION \n\nThe goal of our research is to better understand the problem of learning when concepts are \nallowed to change over time. For a dichotomy, concept drift means that the classification \nfunction changes over time. We want to extend the theoretical analyses of learning to \ninclude time-varying concepts; to explore the behavior of current learning algorithms in the \nface of concept drift; and to devise tracking algorithms to better handle concept drift. 
In this paper, we briefly describe our theoretical model and then present the results of simulations in which several tracking algorithms, including an on-line version of back-propagation, are applied to time-varying half-spaces. \n\n*kuh@wiliki.eng.hawaii.edu \n†petsche@learning.siemens.com \n‡rivest@theory.lcs.mit.edu \n\nFor many interesting real-world applications, the concept to be learned or estimated is not static, i.e., it can change over time. For example, a speaker's voice may change due to fatigue, illness, stress or background noise (Galletti and Abbott, 1989), as can handwriting. The output of a sensor may drift as the components age or as the temperature changes. In control applications, the behavior of a plant may change over time and require incremental modifications to the model. \n\nHaussler et al. (1987) and Littlestone (1989) have derived bounds on the number of mistakes an on-line learning algorithm will make while learning any concept in a given concept class. However, in that and most other learning theory research, the concept is assumed to be fixed. Helmbold and Long (1991) consider the problem of concept drift, but their results apply to memory-based tracking algorithms while ours apply to incremental algorithms. In addition, we consider different types of adversaries and use different methods of analysis. \n\n2 DEFINITIONS \n\nWe use much the same notation as most learning theory, but we augment many symbols with a subscript to denote time. As usual, X is the instance space and x_t is an instance drawn at time t according to a fixed, arbitrary distribution P_X. The function c_t : X → {0, 1} is the active concept at time t; that is, at time t any instance is labeled according to c_t. The label of the instance is a_t = c_t(x_t). Each active concept c_t is a member of the concept class C. A sequence of active concepts is denoted c. 
At any time t, the tracker uses an algorithm L to generate a hypothesis ĉ_t of the active concept. \n\nWe use a symmetric distance function to measure the difference between two concepts: d(c, c') = P_X[x : c(x) ≠ c'(x)]. \n\nAs we alluded to in the introduction, we distinguish between two types of tracking algorithms. A memory-based tracker stores the most recent m examples and chooses a hypothesis based on those stored examples. Helmbold and Long (1991), for example, use an algorithm that chooses as the hypothesis the concept that minimizes the number of disagreements with the stored labeled examples. An incremental tracker uses only the previous hypothesis and the most recent example to form the new hypothesis. In what follows, we focus on incremental trackers. \n\nThe task for a tracking algorithm is, at each iteration t, to form a \"good\" estimate ĉ_t of the active concept c_t using the sequence of previous examples. Here \"good\" means that the probability of a disagreement between the label predicted by the tracker and the actual label is small. In the time-invariant case, this would mean that the tracker would incrementally improve its hypothesis as it collects more examples. In the time-varying case, however, we introduce an adversary whose task is to change the active concept at each iteration. \n\nGiven the existence of a tracker and an adversary, each iteration of the tracking problem consists of five steps: (1) the adversary chooses the active concept c_t; (2) the tracker is given an unlabeled instance x_t, chosen randomly according to P_X; (3) the tracker predicts a label using the current hypothesis: â_t = ĉ_{t-1}(x_t); (4) the tracker is given the correct label a_t = c_t(x_t); (5) the tracker forms a new hypothesis: ĉ_t = L(ĉ_{t-1}, (x_t, a_t)). 
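The five-step iteration above can be sketched as a generic loop. This is a minimal sketch of the protocol only; the adversary, tracker, and sampler objects are hypothetical stand-ins for the abstract definitions, not an implementation from the paper:

```python
def track(adversary, tracker, sample_instance, T):
    """Run T iterations of the five-step tracking protocol.

    adversary(hypothesis) -> new active concept c_t      (step 1)
    sample_instance()     -> instance x_t drawn from P_X (step 2)
    tracker.predict(x)    -> label under hypothesis      (step 3)
    tracker.update(x, a)  -> form new hypothesis         (step 5)
    Returns the empirical mistake rate.
    """
    mistakes = 0
    for t in range(T):
        concept = adversary(tracker.hypothesis)  # step 1: adversary moves first
        x = sample_instance()                    # step 2: draw x_t ~ P_X
        a_hat = tracker.predict(x)               # step 3: predict with c_hat_{t-1}
        a = concept(x)                           # step 4: receive the true label
        if a_hat != a:                           # conservative: update only on mistakes
            mistakes += 1
            tracker.update(x, a)                 # step 5: form c_hat_t
    return mistakes / T
```

Note that the mistake is counted before the update, matching the observation later in the paper that a conservative tracker is always a step behind the adversary.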
It is clear that an unrestricted adversary can always choose a concept sequence (a sequence of active concepts) that the tracker cannot track. Therefore, it is necessary to restrict the changes that the adversary can induce. In this paper, we require that two subsequent concepts differ by no more than γ, that is, d(c_t, c_{t-1}) ≤ γ for all t. We define the restricted concept sequence space C_γ = {c : c_t ∈ C, d(c_t, c_{t+1}) ≤ γ}. In the following, we are concerned with two types of adversaries: a benign adversary, which causes changes that are independent of the hypothesis; and a greedy adversary, which always chooses a change that will maximize d(c_t, ĉ_{t-1}), constrained by the upper bound. \n\nSince we have restricted the adversary, it seems only fair to restrict the tracker too. We require that a tracking algorithm be: deterministic, i.e., that the process generating the hypotheses be deterministic; prudent, i.e., that the label predicted for an instance be a deterministic function of the current hypothesis: â_t = ĉ_{t-1}(x_t); and conservative, i.e., that the hypothesis is modified only when an example is mislabeled. The restriction that a tracker be conservative rules out algorithms which attempt to predict the adversary's movements, and is the most restrictive of the three. On the other hand, when the tracker does update its hypothesis, there are no restrictions on d(ĉ_t, ĉ_{t-1}). \n\nTo measure performance, we focus on the mistake rate of the tracker. A mistake occurs when the tracker mislabels an instance, i.e., whenever ĉ_{t-1}(x_t) ≠ c_t(x_t). For convenience, we define a mistake indicator function, M(x_t, c_t, ĉ_{t-1}), which is 1 if ĉ_{t-1}(x_t) ≠ c_t(x_t) and 0 otherwise. Note that if a mistake occurs, it occurs before the hypothesis is updated; a conservative tracker is always a step behind the adversary. We are interested in the asymptotic mistake rate, μ = liminf_{t→∞} (1/t) Σ_{i=0}^{t} M(x_i, c_i, ĉ_{i-1}). \n\nFollowing Helmbold and Long (1991), we say that an algorithm (μ, γ)-tracks a sequence space C if, for all c ∈ C_γ and all drift rates γ' not greater than γ, the mistake rate μ' is at most μ. \n\nWe are interested in bounding the asymptotic mistake rate of a tracking algorithm based on the concept class and the adversary. To derive a lower bound on the mistake rate, we hypothesize the existence of a perfect conservative tracker, i.e., one that is always able to guess the correct concept each time it makes a mistake. We say that such a tracker has complete side information (CSI). No conservative tracker can do better than one with CSI. Thus, the mistake rate for a tracker with CSI is a lower bound on the mistake rate achievable by any conservative tracker. \n\nTo upper bound the mistake rate, it is necessary that we hypothesize a particular tracking algorithm when no side information (NSI) is available, that is, when the tracker only knows it mislabeled an instance and nothing else. In our analysis, we study a simple tracking algorithm which modifies the previous hypothesis just enough to correct the mistake. \n\n3 ANALYSIS \n\nWe consider two concept classes in this paper: half-planes and the intersection of two half-planes, both defined by lines in the plane that pass through the origin. We call these classes HS_2 and IHS_2. In this section, we present our analysis for HS_2. \n\nWithout loss of generality, since the lines pass through the origin, we take the instance space to be the circumference of the unit circle. A half-plane in HS_2 is defined by a vector w such that for an instance x, c(x) = 1 if w·x ≥ 0 and c(x) = 0 otherwise. Without loss of generality, as we will show later, we assume that the instances are chosen uniformly. \n\nFigure 1: Markov chain for the greedy adversary and (a) CSI and (b) COVER trackers. 
To begin, we assume a greedy adversary as follows: every time the tracker guesses the correct target concept (that is, ĉ_{t-1} = c_{t-1}), the greedy adversary randomly chooses a vector r orthogonal to w, and at every iteration the adversary rotates w by πγ radians in the direction defined by r. We have shown that a greedy adversary maximizes the asymptotic mistake rate for a conservative tracker, but do not present the proof here. \n\nTo lower bound the achievable error rate, we assume a conservative tracker with complete side information, so that the hypothesis is unchanged if no mistake occurs and is updated to the correct concept otherwise. The state of this system is fully described by d(c_t, ĉ_t) and, for γ = 1/K for some integer K, is modeled by the Markov chain shown in Figure 1a. In each state s_i (labeled i in the figure), d(c_t, ĉ_t) = iγ. The asymptotic mistake rate is equal to the probability of state 0, which is lower bounded by \n\nl(γ) = √(2γ/π) − 2γ/π. \n\nSince l(γ) depends only on γ which, in turn, is defined in terms of the probability measure, the result holds for all distributions. Therefore, since this result applies to the best of all possible conservative trackers, we can say that: \n\nTheorem 1. For HS_2, if d(c_t, c_{t-1}) ≤ γ, then there exists a concept sequence c ∈ C_γ such that the mistake rate μ ≥ l(γ). Equivalently, C_γ is not (γ, μ)-trackable whenever μ < l(γ). \n\nTo upper bound the achievable mistake rate, we must choose a realizable tracking algorithm. We have analyzed the behavior of a simple algorithm we call COVER, which rotates the hypothesis line just far enough to cover the incorrectly labeled instance. Mathematically, if ŵ_t is the hypothesized normal vector at time t and x_t is the mislabeled instance: \n\nŵ_t = ŵ_{t-1} − (x_t · ŵ_{t-1}) x_t.  (1) \n\nIn this case, a mistake in state s_i can lead to a transition to any state s_j for j ≤ i, as shown in Figure 1b. The asymptotic probability of a mistake is the sum of the equilibrium transition probabilities P(s_j | s_i) for all j ≤ i. Solving for these probabilities leads to an upper bound u(γ) on the mistake rate: \n\nu(γ) = √(πγ/2) + γ(2 + π/2). \n\nAgain this depends only on γ, and so is distribution independent, and we can say that: \n\nTheorem 2. For HS_2, for all concept sequences c ∈ C_γ, the mistake rate for COVER satisfies μ ≤ u(γ). Equivalently, C_γ is (γ, μ)-trackable whenever μ ≥ u(γ). \n\nIf the adversary is benign, it is as likely to decrease as to increase the probability of a mistake. Unfortunately, although this makes the task of the tracker easier, it also makes the analysis more difficult. So far, we can show that: \n\nTheorem 3. For HS_2 and a benign adversary, there exists a concept sequence c ∈ C_γ such that the mistake rate μ is O(γ^{2/3}). \n\n4 SIMULATIONS \n\nTo test the predictions of the theory and explore some areas for which we currently have no theory, we have run simulations for a variety of concept classes, adversaries, and tracking algorithms. Here we present the results for single half-planes and the intersection of two half-planes; both greedy and benign adversaries; an ideal tracker; and two types of trackers that use no side information. \n\n4.1 HALF-PLANES \n\nThe simplest concept class we have simulated is the set of all half-planes defined by lines passing through the origin. This is equivalent to the set of classifications realizable with 2-dimensional perceptrons with zero threshold. In other words, if w is the normal vector and x is a point in space, c(x) = 1 if w · x ≥ 0 and c(x) = 0 otherwise. The mistake rate reported for each data point is the average of 1,000,000 iterations. 
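The CSI analysis for the greedy adversary can be checked numerically. The sketch below simulates the process directly in terms of the state d = d(c_t, ĉ_t): the adversary increases d by γ each step, a mistake occurs with probability d, and the CSI tracker resets d to 0 on every mistake. The empirical mistake rate should then track the lower bound l(γ) = √(2γ/π) − 2γ/π for small γ. All names are illustrative:

```python
import math
import random

def csi_mistake_rate(gamma, steps=200000, seed=0):
    """CSI tracker vs. a greedy adversary for half-planes through the origin.

    State d is the distance between the active concept and the hypothesis.
    Each iteration the adversary rotates the concept by pi*gamma radians,
    adding gamma to d; an instance falls in the disagreement region with
    probability d; on a mistake, complete side information resets d to 0.
    """
    rng = random.Random(seed)
    d = 0.0
    mistakes = 0
    for _ in range(steps):
        d += gamma                 # greedy adversary moves away from the hypothesis
        if rng.random() < d:       # instance lands where c_t and c_hat_t disagree
            mistakes += 1
            d = 0.0                # CSI: hypothesis snaps to the true concept
    return mistakes / steps

gamma = 0.001
lower = math.sqrt(2 * gamma / math.pi) - 2 * gamma / math.pi   # l(gamma)
```

For γ = 0.001 the simulated rate comes out within a few percent of l(γ), consistent with the tightness of Theorem 1 reported below for small γ.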
The instances were chosen uniformly from the circumference of the unit circle. We also simulated the ideal tracker using an algorithm called CSI, and tested a tracking algorithm called COVER, which is a simple implementation of the tracking algorithm analyzed in the theory. If a tracker using COVER mislabels an instance, it rotates the normal vector in the plane defined by the normal and the instance so that the instance lies exactly on the new hypothesis line, as described by equation 1. \n\n4.1.1 Greedy adversary \n\nWhenever CSI or COVER makes a mistake and then guesses the concept exactly, the greedy adversary uniformly at random chooses a direction orthogonal to the normal vector of the hyperplane. Whenever COVER makes a mistake and ŵ_t ≠ w_t, the greedy adversary chooses the rotation direction to be in the plane defined by w_t and ŵ_t and orthogonal to w_t. At every iteration, the adversary rotates the normal vector of the hyperplane in the most recently chosen direction so that d(c_t, c_{t+1}) = γ, or equivalently, w_t · w_{t-1} = cos(πγ). \n\nFigure 2 shows that the theoretical lower bound very closely matches the simulation results for CSI when γ is small. For small γ, the simulation results for COVER lie very close to the theoretical predictions for the NSI case. In other words, the bounds predicted in Theorems 1 and 2 are tight, and the mistake rates for CSI and COVER differ by only a factor of π/2. \n\nFigure 2: The mistake rate, μ, as a function of the rate of change, γ, for HS_2 when the adversary is greedy. \n\n4.1.2 Benign adversary \n\nAt every iteration, the benign adversary uniformly at random chooses a direction orthogonal to the normal vector of the hyperplane and rotates the hyperplane in that direction so that d(c_t, c_{t+1}) = γ. Figure 3 shows that CSI behaves as predicted by Theorem 3 with μ = 0.6γ^{2/3}. The figure also shows that COVER performs very well compared to CSI. \n\nFigure 3: The mistake rate, μ, as a function of the rate of change, γ, for HS_2 when the adversary is benign. The line is μ = 0.6γ^{2/3}. \n\n4.2 INTERSECTION OF TWO HALF-PLANES \n\nThe other concept class we consider here is the intersection of two half-spaces defined by lines through the origin. That is, c(x) = 1 if w_1·x ≥ 0 and w_2·x ≥ 0, and c(x) = 0 otherwise. We tested two tracking algorithms using no side information for this concept class. \n\nThe first is a variation on the previous COVER algorithm. For each mislabeled instance: if both half-spaces label x_t differently than c_t(x_t), then the line that is closest in Euclidean distance to x_t is updated according to COVER; otherwise, the half-space labeling x_t differently than c_t(x_t) is updated. \n\nThe second is a feed-forward network with 2 input, 2 hidden, and 1 output nodes. The thresholds of all the neurons and the weights from the hidden to output layers are fixed, i.e., only the input weights can be modified. The output of each neuron is f(u) = (1 + e^{-10u})^{-1}. For classification, the instance was labeled one if the output of the network was greater than 0.5 and zero otherwise. If the difference between the actual and desired outputs was greater than 0.1, back-propagation was run using only the most recent example until the difference was below 0.1. The learning rate was fixed at 0.01 and no momentum was used. Since the model may be updated without making a mistake, this algorithm is not conservative. \n\n4.2.1 Greedy Adversary \n\nAt each iteration, the greedy adversary rotates each hyperplane in a direction orthogonal to its normal vector. Each rotation direction is based on an initial direction chosen uniformly at random from the set of vectors orthogonal to the normal vector. At each iteration, both the normal vector and the rotation vector are rotated πγ/2 radians in the plane they define so that d(c_t, c_{t-1}) = γ for every iteration. Figure 4 shows that the simulations match the predictions well for small γ. Non-conservative back-propagation performs about as well as conservative CSI and slightly better than conservative COVER. \n\nFigure 4: The mistake rate, μ, as a function of the rate of change, γ, for IHS_2 when the adversary is greedy. \n\n4.2.2 Benign Adversary \n\nAt each iteration, the benign adversary uniformly at random chooses a direction orthogonal to w_i and rotates the hyperplane in that direction such that d(c_t, c_{t-1}) = γ. 
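The non-conservative network tracker described above can be sketched as follows. This is a minimal reconstruction from the stated details only (2-2-1 network, f(u) = (1 + e^{-10u})^{-1}, learning rate 0.01, single-example updates repeated until the error drops below 0.1); the particular fixed values of the hidden-to-output weights and the output threshold are assumptions, since the paper does not give them:

```python
import math

LR = 0.01          # learning rate stated in the paper
SLOPE = 10.0       # f(u) = 1 / (1 + exp(-10 u))
V = (1.0, 1.0)     # fixed hidden-to-output weights (assumed values)
THETA = 1.5        # fixed output threshold, roughly an AND gate (assumed value)

def f(u):
    return 1.0 / (1.0 + math.exp(-SLOPE * u))

def forward(W, x):
    """2-2-1 network; hidden thresholds are zero (lines through the origin)."""
    h = [f(W[i][0] * x[0] + W[i][1] * x[1]) for i in range(2)]
    y = f(V[0] * h[0] + V[1] * h[1] - THETA)
    return h, y

def update(W, x, target):
    """Backprop on the single most recent example until |y - target| < 0.1,
    modifying only the input weights W, as described in the paper."""
    h, y = forward(W, x)
    for _ in range(10000):                        # safety cap on the inner loop
        if abs(y - target) < 0.1:
            break
        dy = (y - target) * SLOPE * y * (1 - y)   # output-unit delta
        for i in range(2):
            dh = dy * V[i] * SLOPE * h[i] * (1 - h[i])
            W[i][0] -= LR * dh * x[0]             # gradient step on input weights
            W[i][1] -= LR * dh * x[1]
        h, y = forward(W, x)
    return W
```

Because `update` can be triggered by any output error above 0.1, not just a misclassification, this tracker is not conservative, which is exactly the distinction drawn in the simulations.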
The theory for the benign adversary in this case is not yet fully developed, but Figure 5 shows that the simulations approximate the optimal performance for HS_2 against a benign adversary with c ∈ C_{γ/2}. Non-conservative back-propagation does not perform as well for very small γ, but catches up for γ > 0.001. This is likely due to the particular choice of learning rate. \n\nFigure 5: The mistake rate, μ, as a function of the rate of change, γ, for IHS_2 when the adversary is benign. The dashed line is μ = 0.6(γ/2)^{2/3}. \n\n5 CONCLUSIONS \n\nWe have presented the results of some of our research applied to the problem of tracking time-varying half-spaces. For the classes HS_2 and IHS_2 presented here, simulation results match the theory quite well. For IHS_2, non-conservative back-propagation performs quite well. \n\nWe have extended the theorems presented in this paper to higher-dimensional input vectors and more general geometric concept classes. In Theorem 3, μ ≤ cγ^{2/3} for some constant c, and we are working to find a good value for that constant. We are also working to develop an analysis of non-conservative trackers and to better understand the difference between conservative and non-conservative algorithms. \n\nAcknowledgments \n\nAnthony Kuh gratefully acknowledges the support of the National Science Foundation through grant EET-8857711 and Siemens Corporate Research. Ronald L. Rivest gratefully acknowledges support from NSF grant CCR-8914428, ARO grant N00014-89-J-1988, and a grant from the Siemens Corporation. \n\nReferences \n\nGalletti, I. and Abbott, M. (1989). Development of an advanced airborne speech recognizer for direct voice input. Speech Technology, pages 60-63. \n\nHaussler, D., Littlestone, N., and Warmuth, M. K. (1987). Expected mistake bounds for on-line learning algorithms. (Unpublished). \n\nHelmbold, D. P. and Long, P. M. (1991). Tracking drifting concepts using random examples. In Valiant, L. G. and Warmuth, M. K., editors, Proceedings of the Fourth Annual Workshop on Computational Learning Theory, pages 13-23. Morgan Kaufmann. \n\nLittlestone, N. (1989). Mistake bounds and logarithmic linear-threshold learning algorithms. Technical Report UCSC-CRL-89-11, Univ. of California at Santa Cruz. \n", "award": [], "sourceid": 547, "authors": [{"given_name": "Anthony", "family_name": "Kuh", "institution": null}, {"given_name": "Thomas", "family_name": "Petsche", "institution": null}, {"given_name": "Ronald", "family_name": "Rivest", "institution": null}]}