{"title": "An Apobayesian Relative of Winnow", "book": "Advances in Neural Information Processing Systems", "page_first": 204, "page_last": 210, "abstract": null, "full_text": "An Apobayesian Relative of Winnow \n\nNick Littlestone \n\nNEC Research Institute \n\n4 Independence Way \nPrinceton, NJ 08540 \n\nChris Mesterharm \nNEC Research Institute \n\n4 Independence Way \nPrinceton, NJ 08540 \n\nAbstract \n\nWe study a mistake-driven variant of an on-line Bayesian learn(cid:173)\ning algorithm (similar to one studied by Cesa-Bianchi, Helmbold, \nand Panizza [CHP96]). This variant only updates its state (learns) \non trials in which it makes a mistake. The algorithm makes binary \nclassifications using a linear-threshold classifier and runs in time lin(cid:173)\near in the number of attributes seen by the learner. We have been \nable to show, theoretically and in simulations, that this algorithm \nperforms well under assumptions quite different from those embod(cid:173)\nied in the prior of the original Bayesian algorithm. It can handle \nsituations that we do not know how to handle in linear time with \nBayesian algorithms. We expect our techniques to be useful in \nderiving and analyzing other apobayesian algorithms. \n\n1 \n\nIntroduction \n\nWe consider two styles of on-line learning. In both cases, learning proceeds in a \nsequence of trials. In each trial, a learner observes an instance to be classified, \nmakes a prediction of its classification, and then observes a label that gives the \ncorrect classification. One style of on-line learning that we consider is Bayesian. \nThe learner uses probabilistic assumptions about the world (embodied in a prior \nover some model class) and data observed in past trials to construct a probabilistic \nmodel (embodied in a posterior distribution over the model class). The learner uses \nthis model to make a prediction in the current trial. 
When the learner is told the correct classification of the instance, the learner uses this information to update the model, generating a new posterior to be used in the next trial. \n\nIn the other style of learning that we consider, the attention is on the correctness of the predictions rather than on the model of the world. The internal state of the learner is only changed when the learner makes a mistake (when the prediction fails to match the label). We call such an algorithm mistake-driven. (Such algorithms are often called conservative in the computational learning theory literature.) There is a simple way to derive a mistake-driven algorithm from any on-line learning algorithm (we restrict our attention in this paper to deterministic algorithms). The derived algorithm is just like the original algorithm, except that before every trial, it makes a record of its entire state, and after every trial in which its prediction is correct, it resets its state to match the recorded state, entirely forgetting the intervening trial. (Typically this is actually implemented not by making such a record, but by merely omitting the step that updates the state.) For example, if some algorithm keeps track of the number of trials it has seen, then the mistake-driven version of this algorithm will end up keeping track of the number of mistakes it has made. Whether the original or the mistake-driven algorithm will do better depends on the task and on how the algorithms are evaluated. \n\nWe will start with a Bayesian learning algorithm that we call SBSB and use this procedure to derive a mistake-driven variant, SASB. Note that the variant cannot be expected to be a Bayesian learning algorithm (at least in the ordinary sense), since a Bayesian algorithm would make a prediction that minimizes the Bayes risk based on all the available data, and the mistake-driven variant has forgotten quite a bit. 
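The conversion just described can be sketched as a small driver loop. This is a hypothetical illustration (the TrialCounter learner and the run function are ours, not from the paper): omitting the update step on correct trials has the same effect as recording the state before each trial and restoring it after a correct prediction, so a learner that counts trials becomes one that counts mistakes.

```python
class TrialCounter:
    """Toy deterministic on-line learner: always predicts 1 and counts trials seen."""
    def __init__(self):
        self.t = 0

    def predict(self, x):
        return 1

    def update(self, x, y):
        self.t += 1

def run(learner, trials, mistake_driven=False):
    """Run an on-line learner over (instance, label) trials; return the mistake count.

    With mistake_driven=True the update step is simply omitted on correct trials,
    which is equivalent to saving the state before each trial and restoring it
    whenever the prediction turns out to be correct."""
    mistakes = 0
    for x, y in trials:
        if learner.predict(x) != y:
            mistakes += 1
            learner.update(x, y)
        elif not mistake_driven:
            learner.update(x, y)
    return mistakes
```

Run on the same trial sequence, the original learner's count t equals the number of trials, while the mistake-driven version's count equals the number of mistakes.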
We call such algorithms apobayesian learning algorithms. This name is intended to suggest that they are derived from Bayesian learning algorithms, but are not themselves Bayesian. Our algorithm SASB is very close to an algorithm of [CHP96]. We study its application to different tasks than they do, analyzing its performance when it is applied to linearly separable data as described below. \n\nIn this paper instances will be chosen from the instance space X = {0,1}^n for some n. Thus instances are composed of n boolean attributes. We consider only two-category classification tasks, with predictions and labels chosen from Y = {0,1}. We obtain a bound on the number of mistakes SASB makes that is comparable to bounds for various Winnow family algorithms given in [Lit88,Lit89]. As for those algorithms, the bound holds under the assumption that the points labeled 1 are linearly separable from the points labeled 0, and the bound depends on the size δ of the gap between the two classes. (See Section 3 for a definition of δ.) The mistake bound for SASB is O((1/δ²) log(1/δ)). While this bound has an extra factor of log(1/δ) not present in the bounds for the Winnow algorithms, SASB has the advantage of not needing any parameters. The Winnow family algorithms have parameters, and the algorithms' mistake bounds depend on setting the parameters to values that depend on δ. (Often, the value of δ will not be known by the learner.) We expect the techniques used to obtain this bound to be useful in analyzing other apobayesian learning algorithms. \n\nA number of authors have done related research regarding worst-case on-line loss bounds, including [Fre96,KW95,Vov90]. Simulation experiments involving a Bayesian algorithm and a mistake-driven variant are described in [Lit95]. That paper provides useful background for this paper. Note that our present analysis techniques do not apply to the apobayesian algorithm studied there. 
The closest of the original Winnow family algorithms to SASB appears to be the Weighted Majority algorithm [LW94], which was analyzed for a case similar to that considered in this paper in [Lit89]. One should get a roughly correct impression of SASB if one thinks of it as a version of the Weighted Majority algorithm that learns its parameters. \n\nIn the next section we describe the Bayesian algorithm that we start with. In Section 3 we discuss its mistake-driven apobayesian variant. Section 4 mentions some simulation experiments using these algorithms, and Section 5 is the conclusion. \n\n2 A Bayesian Learning Algorithm \n\nTo describe the Bayesian learning algorithm we must specify a family of distributions over X x Y and a prior over this family of distributions. We parameterize the distributions with parameters (θ_1, ..., θ_{n+1}) chosen from Θ = [0,1]^{n+1}. The parameter θ_{n+1} gives the probability that the label is 1, and the parameter θ_i gives the probability that the ith attribute matches the label. Note that the probability that the ith attribute is 1 given that the label is 1 equals the probability that the ith attribute is 0 given that the label is 0. We speak of this linkage between the probabilities for the two classes as a symmetry condition. With this linkage, the observation of a point from either class will affect the posterior distribution for both classes. It is perhaps more typical to choose priors that allow the two classes to be treated separately, so that the posterior for each class (giving the probability of elements of X conditioned on the label) depends only on the prior and on observations from that class. The symmetry condition that we impose appears to be important to the success of our analysis of the apobayesian variant of this algorithm. 
(Though we impose this condition to derive the algorithm, it turns out that the apobayesian variant can actually handle tasks where this condition is not satisfied.) \n\nWe choose a prior on Θ that gives probability 1 to the set of all elements θ = (θ_1, ..., θ_{n+1}) ∈ Θ for which at most one of θ_1, ..., θ_n does not equal 1/2. The prior is uniform on this set. Note that for any θ in this set only a single attribute has a probability other than 1/2 of matching the label, and thus only a single attribute is relevant. Concentrating on this set turns out to lead to an apobayesian algorithm that can, in fact, handle more than one relevant attribute and that performs particularly well when only a small fraction of the attributes are relevant. \n\nThis prior is related to the familiar Naive Bayes model, which also assumes that the attributes are conditionally independent given the labels. However, in the typical Naive Bayes model there is no restriction to a single relevant attribute and the symmetry condition linking the two classes is not imposed. \n\nOur prior leads to the following algorithm. (The name SBSB stands for \"Symmetric Bayesian Algorithm with Singly-variant prior for Bernoulli distribution.\") \n\nAlgorithm SBSB Algorithm SBSB maintains counts S_i of the number of times each attribute matches the label, a count M of the number of times the label is 1, and a count t of the number of trials. \n\nInitialization S_i ← 0 for i = 1, ..., n; M ← 0; t ← 0 \n\nPrediction Predict 1 given instance (x_1, ..., x_n) if and only if \n\n(M + 1) Σ_{i=1}^n [x_i(S_i + 1) + (1 − x_i)(t − S_i + 1)] / C(t, S_i) > (t − M + 1) Σ_{i=1}^n [(1 − x_i)(S_i + 1) + x_i(t − S_i + 1)] / C(t, S_i), \n\nwhere C(t, S_i) denotes the binomial coefficient t choose S_i. \n\nUpdate M ← M + y, t ← t + 1, and for each i, if x_i = y then S_i ← S_i + 1 
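As a concrete illustration, here is a minimal Python sketch of SBSB and its mistake-driven variant SASB, following the counts and prediction rule above. The class and method names are our own, and the 1/C(t, S_i) posterior weights reflect our reading of the (partly garbled) prediction formula: each such factor is proportional to the posterior probability of the hypothesis that attribute i is the single relevant one.

```python
from math import comb

class SBSB:
    """Sketch of algorithm SBSB: maintains S_i (times attribute i matched the
    label), M (times the label was 1), and t (trials seen)."""
    def __init__(self, n):
        self.n = n
        self.S = [0] * n
        self.M = 0
        self.t = 0

    def predict(self, x):
        # Each attribute votes as if it were the single relevant attribute;
        # 1/C(t, S_i) weights the hypothesis "attribute i is relevant".
        vote1 = sum((x[i] * (self.S[i] + 1) + (1 - x[i]) * (self.t - self.S[i] + 1))
                    / comb(self.t, self.S[i]) for i in range(self.n))
        vote0 = sum(((1 - x[i]) * (self.S[i] + 1) + x[i] * (self.t - self.S[i] + 1))
                    / comb(self.t, self.S[i]) for i in range(self.n))
        return 1 if (self.M + 1) * vote1 > (self.t - self.M + 1) * vote0 else 0

    def update(self, x, y):
        self.M += y
        self.t += 1
        for i in range(self.n):
            if x[i] == y:
                self.S[i] += 1

class SASB(SBSB):
    """Mistake-driven variant: the state changes only on trials with a mistake."""
    def learn(self, x, y):
        if self.predict(x) != y:
            self.update(x, y)
            return 1
        return 0
```

On a short linearly separable sequence in which attribute 1 always equals the label, this sketch makes a couple of early mistakes and then stops updating, since every later prediction is correct.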
\n\n3 An Apobayesian Algorithm \n\nWe construct an apobayesian algorithm by converting algorithm SBSB into a mistake-driven algorithm using the standard conversion given in the introduction. We call the resulting learning algorithm SASB; we have replaced \"Bayesian\" with \"Apobayesian\" in the acronym. \n\nIn the previous section we made assumptions about the generation of the instances and labels that led to SBSB and thence to SASB. These assumptions have served their purpose and we now abandon them. In analyzing the apobayesian algorithm we do not assume that the instances and labels are generated by some stochastic process. Instead we assume that the instance-label pairs in all of the trials are linearly separable, that is, that there exist some w_1, ..., w_n, and c such that for every instance-label pair (x, y) we have Σ_{i=1}^n w_i x_i ≥ c when y = 1 and Σ_{i=1}^n w_i x_i ≤ c when y = 0. We actually make a somewhat stronger assumption, given in the following theorem, which gives our bound for the apobayesian algorithm. \n\nTheorem 1 Suppose that γ_i ≥ 0 and γ'_i ≥ 0 for i = 1, ..., n, and that Σ_{i=1}^n (γ_i + γ'_i) = 1. Suppose that 0 ≤ b_0 < b_1 ≤ 1 and let δ = b_1 − b_0. Suppose that algorithm SASB is run on a sequence of trials such that the instance x and label y in each trial satisfy Σ_{i=1}^n γ_i x_i + γ'_i(1 − x_i) ≤ b_0 if y = 0 and Σ_{i=1}^n γ_i x_i + γ'_i(1 − x_i) ≥ b_1 if y = 1. Then the number of mistakes made by SASB will be bounded by O((1/δ²) log(1/δ)). \n\nWe have space to say only a little about how the derivation of this bound proceeds. Details are given in [Lit96]. \n\nIn analyzing SASB we work with an abstract description of the associated algorithm SBSB. This algorithm starts with a prior on Θ as described above. We represent this with a density P_0. 
Then after each trial it calculates a new posterior density P_t(θ) = P_{t−1}(θ)P(x, y|θ) / ∫ P_{t−1}(θ')P(x, y|θ') dθ', where P_t is the density after trial t and P(x, y|θ) is the conditional probability of the instance x and label y observed in trial t given θ. Thus we can think of the algorithm as maintaining a current distribution on Θ that is initially the prior. SASB is similar, but it leaves the current distribution unchanged when a mistake is not made. For there to exist a finite mistake bound there must exist some possible choice for the current distribution for which SASB would make perfect predictions, should it ever arrive at that distribution. We call any such distribution leading to perfect predictions a possible target distribution. It turns out that the separability condition given in Theorem 1 guarantees that a suitable target distribution exists. The analysis proceeds by showing that for an appropriate choice of a target density p the relative entropy of the current distribution with respect to the target distribution, ∫ p(θ) log(p(θ)/P_t(θ)) dθ, decreases by at least some amount R > 0 whenever a mistake is made. Since the relative entropy is never negative, the number of mistakes is bounded by the initial relative entropy divided by R. This form of analysis is very similar to the analysis of the various members of the Winnow family in [Lit89,Lit91]. \n\nThe same technique can be applied to other apobayesian algorithms. The abstract update of P_t given above is quite general. The success of the analysis depends on conditions on P_0 and P(x, y|θ) that we do not have space here to discuss. \n\n[Figure 1: Comparison of SASB with SBSB. Two panels (left: p = 0.01, k = 1, n = 20; right: p = 0.1, k = 5, n = 20) plot the cumulative number of mistakes (0 to 250) against the number of trials (0 to 10000) for the curves \"Optimal\", \"SBSB\", \"SASB\", and \"SASB + voting\".] \n\n4 Simulation Experiments \n\nThe bound of the previous section was for perfectly linearly-separable data. We have also done some simulation experiments exploring the performance of SASB on non-separable data and comparing it with SBSB and with various other mistake-driven algorithms. A sample comparison of SASB with SBSB is shown in Figure 1. In each experimental run we generated 10000 trials with the instances and labels chosen randomly according to a distribution specified by θ_1 = ··· = θ_k = 1 − p, θ_{k+1} = ··· = θ_{n+1} = 0.5, where θ_1, ..., θ_{n+1} are interpreted as specified in Section 2, n is the number of attributes, and n, p, and k are as specified at the top of each plot. The line labeled \"Optimal\" shows the performance obtained by an optimal predictor that knows the distribution used to generate the data ahead of time, and thus does not need to do any learning. The lines labeled \"SBSB\" and \"SASB\" show the performance of the corresponding learning algorithms. The lines labeled \"SASB + voting\" show the performance of SASB with the addition of a voting procedure described in [Lit95]. This procedure improves the asymptotic mistake rate of the algorithms. Each line on the graph is the average of 30 runs. 
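The data-generation scheme just described can be sketched as follows; the function name and the use of Python's random module are our own, with the class probability θ_{n+1} fixed at 0.5 as in the experiments.

```python
import random

def generate_trial(n, k, p, rng):
    """Draw one (instance, label) pair from the symmetric distribution of Section 2:
    P(label = 1) = 0.5, the first k attributes each match the label with
    probability 1 - p, and the remaining n - k match with probability 0.5."""
    y = rng.randint(0, 1)
    x = [y if rng.random() < (1 - p if i < k else 0.5) else 1 - y
         for i in range(n)]
    return x, y

# For example, the right-hand panel of Figure 1 uses n = 20, k = 5, p = 0.1:
rng = random.Random(0)
trials = [generate_trial(20, 5, 0.1, rng) for _ in range(10000)]
```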
Each line plots the cumulative number of mistakes made by the algorithm from the beginning of the run as a function of the number of trials. \n\nIn the left-hand plot, there is only 1 relevant attribute. This is exactly the case that SBSB is intended for, and it does better than SASB. In the right-hand plot, there are 5 relevant attributes; SBSB appears unable to take advantage of the extra information present in the extra relevant attributes, but SASB successfully does. \n\nComparison of SASB and previous Winnow family algorithms is still in progress, and we defer presenting details until a clearer picture has been obtained. SASB and the Weighted Majority algorithm often perform similarly in simulations. Typically, as one would expect, the Weighted Majority algorithm does somewhat better than SASB when its parameters are chosen optimally for the particular learning task, and worse for bad choices of parameters. \n\n5 Conclusion \n\nOur mistake bounds and simulations suggest that SASB may be a useful alternative to the existing algorithms in the Winnow family. Based on the analysis style and the bounds, SASB should perhaps itself be considered a Winnow family algorithm. Further experiments are in progress comparing SASB with Winnow family algorithms run with a variety of parameter settings. \n\nPerhaps of even greater interest is the potential application of our analytic techniques to a variety of other apobayesian algorithms (though as we have observed earlier, the techniques do not appear to apply to all such algorithms). We have already obtained some preliminary results regarding an interpretation of the Perceptron algorithm as an apobayesian algorithm. We are interested in looking for entirely new algorithms that can be derived in this way and also in better understanding the scope of applicability of our techniques. 
All of the analyses that we have looked at depend on symmetry conditions relating the probabilities for the two classes. It would be of interest to see what can be said when such symmetry conditions do not hold. In simulation experiments [Lit95], a mistake-driven variant of the standard Naive Bayes algorithm often does very well, despite the absence of such symmetry in the prior that it is based on. \n\nOur simulation experiments and also the analysis of the related algorithm Winnow [Lit91] suggest that SASB can be expected to handle some instance-label pairs inside of the separating gap or on the wrong side, especially if they are not too far on the wrong side. In particular it appears to be able to handle data generated according to the distributions on which SBSB is based, which do not in general yield perfectly separable data. \n\nIt is of interest to compare the capabilities of the original Bayesian algorithm with the derived apobayesian algorithm. When the data is stochastically generated in a manner consistent with the assumptions behind the original algorithm, the original Bayesian algorithm can be expected to do better (see, for example, Figure 1). On the other hand, the apobayesian algorithm can handle data beyond the capabilities of the original Bayesian algorithm. For example, in the case we consider, the apobayesian algorithm can take advantage of the presence of more than one relevant attribute, even though the prior behind the original Bayesian algorithm assumes a single relevant attribute. Furthermore, as for all of the Winnow family algorithms, the mistake bound for the apobayesian algorithm does not depend on details of the behavior of the irrelevant attributes (including redundant attributes). \n\nInstead of using the apobayesian variant, one might try to construct a Bayesian learning algorithm for a prior that reflects the actual dependencies among the attributes and the labels. 
However, it may not be clear what the appropriate prior is. It may be particularly unclear how to model the behavior of the irrelevant attributes. Furthermore, such a Bayesian algorithm may end up being computationally expensive. For example, attempting to keep track of correlations among all pairs of attributes may lead to an algorithm that needs time and space quadratic in the number of attributes. On the other hand, if we start with a Bayesian algorithm that uses time and space linear in the number of attributes, we can obtain an apobayesian algorithm that still uses linear time and space but that can handle situations beyond the capabilities of the original Bayesian algorithm. \n\nAcknowledgments This paper has benefited from discussions with Adam Grove. \n\nReferences \n\n[CHP96] Nicolo Cesa-Bianchi, David P. Helmbold, and Sandra Panizza. On Bayes methods for on-line boolean prediction. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 314-324, 1996. \n\n[Fre96] Yoav Freund. Predicting a binary sequence almost as well as the optimal biased coin. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 89-98, 1996. \n\n[KW95] J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. In Proc. 27th ACM Symp. on Theory of Computing, pages 209-218, 1995. \n\n[Lit88] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1988. \n\n[Lit89] N. Littlestone. Mistake Bounds and Logarithmic Linear-threshold Learning Algorithms. PhD thesis, Tech. Rept. UCSC-CRL-89-11, Univ. of Calif., Santa Cruz, 1989. \n\n[Lit91] N. Littlestone. Redundant noisy attributes, attribute errors, and linear-threshold learning using Winnow. In Proc. 4th Annu. Workshop on Comput. 
Learning Theory, pages 147-156. Morgan Kaufmann, San Mateo, CA, 1991. \n\n[Lit95] N. Littlestone. Comparing several linear-threshold learning algorithms on tasks involving superfluous attributes. In Proceedings of the XII International Conference on Machine Learning, pages 353-361, 1995. \n\n[Lit96] N. Littlestone. Mistake-driven Bayes sports: Bounds for symmetric apobayesian learning algorithms. Technical report, NEC Research Institute, Princeton, NJ, 1996. \n\n[LW94] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212-261, 1994. \n\n[Vov90] Volodimir G. Vovk. Aggregating strategies. In Proceedings of the 1990 Workshop on Computational Learning Theory, pages 371-383, 1990. \n", "award": [], "sourceid": 1194, "authors": [{"given_name": "Nick", "family_name": "Littlestone", "institution": null}, {"given_name": "Chris", "family_name": "Mesterharm", "institution": null}]}