{"title": "Mistake Bounds for Maximum Entropy Discrimination", "book": "Advances in Neural Information Processing Systems", "page_first": 833, "page_last": 840, "abstract": null, "full_text": " Mistake Bounds\n for Maximum Entropy Discrimination\n\n\n\n Philip M. Long Xinyu Wu\n Center for Computational Learning Systems Department of Computer Science\n Columbia University National University of Singapore\n plong@cs.columbia.edu wuxy@comp.nus.edu.sg\n\n\n\n\n Abstract\n\n We establish a mistake bound for an ensemble method for classification\n based on maximizing the entropy of voting weights subject to margin\n constraints. The bound is the same as a general bound proved for the\n Weighted Majority Algorithm, and similar to bounds for other variants\n of Winnow. We prove a more refined bound that leads to a nearly opti-\n mal algorithm for learning disjunctions, again, based on the maximum\n entropy principle. We describe a simplification of the on-line maximum\n entropy method in which, after each iteration, the margin constraints are\n replaced with a single linear inequality. The simplified algorithm, which\n takes a similar form to Winnow, achieves the same mistake bounds.\n\n\n\n1 Introduction\n\nIn this paper, we analyze a maximum-entropy procedure for ensemble learning in the on-\nline learning model. In this model, learning proceeds in trials. During the tth trial, the\nalgorithm (1) receives xt {0, 1}n (interpreted in this work as a vector of base classifier\npredictions), (2) predicts a class ^\n yt {0, 1}, and (3) discovers the correct class yt. During\ntrial t, the algorithm has access only to information from previous trials.\n\nThe first algorithm we will analyze for this problem was proposed by Jaakkola, Meila\nand Jebara [14]. The algorithm, at each trial t, makes its prediction by taking a weighted\nvote over the predictions of the base classifiers. 
The weight vector pt is the probability distribution over the n base classifiers that maximizes the entropy, subject to the constraint that pt correctly classifies all patterns seen in previous trials with a given margin γ. That is, it maximizes the entropy of pt subject to the constraints that pt · xs ≥ 1/2 + γ whenever ys = 1 for s < t, and pt · xs ≤ 1/2 - γ whenever ys = 0 for s < t.

We show that, if there is a weighting p*, determined with the benefit of hindsight, that achieves margin γ on all trials, then this on-line maximum entropy procedure makes at most (ln n)/(2γ²) mistakes.

Littlestone [19] proved the same bound for the Weighted Majority Algorithm [21], and a similar bound for the Balanced Winnow Algorithm [19]. The original Winnow algorithm was designed to solve the problem of learning a hidden disjunction of a small number k out of a possible n boolean variables. When this problem is reduced to our general setting in the most natural way, the resulting bound is Θ(k² log n), whereas Littlestone proved a bound of ek ln n for Winnow. We prove more refined bounds for a wider family of maximum-entropy algorithms, which use thresholds different from 1/2 (as proposed in [14]) and class-sensitive margins. A mistake bound of ek ln n for learning disjunctions is a consequence of this more refined analysis.

The optimization needed at each round can be cast as minimizing a convex function subject to convex constraints, and thus can be solved in polynomial time [25]. However, the same mistake bounds hold for a similar, albeit linear-time, algorithm. This algorithm, after each trial, replaces all constraints from previous trials with a single linear inequality. (This is analogous to the modification of SVMs leading to the ROMMA algorithm [18].) 
The resulting update is similar in form to Winnow.

Littlestone [19] analyzed some variants of Winnow by showing that mistakes cause a reduction in the relative entropy between the learning algorithm's weight vector and that of the target function. Kivinen and Warmuth [16] showed that an algorithm related to Winnow trades off optimally, in a certain sense, between accommodating the information from new data and keeping the relative entropy between the new and old weight vectors small. Blum [4] identified a correspondence between Winnow and a different application of the maximum entropy principle, in which the algorithm seeks to maximize the average entropy of the conditional distribution over the class designations (the yt's) subject to constraints arising from the examples, as proposed in [2]. Our proofs have a similar structure to the analysis of ROMMA [18]. Our problems fall within the general framework analyzed by Gordon [11]; while Gordon's results expose interesting relationships among learning algorithms, applying them did not appear to be the most direct route to solving our concrete problem, nor did they appear likely to result in the most easily understood proofs. As in related analyses, such as the mistake bounds for the perceptron algorithm [22], Winnow [19] and the Weighted Majority Algorithm [19], our bound holds for any sequence of (xt, yt) pairs satisfying the separation condition; in particular, no independence assumptions are needed. Langford, Seeger and Megiddo [17] performed a related analysis, incomparable in strength, using independence assumptions. Other related papers include [3, 20, 5, 15, 26, 13, 8, 27, 7].

The proofs of our main results do not contain any calculation; they combine simple geometric arguments with established information theory. The proof of the main result proceeds roughly as follows. If there is a mistake on trial t, it is corrected with a large margin by pt+1. 
Thus pt+1 must assign a significantly different probability to the voters predicting 1 on trial t than pt does. By an inequality known as Pinsker's inequality, this means that the relative entropy between pt+1 and pt is large. Next, we exploit the fact that the constraints satisfied by pt, and therefore by pt+1, are convex, to show that moving from pt to pt+1 must take you away from the uniform distribution, thus decreasing the entropy. The theorem then follows from the fact that the entropy can only be reduced by a total of ln n. The refinement leading to an ek ln n bound for disjunctions arises from the observation that Pinsker's inequality can be strengthened when the probabilities being compared are small.

The analysis of this paper lends support to a view of Winnow as a fast, incremental approximation to the maximum entropy discrimination approach, and suggests a variant of Winnow that corresponds more closely to the inductive bias of maximum entropy.

2 Preliminaries

Let n be the number of base classifiers. To avoid clutter, for the rest of the paper, \"probability distribution\" should be understood to mean \"probability distribution over {1, ..., n}.\"

2.1 Margins

For u ∈ [0, 1], define σ(u) to be 1 if u ≥ 1/2, and 0 otherwise. For a feature vector x ∈ {0, 1}^n and a class designation y ∈ {0, 1}, say that a probability distribution p is correct with margin γ if σ(p · x) = y and |p · x - 1/2| ≥ γ. 
If x and y were encountered in a trial of a learning algorithm, we say that p is correct with margin γ on that trial.

2.2 Entropy, relative entropy, and variation

Recall that, for probability distributions p = (p1, ..., pn) and q = (q1, ..., qn),

- the entropy of p, denoted by H(p), is defined by H(p) = Σ_{i=1}^n pi ln(1/pi),

- the relative entropy between p and q, denoted by D(p||q), is defined by D(p||q) = Σ_{i=1}^n pi ln(pi/qi), and

- the variation distance between p and q, denoted by V(p, q), is defined to be the maximum difference between the probabilities that they assign to any set:

 V(p, q) = max_{x ∈ {0,1}^n} (p · x - q · x) = (1/2) Σ_{i=1}^n |pi - qi|. (1)

Relative entropy and variation distance are related by Pinsker's inequality.

Lemma 1 ([23]) For all p and q, D(p||q) ≥ 2 V(p, q)².

2.3 Information geometry

Relative entropy obeys something like the Pythagorean Theorem.

Lemma 2 ([9]) Suppose q is a probability distribution, C is a convex set of probability distributions, and r is the element of C that minimizes D(r||q). Then for any p ∈ C,

 D(p||q) ≥ D(p||r) + D(r||q).

If C can be defined by a system of linear equations, then

 D(p||q) = D(p||r) + D(r||q).

3 Maximum Entropy with Margin

In this section, we will analyze the algorithm OME (\"on-line maximum entropy\") that, at the t-th trial,

- chooses pt to maximize the entropy H(pt), subject to the constraint that it is correct with margin γ on all pairs (xs, ys) seen in the past (with s < t), and

- predicts 1 if and only if pt · xt ≥ 1/2.

In our analysis, we will assume that there is always a feasible pt.

The following is our main result.

Theorem 3 If there is a fixed probability distribution p* that is correct with margin γ on all trials, OME makes at most (ln n)/(2γ²) mistakes.

Proof: We will show that a mistake causes the entropy of the hypothesis to drop by at least 2γ². 
Since the constraints only become more restrictive, the entropy never increases, and so the fact that the entropy lies between 0 and ln n will complete the proof.

Suppose trial t was a mistake. The definition of pt+1 ensures that pt+1 · xt is on the correct side of 1/2 by at least γ. But pt · xt was on the wrong side of 1/2. Thus |pt+1 · xt - pt · xt| ≥ γ. Either pt+1 · xt - pt · xt ≥ γ, or the bitwise complement c(xt) of xt satisfies pt+1 · c(xt) - pt · c(xt) ≥ γ. Thus V(pt+1, pt) ≥ γ. Therefore, Pinsker's Inequality (Lemma 1) implies that

 D(pt+1||pt) ≥ 2γ². (2)

Let Ct be the set of all probability distributions that satisfy the constraints in effect when pt was chosen, and let u = (1/n, ..., 1/n). Since pt+1 is in Ct (it must satisfy the constraints that pt did), Lemma 2 implies D(pt+1||u) ≥ D(pt+1||pt) + D(pt||u), and thus D(pt+1||u) - D(pt||u) ≥ D(pt+1||pt), which, since D(p||u) = (ln n) - H(p) for all p, implies H(pt) - H(pt+1) ≥ D(pt+1||pt). Applying (2), we get H(pt) - H(pt+1) ≥ 2γ². As described above, this completes the proof.

Because H(pt) is always at least H(p*), the same analysis leads to a mistake bound of (ln n - H(p*))/(2γ²). Further, a nearly identical proof establishes the following (details are omitted from this abstract).

Theorem 4 Suppose OME is modified so that p1 is set to be something other than the uniform distribution, and each pt minimizes D(pt||p1) subject to the same constraints. If there is a fixed p* that is correct with margin γ on all trials, the modified algorithm makes at most D(p*||p1)/(2γ²) mistakes.

4 Maximum Entropy for Learning Disjunctions

In this section, we show how the maximum entropy principle can be used to efficiently learn disjunctions.

For a threshold b, define σb(x) to be 1 if x ≥ b and 0 otherwise. 
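In code, the threshold function σb looks as follows; the sketch also illustrates the fact, exploited later in this section, that a uniform weighting over the k variables of a hidden disjunction puts weight at least 1/k on any positive example and weight 0 on any negative one (all names here are our own, not the paper's):

```python
def sigma_b(u, b):
    # sigma_b(u) = 1 if u >= b, else 0.
    return 1 if u >= b else 0

# With p* uniform over the k variables of a hidden disjunction, a positive
# example has at least one relevant variable on (weight >= 1/k on 1), and a
# negative example has none (weight exactly 0 on 1).
k, n = 3, 8
relevant = [0, 1, 2]
p_star = [1.0 / k if i in relevant else 0.0 for i in range(n)]

x_pos = [1, 0, 0, 1, 0, 0, 0, 0]   # satisfies the disjunction via variable 0
x_neg = [0, 0, 0, 1, 1, 0, 0, 0]   # no relevant variable is on

dot_pos = sum(p * x for p, x in zip(p_star, x_pos))
dot_neg = sum(p * x for p, x in zip(p_star, x_neg))
assert dot_pos >= 1.0 / k and dot_neg == 0.0

b = 1.0 / (2 * k)                  # any threshold in (0, 1/k) separates
assert sigma_b(dot_pos, b) == 1 and sigma_b(dot_neg, b) == 0
```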
For a feature vector x ∈ {0, 1}^n and a class designation y ∈ {0, 1}, say that p is correct at threshold b with margin γ if σb(p · x) = y and |p · x - b| ≥ γ.

The algorithm OME_{b,γ+,γ-}, analyzed in this section, on the t-th trial,

- chooses pt to maximize the entropy H(pt), subject to the constraint that it is correct at threshold b with margin γ+ on all pairs (xs, ys) with ys = 1 seen in the past (with s < t), and correct at threshold b with margin γ- on all such pairs (xs, ys) with ys = 0, then

- predicts 1 if and only if pt · xt ≥ b.

Note that the algorithm OME considered in Section 3 can also be called OME_{1/2,γ,γ}.

For p, q ∈ [0, 1], define d(p||q) = D((p, 1 - p)||(q, 1 - q)), often called \"entropic loss.\"

Lemma 5 If there is an x ∈ {0, 1}^n such that p · x = p̂ and q · x = q̂, then D(p||q) ≥ d(p̂||q̂).

Proof: An application of Lagrange multipliers, together with the fact that D is convex [6], implies that D(p||q) is minimized, subject to the constraints that p · x = p̂ and q · x = q̂, when (1) pi is the same for all i with xi = 1, (2) qi is the same for all i with xi = 1, (3) pi is the same for all i with xi = 0, and (4) qi is the same for all i with xi = 0. These four properties, together with the constraints, are enough to uniquely specify p and q. Evaluating D(p||q) in this case gives the result.

Theorem 6 Suppose there is a probability distribution p* that is correct at threshold b, with margin γ+ on all trials t with yt = 1, and with margin γ- on all trials with yt = 0. Then OME_{b,γ+,γ-} makes at most

 ln n / min{d(b + γ+||b), d(b - γ-||b)}

mistakes.

Proof: The outline of the proof is similar to the proof of Theorem 3. We will show that mistakes cause the entropy of the algorithm's hypothesis to decrease.

Arguing as in the proof of Theorem 3, H(pt+1) ≤ H(pt) - D(pt+1||pt). Lemma 5 then implies that

 H(pt+1) ≤ H(pt) - d(pt+1 · xt||pt · xt). 
(3)

If there was a mistake on trial t for which yt = 1, then pt · xt < b and pt+1 · xt ≥ b + γ+. Thus, in this case, d(pt+1 · xt||pt · xt) ≥ d(b + γ+||b). Similarly, if there was a mistake on trial t for which yt = 0, then d(pt+1 · xt||pt · xt) ≥ d(b - γ-||b).

Once again, these two bounds on d(pt+1 · xt||pt · xt), together with (3) and the fact that the entropy is between 0 and ln n, complete the proof.

The analysis of Theorem 6 can also be used to prove bounds for the case in which mistakes of different types have different costs, as considered in [12].

Theorem 6 improves on Theorem 3 even in the case in which γ+ = γ- = γ and b = 1/2. For example, if γ = 1/4, Theorem 6 gives a bound of 7.65 ln n, where Theorem 3 gives an 8 ln n bound.

Next, we apply Theorem 6 to analyze the problem of learning disjunctions.

Corollary 7 If there are k of the n features such that each yt is the disjunction of those features in xt, then the algorithm OME_{1/(ek), 1/k - 1/(ek), 1/(ek)} makes at most ek ln n mistakes.

Proof Sketch: If the target weight vector p* assigns equal weight to each of the variables in the disjunction, then when yt = 1, the weight of variables evaluating to 1 is at least 1/k, and when yt = 0, it is 0. So the hypothesis of Theorem 6 is satisfied with b = 1/(ek), γ+ = 1/k - b and γ- = b. Plugging into Theorem 6, simplifying and overapproximating completes the proof.

To get a more readable, but weaker, variant of Theorem 6, we will use the following bound, implicit in the analysis of Angluin and Valiant [1] (see Theorem 1.1 of [10] for a more explicit proof, and [24] for a closely related bound). It improves on Pinsker's inequality (Lemma 1) when n = 2, p is small, and q is close to p.

Lemma 8 ([1]) If 0 ≤ p ≤ 2q, then d(p||q) ≥ (p - q)²/(3q).

The following is a direct consequence of Lemma 8 and Theorem 6. 
Note that, in the case of disjunctions, it leads to a weaker 6k ln n bound.

Theorem 9 If there is a probability distribution p* that is correct at threshold b with margin γ on all trials, then OME_{b,γ,γ} makes at most (3b ln n)/γ² mistakes.

5 Relaxed on-line maximum entropy algorithms

Let us refer to the halfspace of probability distributions that satisfy the constraint of trial t as Tt, and to the associated separating hyperplane as Jt. Recall that Ct is the set of feasible solutions to all the constraints in effect when pt is chosen.

Figure 1: In ROME, the constraints Ct in effect before the t-th round are replaced by the halfspace St.

So pt+1 maximizes entropy subject to membership in Ct+1 = Tt ∩ Ct.

Our proofs only used the following facts about the OME algorithm: (a) pt+1 ∈ Tt, (b) pt is the maximum entropy member of Ct, and (c) pt+1 ∈ Ct.

Suppose At is the set of weight vectors with entropy at least that of pt. Let Ht be the hyperplane tangent to At at pt. Finally, let St be the halfspace with boundary Ht containing pt+1. (See Figure 1.) Then (a), (b) and (c) hold if Ct is replaced with St. (The least obvious is (b), which follows since Ht is tangent to At at pt, and the entropy function is strictly concave.)

Also, as previously observed by Littlestone [19], the algorithm might just as well not respond to trials in which there is not a mistake. Let us refer to an algorithm that does both of these as a Relaxed On-line Maximum Entropy (ROME) algorithm.

A similar observation regarding an on-line SVM algorithm led to the simple ROMMA algorithm [18]. In that case, it was possible to obtain a simple closed-form expression for the new weight vector. 
Matters are only slightly more complicated here.

Proposition 10 If trial t is a mistake, and q maximizes entropy subject to membership in St ∩ Tt, then q is on the separating hyperplane of Tt.

Proof: Because q and pt both satisfy St, any convex combination of the two satisfies St. Thus, if q were in the interior of Tt, we could find a probability distribution with higher entropy that still satisfies both St and Tt by taking a tiny step from q toward pt. This would contradict the assumption that q is the maximum entropy member of St ∩ Tt.

This implies that the next hypothesis of a ROME algorithm is either on Jt (the separating hyperplane of Tt) only, or on both Jt and Ht (the separating hyperplane of St). The following lemma will enable us to obtain a formula in either case.

Lemma 11 ([9], Theorem 3.1) Suppose q is a probability distribution, and C is a set defined by linear constraints as follows: for an m × n real matrix A and an m-dimensional column vector b, C = {r : Ar = b}. Then, if r is the member of C minimizing D(r||q), there are scalar constants Z, c1, ..., cm such that for all i ∈ {1, ..., n},

 ri = exp(Σ_{j=1}^m cj aj,i) qi / Z.

If the next hypothesis pt+1 of a ROME algorithm is on Ht, then by Lemma 2, it and all other members q of Ht satisfy D(q||u) = D(q||pt) + D(pt||u). Thus, in this case, pt+1 also minimizes D(q||pt) from among the members q of Ht ∩ Jt. 
Thus, Lemma 11 implies that pt+1,i/pt,i is the same for all i with xt,i = 1, and the same for all i with xt,i = 0. This implies that, for ROME_{b,γ+,γ-}, if there was a mistake on trial t,

 pt+1,i = (b + γ+) pt,i / (pt · xt) if xt,i = 1 and yt = 1
 pt+1,i = (1 - (b + γ+)) pt,i / (1 - pt · xt) if xt,i = 0 and yt = 1 (4)
 pt+1,i = (b - γ-) pt,i / (pt · xt) if xt,i = 1 and yt = 0
 pt+1,i = (1 - (b - γ-)) pt,i / (1 - pt · xt) if xt,i = 0 and yt = 0.

Note that this updates the weights multiplicatively, like Winnow and Weighted Majority.

If pt+1 is not on the separating hyperplane of St, then it must maximize entropy subject to membership in Tt alone, and therefore subject to membership in Jt. In this case, Lemma 11 implies

 pt+1,i = (b + γ+) / |{j : xt,j = 1}| if xt,i = 1 and yt = 1
 pt+1,i = (1 - (b + γ+)) / |{j : xt,j = 0}| if xt,i = 0 and yt = 1 (5)
 pt+1,i = (b - γ-) / |{j : xt,j = 1}| if xt,i = 1 and yt = 0
 pt+1,i = (1 - (b - γ-)) / |{j : xt,j = 0}| if xt,i = 0 and yt = 0.

If this is the case, then pt+1 defined as in (5) should be a member of St.

How can we test for membership in St? Evaluating the gradient of H at pt, and simplifying a bit, we can see that

 St = { q : Σ_{i=1}^n qi ln(1/pt,i) ≤ H(pt) }.

Summing up, a way to implement a ROME algorithm with the same mistake bound as the corresponding OME algorithm is to

- try defining pt+1 as in (5), and check whether the resulting pt+1 ∈ St; if so, use it, and

- if not, then define pt+1 as in (4) instead.

Acknowledgements

We are grateful to Tony Jebara and Tong Zhang for helpful conversations, and to an anonymous referee for suggesting a simplification of the proof of Theorem 3.

References

[1] D. Angluin and L. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18(2):155-193, 1979.

[2] A. L. Berger, S. Della Pietra, and V. J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.

[3] D. 
Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6:1-8, 1956.

[4] A. Blum, 2002. http://www-2.cs.cmu.edu/~avrim/ML02/lect0418.txt.

[5] N. Cesa-Bianchi, A. Krogh, and M. Warmuth. Bounds on approximate steepest descent for likelihood maximization in exponential families. IEEE Transactions on Information Theory, 40(4):1215-1218, 1994.

[6] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.

[7] K. Crammer, O. Dekel, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. NIPS, 2003.

[8] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. In COLT, pages 99-115, 2001.

[9] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Annals of Probability, 3:146-158, 1975.

[10] D. P. Dubhashi and A. Panconesi. Concentration of measure for the analysis of randomized algorithms, 1998. Monograph.

[11] G. J. Gordon. Regret bounds for prediction problems. In Proc. 12th Annu. Conf. on Comput. Learning Theory, pages 29-40. ACM Press, New York, NY, 1999.

[12] D. P. Helmbold, N. Littlestone, and P. M. Long. On-line learning with linear loss constraints. Information and Computation, 161(2):140-171, 2000.

[13] M. Herbster and M. K. Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281-309, 2001.

[14] T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. NIPS, 1999.

[15] J. Kivinen and M. Warmuth. Boosting as entropy projection. COLT, 1999.

[16] J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1-63, 1997.

[17] J. Langford, M. Seeger, and N. Megiddo. An improved predictive accuracy bound for averaging classifiers. ICML, pages 290-297, 2001.

[18] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. 
Machine Learning, 46(1-3):361-387, 2002.

[19] N. Littlestone. Mistake Bounds and Logarithmic Linear-threshold Learning Algorithms. PhD thesis, UC Santa Cruz, 1989.

[20] N. Littlestone, P. M. Long, and M. K. Warmuth. On-line learning of linear functions. Computational Complexity, 5:1-23, 1995. Preliminary version in STOC '91.

[21] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212-261, 1994.

[22] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, pages 615-622, 1962.

[23] M. S. Pinsker. Information and Information Stability of Random Variables and Processes. Holden-Day, 1964.

[24] F. Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46(4):1602-1609, 2001.

[25] P. Vaidya. A new algorithm for minimizing convex functions over convex sets. FOCS, pages 338-343, 1989.

[26] T. Zhang. Regularized winnow methods. NIPS, pages 703-709, 2000.

[27] T. Zhang. A sequential approximation bound for some sample-dependent convex optimization problems with applications in learning. COLT, pages 65-81, 2001.
", "award": [], "sourceid": 2729, "authors": [{"given_name": "Philip", "family_name": "Long", "institution": null}, {"given_name": "Xinyu", "family_name": "Wu", "institution": null}]}