{"title": "Class-size Independent Generalization Analsysis of Some Discriminative Multi-Category Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1625, "page_last": 1632, "abstract": null, "full_text": "Class-size Independent Generalization Analsysis\n of Some Discriminative Multi-Category\n Classification Methods\n\n\n\n Tong Zhang\n IBM T.J. Watson Research Center\n Yorktown Heights, NY 10598\n tzhang@watson.ibm.com\n\n\n\n Abstract\n\n We consider the problem of deriving class-size independent generaliza-\n tion bounds for some regularized discriminative multi-category classi-\n fication methods. In particular, we obtain an expected generalization\n bound for a standard formulation of multi-category support vector ma-\n chines. Based on the theoretical result, we argue that the formula-\n tion over-penalizes misclassification error, which in theory may lead to\n poor generalization performance. A remedy, based on a generalization\n of multi-category logistic regression (conditional maximum entropy), is\n then proposed, and its theoretical properties are examined.\n\n\n1 Introduction\n\nWe consider the multi-category classification problem, where we want to find a predictor\np : X Y, where X is a set of possible inputs and Y is a discrete set of possible outputs.\nIn many applications, the output space Y can be extremely large, and may be regarded as\ninfinity for practical purposes. For example, in natural language processing and sequence\nanalysis, the input can be an English sentence, and the output can be a parse or a translation\nof the sentence. For such applications, the number of potential outputs can be exponential\nof the length of the input sentence. 
As another example, in machine-learning based web-page search and ranking, the input is the keywords and the output space consists of all web-pages.

In order to handle such application tasks, from the theoretical point of view, we do not need to assume that the output space Y is finite, so it is crucial to obtain generalization bounds that are independent of the size of Y. For such large-scale applications, one often has a routine that maps each x ∈ X to a subset of candidates GEN(x) ⊆ Y, so that the desired output associated with x belongs to GEN(x). For example, for web-page search, GEN(x) consists of all pages that contain one or more keywords in x. For sequence annotation, GEN(x) may include all annotation sequences that are consistent. Although the set GEN(x) may significantly reduce the size of potential outputs Y, it can still be large. Therefore it is important that our learning bounds are independent of the size of GEN(x).

We consider the general setting of learning in Hilbert spaces since it includes the popular kernel methods. Let our feature space H be a reproducing kernel Hilbert space with dot product "·". For a weight vector w ∈ H, we use the notation ||w||_H^2 = w · w. We associate each possible input/output pair (x, y) ∈ X × Y with a feature vector f_{x,y} ∈ H. Our classifier is characterized by a weight vector w ∈ H, with the following classification rule:

    p_w(x) = arg max_{c ∈ GEN(x)} w · f_{x,c}.    (1)

Note that computational issues are ignored in this paper. In particular, we assume that the above decision can be computed efficiently (either approximately or exactly) even when GEN(x) is large. In practice, this is often possible either by heuristic search or dynamic programming (when GEN(x) has certain local-dependency structures).
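As a concrete illustration, the classification rule (1) can be sketched in a few lines of code. This is a minimal sketch under assumed interfaces: the candidate generator `gen`, the joint feature map `feature`, and the helper `predict` are illustrative names, not from the paper, and the toy candidates and features are invented for the example.

```python
import numpy as np

def predict(w, x, gen, feature):
    """Rule (1): return the candidate c in gen(x) maximizing the score w . f_{x,c}.

    `gen` and `feature` are hypothetical stand-ins; the paper leaves both
    application-specific (e.g. all consistent annotation sequences).
    """
    candidates = gen(x)
    scores = [np.dot(w, feature(x, c)) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Toy example: three candidate outputs with 2-dimensional joint features.
gen = lambda x: ["a", "b", "c"]
feature = lambda x, c: np.array({"a": [1.0, 0.0],
                                 "b": [0.0, 1.0],
                                 "c": [0.5, 0.5]}[c])
w = np.array([0.2, 1.0])
print(predict(w, None, gen, feature))  # prints "b" (scores: a=0.2, b=1.0, c=0.6)
```

In practice the argmax over a large GEN(x) would be computed by heuristic search or dynamic programming, as noted above, rather than by enumeration.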
In this paper, we are only interested in the learning performance, so we will not discuss the computational aspect.

We assume that the input/output pair (x, y) ∈ X × Y is drawn from an unknown underlying distribution D. The quality of the predictor w is measured by some loss function. In this paper, we focus on the expected classification error with respect to D:

    ℓ_D(w) = E_{(X,Y)} I(p_w(X), Y),    (2)

where (X, Y) is drawn from D, and I is the standard 0-1 classification error: I(Y', Y) = 0 when Y' = Y and I(Y', Y) = 1 when Y' ≠ Y.

The general setup we described above is useful for many application problems, and has been investigated, for example, in [2, 6]. The important issue of class-size independent (or weakly dependent) generalization analysis has also been discussed there.

Consider a set of training data S = {(X_i, Y_i) : i = 1, ..., n}, where we assume that for each i, Y_i ∈ GEN(X_i). We would like to find ŵ_S ∈ H such that the classification error ℓ_D(ŵ_S) is as small as possible. This paper studies regularized discriminative learning methods that estimate a weight vector ŵ_S ∈ H by solving the following optimization problem:

    ŵ_S = arg min_{w ∈ H} [ (1/n) Σ_{i=1}^n L(w, X_i, Y_i) + (λ/2) ||w||_H^2 ],    (3)

where λ ≥ 0 is an appropriately chosen regularization parameter, and L(w, X, Y) is a loss function which is convex in w. In this paper, we focus on some loss functions of the following form:

    L(w, X, Y) = Ψ( Σ_{c ∈ GEN(X)\{Y}} φ(w · (f_{X,Y} − f_{X,c})) ),

where Ψ and φ are appropriately chosen real-valued functions.

Typically Ψ is chosen as an increasing function and φ as a decreasing function, selected so that (3) is a convex optimization problem. The intuition behind this method is that the resulting optimization formulation favors large values of w · (f_{X_i,Y_i} − f_{X_i,c}) for all c ∈ GEN(X_i)\{Y_i}.
Therefore, it favors a weight vector w ∈ H such that w · f_{X_i,Y_i} = max_{c ∈ GEN(X_i)} w · f_{X_i,c}, which encourages the correct classification rule in (1). The regularization term (λ/2) ||w||_H^2 is included for capacity control, which has become standard practice in machine learning nowadays.

Two of the most important methods used in practice, multi-category support vector machines [7] and penalized multi-category logistic regression (conditional maximum entropy with Gaussian smoothing [1]), can be regarded as special cases of (3). The purpose of this paper is to study their generalization behavior. In particular, we are interested in generalization bounds that are independent of the size of GEN(X_i).

2 Multi-category Support Vector Machines

We consider the multi-category support vector machine method proposed in [7]. It is a special case of (3) with ŵ_S computed based on the following formula:

    ŵ_S = arg min_{w ∈ H} [ (1/n) Σ_{i=1}^n Σ_{c ∈ GEN(X_i)\{Y_i}} h(w · (f_{X_i,Y_i} − f_{X_i,c})) + (λ/2) ||w||_H^2 ],    (4)

where h(z) = max(1 − z, 0) is the hinge loss used in the standard SVM formulation. From the asymptotic statistical point of view, this formulation has some drawbacks in that there are cases in which the method does not lead to a classifier that achieves the Bayes error [9] (inconsistency). A Bayes consistent remedy has been proposed in [4]. However, the method based on (4) has some attractive properties, and has been successfully used for some practical problems.

We are interested in the generalization performance of (4). As we shall see, this formulation performs very well in the linearly separable (or nearly separable) case. Our analysis also reveals a problem with this method for non-separable problems. Specifically, the formulation over-penalizes classification error.
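The over-penalization can be seen directly from the empirical objective in (4). The following is a minimal sketch, not the paper's implementation: it assumes each training point is encoded as a matrix of precomputed feature differences f_{X_i,Y_i} − f_{X_i,c} (one row per competing candidate), an encoding chosen here purely for illustration.

```python
import numpy as np

def hinge(z):
    # h(z) = max(1 - z, 0), the hinge loss from formulation (4)
    return np.maximum(1.0 - z, 0.0)

def multiclass_svm_objective(w, diffs, lam):
    """Empirical objective of (4).

    diffs: a list over training points; diffs[i] is an array whose rows are
    the feature differences f_{X_i,Y_i} - f_{X_i,c} for the competing
    candidates c in GEN(X_i) \ {Y_i}. This encoding is a hypothetical
    convenience, not from the paper.
    """
    n = len(diffs)
    data_term = sum(hinge(D @ w).sum() for D in diffs) / n
    return data_term + 0.5 * lam * np.dot(w, w)

# One training point with two competitors, both beating the truth by margin 1:
# each contributes h(-1) = 2, so the summed loss is 4 even though the 0-1
# classification error at this point is only 1.
w = np.array([1.0, 0.0])
diffs = [np.array([[-1.0, 0.0], [-1.0, 0.0]])]
print(multiclass_svm_objective(w, diffs, lam=0.0))  # prints 4.0
```

The toy point shows the effect discussed in the text: the sum over c counts a single misclassified point once per violated candidate, so the penalty can grow with |GEN(X_i)|.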
Possible remedies will be suggested at the end of the section.

We start with the following theorem, which specifies a generalization bound in a form often referred to as an oracle inequality. That is, it bounds the generalization performance of the SVM method (4) in terms of the best possible true multi-category SVM loss. The proof is left to Appendix B.

Theorem 2.1 Let M = sup_X sup_{Y,Y' ∈ GEN(X)} ||f_{X,Y} − f_{X,Y'}||_H. The expected generalization error of (4) can be bounded as:

    E_S ℓ_D(ŵ_S) ≤ E_S E_{(X,Y)} sup_{c ∈ GEN(X)\{Y}} h(ŵ_S · (f_{X,Y} − f_{X,c}))
        ≤ [ (max(λn, M^2) + M^2) / (λn) ] · inf_{w ∈ H} [ E_{(X,Y)} Σ_{c ∈ GEN(X)\{Y}} h(w · (f_{X,Y} − f_{X,c})) + (λn / (2(n+1))) ||w||_H^2 ],

where E_S is the expectation with respect to the training data.

Note that the generalization bound does not depend on the size of GEN(X), which is what we want to achieve. The left-hand side of the theorem bounds the classification error of the multi-category SVM classifier in terms of sup_{c ∈ GEN(X)\{Y}} h(ŵ_S · (f_{X,Y} − f_{X,c})), while the right-hand side is in terms of Σ_{c ∈ GEN(X)\{Y}} h(w · (f_{X,Y} − f_{X,c})). There is a mismatch here. The latter is a very loose bound since it over-counts classification errors in the summation when multiple errors are made at the same point. In fact, although the class-size dependency does not come into our generalization analysis, it may well come into the summation term Σ_{c ∈ GEN(X)\{Y}} h(w · (f_{X,Y} − f_{X,c})) when multiple errors are made at the same point. We believe that this is a serious flaw of the method, which we will try to remedy later. However, the bound can be quite tight in the nearly separable case, when Σ_{c ∈ GEN(X)\{Y}} h(ŵ_S · (f_{X,Y} − f_{X,c})) is small.
The following corollary gives such a result:

Corollary 2.1 Assume that there is a large margin separator w ∈ H such that for each data point (X, Y), the following margin condition holds:

    ∀c ∈ GEN(X)\{Y}: w · f_{X,Y} ≥ w · f_{X,c} + 1.

Then in the limit λ → 0, the expected generalization error of (4) can be bounded as:

    E_S ℓ_D(ŵ_S) ≤ (||w||_H^2 / (n+1)) · sup_X sup_{Y,Y' ∈ GEN(X)} ||f_{X,Y} − f_{X,Y'}||_H^2,

where E_S is the expectation with respect to the training data.

Proof. Just choose w on the right-hand side of Theorem 2.1. □

The above result for (4) gives a class-size independent bound for large margin separable problems. The bound generalizes a similar result for the two-class hard-margin SVM. It also matches a bound for the multi-class perceptron in [2]. To our knowledge, this is the first result showing that the generalization performance of a batch large margin algorithm such as (4) can be class-size independent (at least in the separable case). Previous results in [2, 6], relying on covering number analysis, lead to bounds that depend on the size of Y (although the result in [6] is of a different style).

Our analysis also implies that the multi-category classification method (4) has good generalization behavior for separable problems. However, as pointed out earlier, for non-separable problems, the formulation over-penalizes classification error since, in the summation, it may count the classification error at a point multiple times when multiple mistakes are made at that point. A remedy is to replace the summation operator Σ_{c ∈ GEN(X_i)\{Y_i}} in (4) by the sup operator sup_{c ∈ GEN(X_i)\{Y_i}}, as we have used for bounding the classification error on the left-hand side of Theorem 2.1. This is done in [3]. However, like (4), the resulting formulation is also inconsistent. Instead of using a hard-sup operator, we may also use a soft-sup operator, which can possibly lead to consistency.
For example, consider the equality sup_c |h_c| = lim_{p→∞} ( Σ_c |h_c|^p )^{1/p}; we may approximate the right-hand side limit with a large p. Another more interesting formulation is to consider sup_c h_c = lim_{p→∞} p^{-1} ln( Σ_c exp(p h_c) ), which leads to a generalization of the conditional maximum entropy method.

3 Large Margin Discriminative Maximum Entropy Method

Based on the motivation given at the end of the last section, we propose the following generalization of maximum entropy (multi-category logistic regression) with Gaussian prior (see [1]). It introduces a margin parameter γ into the standard maximum entropy formulation, and can be regarded as a special case of (3):

    ŵ_S = arg min_{w ∈ H} [ (1/n) Σ_{i=1}^n (1/p) ln( 1 + Σ_{c ∈ GEN(X_i)\{Y_i}} e^{p(γ − w · (f_{X_i,Y_i} − f_{X_i,c}))} ) + (λ/2) ||w||_H^2 ],    (5)

where γ is a margin parameter, and p > 0 is a scaling factor (which in theory can also be removed by a redefinition of w and γ).

If we choose γ = 0, then this formulation is equivalent to the standard maximum entropy method. If we pick the margin parameter γ = 1, and let p → ∞, then

    (1/p) ln( 1 + Σ_{c ∈ GEN(X_i)\{Y_i}} e^{p(1 − w · (f_{X_i,Y_i} − f_{X_i,c}))} ) → sup_{c ∈ GEN(X_i)\{Y_i}} h(w · (f_{X_i,Y_i} − f_{X_i,c})),

where h(z) = max(0, 1 − z) is used in (4). In this case, the formulation reduces to (4) but with Σ_{c ∈ GEN(X_i)\{Y_i}} replaced by sup_{c ∈ GEN(X_i)\{Y_i}}. As discussed at the end of the last section, this solves the problem of over-counting the classification error.

In general, even with a finite scaling factor p, the log-transform in (5) guarantees that one over-penalizes misclassification error by at most (1/p) ln |GEN(X_i)| at a point, where |GEN(X_i)| is the size of GEN(X_i), while in (4), one may potentially over-penalize |GEN(X_i)| times. Clearly this is a desirable effect for non-separable problems. Methods in (5) have many attractive properties. In particular, we are able to derive class-size independent generalization bounds for this method.
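The soft-sup behavior of the per-example loss in (5) can be checked numerically. Below is a hedged sketch: the feature-difference encoding `D` and the helper names are illustrative assumptions, and the comparison simply demonstrates that the loss of (5) upper-bounds the sup-hinge while exceeding it by at most ln(1 + #competitors)/p.

```python
import numpy as np

def softsup_loss(w, D, gamma, p):
    """Per-example loss of (5): (1/p) ln(1 + sum_c exp(p (gamma - w . d_c))).

    D is a hypothetical encoding: row c holds f_{X,Y} - f_{X,c} for each
    competing candidate c in GEN(X) \ {Y}.
    """
    z = D @ w  # margins w . (f_{X,Y} - f_{X,c})
    # logaddexp.reduce computes a numerically stable log-sum-exp; the
    # leading 0.0 contributes the "1 +" term inside the logarithm.
    return np.logaddexp.reduce(np.concatenate(([0.0], p * (gamma - z)))) / p

def sup_hinge(w, D):
    # sup_c h(w . d_c) with h(z) = max(0, 1 - z): the p -> infinity, gamma = 1 limit
    return np.maximum(1.0 - (D @ w), 0.0).max()

w = np.array([1.0, 0.0])
D = np.array([[-1.0, 0.0], [-1.0, 0.0]])  # two mistakes at a single point
for p in [1.0, 10.0, 100.0]:
    # the gap to the sup-hinge shrinks like ln(1 + 2)/p as p grows
    print(p, softsup_loss(w, D, gamma=1.0, p=p), sup_hinge(w, D))
```

As p increases, the printed loss approaches the sup-hinge value from above, illustrating why the log-transform penalizes a point with multiple mistakes only about once rather than once per mistake.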
The proof of the following theorem is given in Appendix C.

Theorem 3.1 Let M = sup_X sup_{Y,Y' ∈ GEN(X)} ||f_{X,Y} − f_{X,Y'}||_H. Define the loss L(w, x, y) as:

    L(w, x, y) = (1/p) ln( 1 + Σ_{c ∈ GEN(x)\{y}} e^{p(γ − w · (f_{x,y} − f_{x,c}))} ),

and let

    Q = inf_{w ∈ H} [ E_{(X,Y)} L(w, X, Y) + (λn / (2(n+1))) ||w||_H^2 ].

The expected generalization error of (5) can be bounded as:

    E_S E_{(X,Y)} L(ŵ_S, X, Y) ≤ Q + (M^2 / (λn)) (1 − e^{−pQ}),

where E_S is the expectation with respect to the training data.

Theorem 3.1 gives a class-size independent generalization bound for (5). Note that the left-hand side is the true loss of the ŵ_S from (5), and the right-hand side is specified in terms of the best possible regularized true loss Q, plus a penalty term that is no larger than M^2/(λn). It is clear that this generalization bound is class-size independent. Moreover, unlike in Theorem 2.1, the loss function on the left-hand side matches the loss function on the right-hand side of Theorem 3.1. These are not trivial properties. In fact, most learning methods do not have these desirable properties. We believe this is a great advantage of the maximum entropy-type discriminative learning method in (5). It implies that this class of algorithms is suitable for problems with a large number of classes. Moreover, we can see that the generalization performance is well-behaved no matter what values of p and γ we choose.

If we take γ = 0 and p = 1, then we obtain a generalization bound for the popular maximum entropy method with Gaussian prior, which has been widely used in natural language processing applications. To our knowledge, this is the first generalization bound derived for this method. Our result not only shows the importance of Gaussian prior regularization, but also implies that the regularized conditional maximum entropy method has very desirable generalization behavior.

Another interesting special case of (5) is to let γ = 1 and p → ∞.
For simplicity we only consider the case that |GEN(X)| is finite (but can be arbitrarily large). In this case, we note that 0 ≤ L(w, X, Y) − sup_{c ∈ GEN(X)\{Y}} h(w · (f_{X,Y} − f_{X,c})) ≤ (ln |GEN(X)|) / p. We thus obtain from Theorem 3.1 a bound

    E_S E_{(X,Y)} sup_{c ∈ GEN(X)\{Y}} h(ŵ_S · (f_{X,Y} − f_{X,c}))
        ≤ E_X [ln |GEN(X)|] / p + M^2/(λn)
          + inf_{w ∈ H} [ E_{(X,Y)} sup_{c ∈ GEN(X)\{Y}} h(w · (f_{X,Y} − f_{X,c})) + (λ/2) ||w||_H^2 ].

Now we can take a sufficiently large p such that the term E_X [ln |GEN(X)|] / p becomes negligible. Letting p → ∞, the result implies a bound for the SVM method in [3]. For non-separable problems, this bound is clearly superior to the SVM bound in Theorem 2.1 since the right-hand side replaces the summation Σ_{c ∈ GEN(X)\{Y}} by the sup operator sup_{c ∈ GEN(X)\{Y}}. In theory, this satisfactorily solves the problem of over-penalizing misclassification error. Moreover, an advantage over [3] is that for some p, consistency can be achieved. Our analysis also establishes a bridge between the Gaussian smoothed maximum entropy method [1] and the SVM method in [3].

4 Conclusion

We studied the generalization performance of some regularized multi-category classification methods. In particular, we derived a class-size independent generalization bound for a standard formulation of multi-category support vector machines. Based on the theoretical investigation, we showed that this method works well for linearly separable problems. However, it over-penalizes misclassification error, leading to loose generalization bounds in the non-separable case. A remedy, based on a generalization of the maximum entropy method, is proposed. Moreover, we are able to derive class-size independent bounds for the newly proposed formulation, which implies that this class of methods (including the standard maximum entropy) is suitable for classification problems with a very large number of classes.
We showed that in theory, the new formulation provides a satisfactory solution to the problem of over-penalizing misclassification error.

A A general stability bound

The following lemma is essentially a variant of similar stability results for regularized learning systems used in [8, 10]. We include the proof sketch for completeness.

Lemma A.1 Consider a sequence of convex functions L_i(w) for i = 1, 2, .... Define, for k = 1, 2, ...,

    w_k = arg min_w [ Σ_{i=1}^k L_i(w) + (λn/2) ||w||_H^2 ].

Then for all k ≥ 1, there exists a subgradient ∇L_{k+1}(w_{k+1}) (cf. [5]) of L_{k+1} at w_{k+1} such that

    λ w_{k+1} = −(1/n) Σ_{i=1}^{k+1} ∇L_i(w_{k+1}),    ||w_k − w_{k+1}||_H ≤ (1/(λn)) ||∇L_{k+1}(w_{k+1})||_H.

Proof Sketch. The first equality is the first-order condition for the optimization problem [5] for which w_{k+1} is the solution. Now, subtracting this equality at w_k and at w_{k+1}, we have:

    −λn(w_{k+1} − w_k) = ∇L_{k+1}(w_{k+1}) + Σ_{i=1}^k ( ∇L_i(w_{k+1}) − ∇L_i(w_k) ).

Multiplying both sides by w_{k+1} − w_k, we obtain

    −λn ||w_{k+1} − w_k||_H^2 = ∇L_{k+1}(w_{k+1}) · (w_{k+1} − w_k) + Σ_{i=1}^k ( ∇L_i(w_{k+1}) − ∇L_i(w_k) ) · (w_{k+1} − w_k).

Note that (∇L_i(w_{k+1}) − ∇L_i(w_k)) · (w_{k+1} − w_k) = d_{L_i}(w_k, w_{k+1}) + d_{L_i}(w_{k+1}, w_k), where d_L(w, w') = L(w') − L(w) − ∇L(w) · (w' − w) is often called the Bregman divergence of L, which is well known to be non-negative for any convex function L (this claim is also easy to verify by definition). We thus have Σ_{i=1}^k ( ∇L_i(w_{k+1}) − ∇L_i(w_k) ) · (w_{k+1} − w_k) ≥ 0. It follows that

    λn ||w_{k+1} − w_k||_H^2 ≤ −∇L_{k+1}(w_{k+1}) · (w_{k+1} − w_k) ≤ ||∇L_{k+1}(w_{k+1})||_H ||w_{k+1} − w_k||_H.

By canceling the factor ||w_{k+1} − w_k||_H, we obtain the second inequality. □

B Proof Sketch of Theorem 2.1

Consider training samples (X_i, Y_i) for i = 1, ..., n+1. Let w̃_k be the solution of (4) with the training sample (X_k, Y_k) removed from the set (that is, the summation is Σ_{i=1, i≠k}^{n+1}), and let w̃ be the solution of (4) but with the summation Σ_{i=1}^n replaced by Σ_{i=1}^{n+1}. Now, for notational simplicity, we let z_{k,c} = w̃ · (f_{X_k,Y_k} − f_{X_k,c}) for c ∈ GEN(X_k).
It follows from Lemma A.1 that

    λ ||w̃||_H^2 = −(1/n) Σ_{k=1}^{n+1} Σ_{c ∈ GEN(X_k)\{Y_k}} h'(z_{k,c}) z_{k,c},
    ||w̃_k − w̃||_H ≤ −(M/(λn)) Σ_{c ∈ GEN(X_k)\{Y_k}} h'(z_{k,c}),

where h'(·) denotes a subgradient of h(·). Therefore, using the inequality −h'(z) ≤ h(z) − h'(z)z, we have

    sup_{c ∈ GEN(X_k)\{Y_k}} [ h(w̃_k · (f_{X_k,Y_k} − f_{X_k,c})) − h(z_{k,c}) ]
        ≤ ||w̃_k − w̃||_H M
        ≤ −(M^2/(λn)) Σ_{c ∈ GEN(X_k)\{Y_k}} h'(z_{k,c})
        ≤ (M^2/(λn)) Σ_{c ∈ GEN(X_k)\{Y_k}} [ h(z_{k,c}) − h'(z_{k,c}) z_{k,c} ].

Summing over k = 1, ..., n+1, we obtain

    Σ_{k=1}^{n+1} sup_{c ∈ GEN(X_k)\{Y_k}} [ h(w̃_k · (f_{X_k,Y_k} − f_{X_k,c})) − h(z_{k,c}) ]
        ≤ (M^2/(λn)) Σ_{k=1}^{n+1} Σ_{c ∈ GEN(X_k)\{Y_k}} [ h(z_{k,c}) − h'(z_{k,c}) z_{k,c} ]
        = (M^2/(λn)) Σ_{k=1}^{n+1} Σ_{c ∈ GEN(X_k)\{Y_k}} h(z_{k,c}) + M^2 ||w̃||_H^2.

Therefore, given an arbitrary w ∈ H, we have

    Σ_{k=1}^{n+1} sup_{c ∈ GEN(X_k)\{Y_k}} h(w̃_k · (f_{X_k,Y_k} − f_{X_k,c}))
        ≤ (1 + M^2/(λn)) Σ_{k=1}^{n+1} Σ_{c ∈ GEN(X_k)\{Y_k}} h(z_{k,c}) + M^2 ||w̃||_H^2
        ≤ max(1 + M^2/(λn), 2M^2/(λn)) [ Σ_{k=1}^{n+1} Σ_{c ∈ GEN(X_k)\{Y_k}} h(z_{k,c}) + (λn/2) ||w̃||_H^2 ]
        ≤ max(1 + M^2/(λn), 2M^2/(λn)) [ Σ_{k=1}^{n+1} Σ_{c ∈ GEN(X_k)\{Y_k}} h(w · (f_{X_k,Y_k} − f_{X_k,c})) + (λn/2) ||w||_H^2 ],

where the last inequality follows from the optimality of w̃. Now, taking expectation with respect to the training data, we obtain the bound.

C Proof Sketch of Theorem 3.1

Similar to the proof of Theorem 2.1, we consider training samples (X_i, Y_i) for i = 1, ..., n+1. Let w̃_k be the solution of (5) with the training sample (X_k, Y_k) removed from the set (that is, the summation is Σ_{i=1, i≠k}^{n+1}), and let w̃ be the solution of (5) but with the summation Σ_{i=1}^n replaced by Σ_{i=1}^{n+1}.
It follows from Lemma A.1 that

    ||w̃_k − w̃||_H ≤ (1/(λn)) ||∇L(w̃, X_k, Y_k)||_H ≤ (M/(λn)) (1 − e^{−p L(w̃, X_k, Y_k)}).

Therefore

    L(w̃_k, X_k, Y_k) − L(w̃, X_k, Y_k) ≤ (M^2/(λn)) (1 − e^{−p L(w̃, X_k, Y_k)}).

Now, summing over k, we obtain

    (1/(n+1)) Σ_{k=1}^{n+1} L(w̃_k, X_k, Y_k) ≤ (1/(n+1)) Σ_{k=1}^{n+1} L(w̃, X_k, Y_k) + (M^2/(λn)) (1/(n+1)) Σ_{k=1}^{n+1} ( 1 − e^{−p L(w̃, X_k, Y_k)} ).

Taking expectation with respect to the training data, and using the following Jensen's inequality:

    −E_S (1/(n+1)) Σ_{k=1}^{n+1} e^{−p L(w̃, X_k, Y_k)} ≤ −e^{−p E_S (1/(n+1)) Σ_{k=1}^{n+1} L(w̃, X_k, Y_k)},

we obtain

    E_S E_{(X_k,Y_k)} L(w̃_k, X_k, Y_k) ≤ E_S [ Σ_{k=1}^{n+1} L(w̃, X_k, Y_k) / (n+1) ] + (M^2/(λn)) ( 1 − e^{−p E_S Σ_{k=1}^{n+1} L(w̃, X_k, Y_k)/(n+1)} ).

Now, using the fact that E_S Σ_{k=1}^{n+1} L(w̃, X_k, Y_k) ≤ (n+1)Q (which follows from the optimality property of w̃), and noting that u ↦ u + (M^2/(λn))(1 − e^{−pu}) is increasing in u, we obtain the theorem.

References

[1] Stanley Chen and Ronald Rosenfeld. A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8:37–50, 2000.

[2] Michael Collins. Parameter estimation for statistical parsing models: Theory and practice of distribution-free methods. In IWPT, 2001. Available at http://www.ai.mit.edu/people/mcollins/publications.html.

[3] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.

[4] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99:67–81, 2004.

[5] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.

[6] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

[7] J.
Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Royal Holloway, 1998.

[8] Tong Zhang. Leave-one-out bounds for kernel methods. Neural Computation, 15:1397–1437, 2003.

[9] Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225–1251, 2004.

[10] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004. With discussion.
", "award": [], "sourceid": 2657, "authors": [{"given_name": "Tong", "family_name": "Zhang", "institution": null}]}