{"title": "Classification by Pairwise Coupling", "book": "Advances in Neural Information Processing Systems", "page_first": 507, "page_last": 513, "abstract": "", "full_text": "Classification by Pairwise Coupling \n\nTREVOR HASTIE * \nStanford University \n\nand \n\nROBERT TIBSHIRANI † \nUniversity of Toronto \n\nAbstract \n\nWe discuss a strategy for polychotomous classification that involves estimating class probabilities for each pair of classes, and then coupling the estimates together. The coupling model is similar to the Bradley-Terry method for paired comparisons. We study the nature of the class probability estimates that arise, and examine the performance of the procedure on simulated datasets. The classifiers used include linear discriminants and nearest neighbors; application to support vector machines is also briefly described. \n\n1 Introduction \n\nWe consider the discrimination problem with K classes and N training observations. The training observations consist of predictor measurements x = (x_1, x_2, ..., x_p) on p predictors and the known class memberships. Our goal is to predict the class membership of an observation with predictor vector x_0. \nTypically K-class classification rules tend to be easier to learn for K = 2 than for K > 2 - only one decision boundary requires attention. Friedman (1996) suggested the following approach for the K-class problem: solve each of the two-class problems, and then, for a test observation, combine all the pairwise decisions to form a K-class decision. Friedman's combination rule is quite intuitive: assign to the class that wins the most pairwise comparisons. \n\n* Department of Statistics, Stanford University, Stanford, California 94305; trevor@playfair.stanford.edu \n† Department of Preventive Medicine and Biostatistics, and Department of Statistics; tibs@utstat.toronto.edu \n\n
Friedman points out that this rule is equivalent to the Bayes rule when the class posterior probabilities p_i (at the test point) are known: \n\nargmax_i [p_i] = argmax_i [ sum_{j != i} I( p_i/(p_i + p_j) > p_j/(p_i + p_j) ) ] \n\nNote that Friedman's rule requires only an estimate of each pairwise decision. Many (pairwise) classifiers provide not only a rule, but estimated class probabilities as well. In this paper we argue that one can improve on Friedman's procedure by combining the pairwise class probability estimates into a joint probability estimate for all K classes. \n\nThis leads us to consider the following problem. Given a set of events A_1, A_2, ..., A_K, some experts give us pairwise probabilities r_ij = Prob(A_i | A_i or A_j). Is there a set of probabilities p_i = Prob(A_i) that are compatible with the r_ij? \nIn an exact sense, the answer is no. Since Prob(A_i | A_i or A_j) = p_i/(p_i + p_j) and sum_i p_i = 1, we are requiring that K - 1 free parameters satisfy K(K-1)/2 constraints, and this will not have a solution in general. For example, if the r_ij are the ijth entries in the matrix \n\n(  .   0.9  0.4 \n  0.1   .   0.7 \n  0.6  0.3   .  )     (1) \n\nthen they are not compatible with any p_i's. This is clear since r_12 > .5 and r_23 > .5, but also r_31 > .5. \nThe model Prob(A_i | A_i or A_j) = p_i/(p_i + p_j) forms the basis for the Bradley-Terry model for paired comparisons (Bradley & Terry 1952). In this paper we fit this model by minimizing a Kullback-Leibler distance criterion to find the best approximation mu_ij = p_i/(p_i + p_j) to a given set of r_ij's. We carry this out at each predictor value x, and use the estimated probabilities to predict class membership at x. \nIn the example above, the solution is p = (0.47, 0.25, 0.28). This solution makes qualitative sense since event A_1 \"beats\" A_2 by a larger margin than the winner of any of the other pairwise matches. 
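The coupling computation sketched above can be made concrete in a few lines. The following is an illustrative sketch, not the authors' code: it applies Bradley-Terry-style multiplicative updates (scale each p_i by the ratio of observed to fitted pairwise probabilities, then renormalize) to the incompatible matrix (1), assuming an equal weight for every pair; the function name `pairwise_couple` and all implementation details are our own.

```python
import numpy as np

def pairwise_couple(r, max_iter=1000, tol=1e-10):
    """Couple pairwise probabilities r[i, j] ~ Prob(A_i | A_i or A_j)
    into class probabilities p, so that mu[i, j] = p[i] / (p[i] + p[j])
    approximates r[i, j]. Equal pair weights are assumed for simplicity."""
    K = r.shape[0]
    p = np.full(K, 1.0 / K)              # uniform starting guess
    for _ in range(max_iter):
        p_old = p.copy()
        for i in range(K):
            mu_i = p[i] / (p[i] + p)     # current mu[i, j] for all j
            num = r[i].sum() - r[i, i]   # observed:  sum_{j != i} r[i, j]
            den = mu_i.sum() - mu_i[i]   # fitted:    sum_{j != i} mu[i, j]
            p[i] *= num / den            # multiplicative update
            p /= p.sum()                 # renormalize
        if np.abs(p - p_old).max() < tol:
            break
    return p

# The incompatible matrix (1): r_12 = 0.9, r_13 = 0.4, r_23 = 0.7, etc.
r = np.array([[0.0, 0.9, 0.4],
              [0.1, 0.0, 0.7],
              [0.6, 0.3, 0.0]])
print(pairwise_couple(r))
```

Run on matrix (1), the iteration settles close to the solution quoted in the text, p = (0.47, 0.25, 0.28), with class 1 receiving the largest coupled probability.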
\n\nFigure 1 shows an example of these procedures in action. There are 600 data points in three classes, each class generated from a mixture of Gaussians. A linear discriminant model was fit to each pair of classes, giving pairwise probability estimates r_ij at each x. The first panel shows Friedman's procedure applied to the pairwise rules. The shaded regions are areas of indecision, where each class wins one vote. The coupling procedure described in the next section was then applied, giving class probability estimates p(x) at each x. The decision boundaries resulting from these probabilities are shown in the second panel. The procedure has done a reasonable job of resolving the confusion, in this case producing decision boundaries similar to the three-class LDA boundaries shown in panel 3. The numbers in parentheses above the plots are test-error rates based on a large test sample from the same population. Notice that despite the indeterminacy, the max-wins procedure performs no worse than the coupling procedure, and both perform better than LDA. Later we show an example where the coupling procedure does substantially better than max-wins. \n\n[Figure 1 panels: Pairwise LDA + Max (0.132); Pairwise LDA + Coupling (0.136); 3-Class LDA (0.213)] \n\nFigure 1: A three class problem, with the data in each class generated from a mixture of Gaussians. The first panel shows the max-wins procedure. The second panel shows the decision boundary from coupling of the pairwise linear discriminant rules based on d in (6). The third panel shows the three-class LDA boundaries. Test-error rates are shown in parentheses. \n\nThis paper is organized as follows. The coupling model and algorithm are given in section 2. Pairwise threshold optimization, a key advantage of the pairwise approach, is discussed in section 3. 
In that section we also examine the performance of the various methods on some simulated problems, using both linear discriminant and nearest neighbour rules. The final section contains some discussion. \n\n2 Coupling the probabilities \n\nLet the probabilities at feature vector x be p(x) = (p_1(x), ..., p_K(x)). In this section we drop the argument x, since the calculations are done at each x separately. We assume that for each i != j, there are n_ij observations in the training set and from these we have estimated conditional probabilities r_ij = Prob(i | i or j). \n\nOur model is \n\nn_ij r_ij ~ Binomial(n_ij, mu_ij), where mu_ij = p_i / (p_i + p_j), (2) \n\nor equivalently \n\nlog mu_ij = log(p_i) - log(p_i + p_j), (3) \n\na log-nonlinear model. \nWe wish to find p_i's so that the mu_ij's are close to the r_ij's. There are K - 1 independent parameters but K(K-1)/2 equations, so it is not possible in general to find p_i's so that mu_ij = r_ij for all i, j. \nTherefore we must settle for mu_ij's that are close to the observed r_ij's. Our closeness criterion is the average (weighted) Kullback-Leibler distance between r_ij and mu_ij: \n\nl(p) = sum_{i<j} n_ij [ r_ij log(r_ij/mu_ij) + (1 - r_ij) log((1 - r_ij)/(1 - mu_ij)) ], (4) \n\nand we find p to minimize this function. \n\nThis model and criterion is formally equivalent to the Bradley-Terry model for preference data. One observes a proportion r_ij of n_ij preferences for item i, and the sampling model is binomial, as in (2). If each of the r_ij were independent, then minimizing l(p) would be equivalent to maximizing the log-likelihood under this model. However our r_ij are not independent, as they share a common training set and were obtained from a common set of classifiers. Furthermore the binomial models do not apply in this case; the r_ij are evaluations of functions at a point, and the randomness arises in the way these functions are constructed from the training data. 
We include the n_ij as weights in (4); this is a crude way of accounting for the different precisions in the pairwise probability estimates. \n\nThe score (gradient) equations are: \n\nsum_{j != i} n_ij mu_ij = sum_{j != i} n_ij r_ij, i = 1, 2, ..., K, (5) \n\nsubject to sum_i p_i = 1. We use the following iterative procedure to compute the p_i's: \n\nAlgorithm \n\n1. Start with some guess for the p_i, and corresponding mu_ij. \n\n2. Repeat (i = 1, 2, ..., K, 1, ...) until convergence: \n\np_i <- p_i * ( sum_{j != i} n_ij r_ij ) / ( sum_{j != i} n_ij mu_ij ); \nrenormalize the p_i, and recompute the mu_ij. \n\nThe algorithm also appears in Bradley & Terry (1952). The updates in step 2 attempt to modify p so that the sufficient statistics match their expectation, but go only part of the way. We prove in Hastie & Tibshirani (1996) that l(p) decreases at each step. Since l(p) is bounded below by zero, the procedure converges. At convergence, the score equations are satisfied, and the mu_ij's and p are consistent. \nThis algorithm is similar in flavour to the Iterative Proportional Scaling (IPS) procedure used in log-linear models. IPS has a long history, dating back to Deming & Stephan (1940). Bishop, Fienberg & Holland (1975) give a modern treatment and many references. \n\nThe resulting classification rule is \n\nd(x) = argmax_i [ p_i(x) ]. (6) \n\nFigure 2 shows another example similar to Figure 1, where we can compare the performance of the max-wins rule and the coupling rule d in (6). The hatched area in the top left panel is an indeterminate region where there is more than one class achieving max_i(p_i). In the top right panel the coupling procedure has resolved this indeterminacy in favor of class 1 by weighting the various probabilities. See the figure caption for a description of the bottom panels. \n\n[Figure 2 panels: Pairwise LDA + Max (0.449); Pairwise LDA + Coupling (0.358); LDA (0.457); QDA (0.334)] \n\nFigure 2: A three class problem similar to that in figure 1, with the data in each class generated from a mixture of Gaussians. 
The first panel shows the max-wins procedure. The second panel shows the decision boundary from coupling of the pairwise linear discriminant rules based on d in (6). The third panel shows the three-class LDA boundaries, and the fourth the QDA boundaries. The numbers in the captions are the error rates based on a large test set from the same population. \n\n3 Pairwise threshold optimization \n\nAs pointed out by Friedman (1996), approaching the classification problem in a pairwise fashion allows one to optimize the classifier in a way that would be computationally burdensome for a K-class classifier. Here we discuss optimization of the classification threshold. \nFor each two-class problem, let logit p_ij(x) = d_ij(x). Normally we would classify to class i if d_ij(x) > 0. Suppose we find that d_ij(x) > t_ij is better. Then we define d'_ij(x) = d_ij(x) - t_ij, and hence p'_ij(x) = logit^{-1}(d'_ij(x)). We do this for all pairs, and then apply the coupling algorithm to the p'_ij(x) to obtain probabilities p_i(x). In this way we can optimize over the K(K-1)/2 threshold parameters separately, rather than jointly. With nearest neighbours, there are other approaches to threshold optimization, that bias the class probability estimates in different ways. See Hastie & Tibshirani (1996) for details. An example of the benefit of threshold optimization is given next. \n\nExample: ten Gaussian classes with unequal covariance \n\nIn this simulated example taken from Friedman (1996), there are 10 Gaussian classes in 20 dimensions. The mean vectors of each class were chosen as 20 independent uniform [0,1] random variables. The covariance matrices are constructed from eigenvectors whose square roots are uniformly distributed on the 20-dimensional unit sphere (subject to being mutually orthogonal), and eigenvalues uniform on [0.01, 1.01]. 
There are 100 observations per class in the training set, and 200 per class in the test set. The optimal decision boundaries in this problem are quadratic, and neither linear nor nearest-neighbor methods are well-suited. Friedman states that the Bayes error rate is less than 1%. \n\nFigure 3 shows the test error rates for linear discriminant analysis, J-nearest neighbor and their paired versions using threshold optimization. We see that the coupled classifiers nearly halve the error rates in each case. In addition, the coupled rule works a little better than Friedman's max rule in each task. Friedman (1996) reports a median test error rate of about 16% for his thresholded version of pairwise nearest neighbor. \n\nWhy does the pairwise thresholding work in this example? We looked more closely at the pairwise nearest neighbour rules that were constructed for this problem. The thresholding biased the pairwise distances by about 7% on average. The average number of nearest neighbours used per class was 4.47 (.122), while the standard J-nearest neighbour approach used 6.70 (.590) neighbours for all ten classes. For all ten classes, the 4.47 translates into 44.7 neighbours. Hence relative to the standard J-NN rule, the pairwise rule, in using the threshold optimization to reduce bias, is able to use about six times as many near neighbours. \n\n[Figure 3: boxplots of test error over 20 simulations for J-nn, nn/max, nn/coup, lda, lda/max and lda/coup] \n\nFigure 3: Test errors for 20 simulations of ten-class Gaussian example. \n\n4 Discussion \n\nDue to lack of space, there are a number of issues that we did not discuss here. 
In Hastie & Tibshirani (1996), we show the relationship between the pairwise coupling and the max-wins rule: specifically, if the classifiers return 0s or 1s rather than probabilities, the two rules give the same classification. We also apply the pairwise coupling procedure to nearest neighbours and support vector machines. In the latter case, this provides a natural way of extending support vector machines, which are defined for two-class problems, to multi-class problems. \n\nThe pairwise procedures, both Friedman's max-wins and our coupling, are most likely to offer improvements when additional optimization or efficiency gains are possible in the simpler two-class scenarios. In some situations they perform exactly like the multiple class classifiers. Two examples are: a) each of the pairwise rules is based on QDA: i.e. each class is modelled by a Gaussian distribution with separate covariances, and then the r_ij's are derived from Bayes rule; b) a generalization of the above, where the density in each class is modelled in some fashion, perhaps nonparametrically via density estimates or near-neighbor methods, and then the density estimates are used in Bayes rule. \n\nPairwise LDA followed by coupling seems to offer a nice compromise between LDA and QDA, although the decision boundaries are no longer linear. For this special case one might derive a different coupling procedure globally on the logit scale, which would guarantee linear decision boundaries. Work of this nature is currently in progress with Jerry Friedman. \n\nAcknowledgments \n\nWe thank Jerry Friedman for sharing a preprint of his pairwise classification paper with us, and acknowledge helpful discussions with Jerry, Geoff Hinton, Radford Neal and David Tritchler. 
Trevor Hastie was partially supported by grant DMS-9504495 from the National Science Foundation, and grant R01-CA-72028-01 from the National Institutes of Health. Rob Tibshirani was supported by the Natural Sciences and Engineering Research Council of Canada and the IRIS Centre of Excellence. \n\nReferences \n\nBishop, Y., Fienberg, S. & Holland, P. (1975), Discrete Multivariate Analysis, MIT Press, Cambridge. \nBradley, R. & Terry, M. (1952), 'The rank analysis of incomplete block designs. I. The method of paired comparisons', Biometrika, pp. 324-345. \nDeming, W. & Stephan, F. (1940), 'On a least squares adjustment of a sampled frequency table when the expected marginal totals are known', Ann. Math. Statist., pp. 427-444. \nFriedman, J. (1996), Another approach to polychotomous classification, Technical report, Stanford University. \nHastie, T. & Tibshirani, R. (1996), Classification by pairwise coupling, Technical report, University of Toronto. \n", "award": [], "sourceid": 1375, "authors": [{"given_name": "Trevor", "family_name": "Hastie", "institution": null}, {"given_name": "Robert", "family_name": "Tibshirani", "institution": null}]}