{"title": "Probability Estimates for Multi-Class Classification by Pairwise Coupling", "book": "Advances in Neural Information Processing Systems", "page_first": 529, "page_last": 536, "abstract": "", "full_text": "Probability Estimates for Multi-class Classification by Pairwise Coupling

Ting-Fan Wu, Chih-Jen Lin
Department of Computer Science
National Taiwan University
Taipei 106, Taiwan

Ruby C. Weng
Department of Statistics
National Chengchi University
Taipei 116, Taiwan

Abstract

Pairwise coupling is a popular multi-class classification method that combines all pairwise comparisons between classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show, conceptually and experimentally, that the proposed approaches are more stable than two existing popular methods: voting and the method of [3].

1 Introduction

The multi-class classification problem refers to assigning each observation to one of k classes. As two-class problems are much easier to solve, many authors propose using two-class classifiers for multi-class classification. In this paper we focus on techniques that provide a multi-class solution by combining all pairwise comparisons.

A common way to combine pairwise comparisons is voting [6, 2]. It constructs a rule for discriminating between every pair of classes and then selects the class with the most winning two-class decisions. Though the voting procedure requires only pairwise decisions, it predicts just a class label.
In many scenarios, however, probability estimates are desired. As numerous (pairwise) classifiers do provide class probabilities, several authors [12, 11, 3] have proposed probability estimates obtained by combining the pairwise class probabilities.

Given an observation x and class label y, we assume that estimates r_ij of the pairwise class probabilities μ_ij = p(y = i | y = i or j, x) are available. Here the r_ij are obtained by some binary classifiers. The goal is then to estimate {p_i}_{i=1}^k, where p_i = p(y = i | x), i = 1, ..., k. We propose to obtain an approximate solution to an identity and then select the label with the highest estimated class probability. The existence of the solution is guaranteed by the theory of finite Markov chains. Motivated by the optimization formulation of this method, we propose a second approach. Interestingly, it can also be regarded as an improved version of the coupling approach of [12]. Both proposed methods reduce to solving linear systems and are simple to implement. Furthermore, from conceptual and experimental points of view, we show that the two proposed methods are more stable than voting and the method in [3].

We organize the paper as follows. In Section 2, we review two existing methods. Sections 3 and 4 detail the two proposed approaches. Section 5 presents the relationship among the four methods through their corresponding optimization formulations. In Section 6, we compare these methods using simulated and real data; the classifiers considered are support vector machines. Section 7 concludes the paper. Due to space limitations, we omit all detailed proofs. A complete version of this work is available at http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/svmprob.pdf.

2 Review of Two Methods

Let r_ij be the estimates of μ_ij = p_i/(p_i + p_j).
The voting rule [6, 2] is

    δ_V = argmax_i [ Σ_{j: j≠i} I{r_ij > r_ji} ].    (1)

A simple estimate of the probabilities can be derived as p^v_i = 2 Σ_{j: j≠i} I{r_ij > r_ji} / (k(k − 1)).

The authors of [3] suggest another method to estimate the class probabilities, and they claim that the resulting classification rule can outperform δ_V in some situations. Their approach is based on minimizing the Kullback-Leibler (KL) distance between the r_ij and μ_ij:

    l(p) = Σ_{i≠j} n_ij r_ij log(r_ij / μ_ij),    (2)

subject to Σ_{i=1}^k p_i = 1 and p_i > 0, i = 1, ..., k, where n_ij is the number of instances in class i or j. Setting ∇l(p) = 0 leads to a nonlinear system, and [3] proposes an iterative procedure to find the minimum of (2). If r_ij > 0 for all i ≠ j, the existence of a unique global minimum of (2) has been proved in [5] and the references therein. Let p* denote this point. The resulting classification rule is

    δ_HT(x) = argmax_i [p*_i].

It is shown in Theorem 1 of [3] that

    p*_i > p*_j if and only if p̃_i > p̃_j, where p̃_j = 2 Σ_{s: s≠j} r_js / (k(k − 1));    (3)

that is, the p̃_i are in the same order as the p*_i. Therefore, p̃ is sufficient if one only requires the classification rule. In fact, as pointed out by [3], p̃ can be derived as an approximation to the identity

    p_i = Σ_{j: j≠i} ( (p_i + p_j)/(k − 1) ) ( p_i/(p_i + p_j) ) = Σ_{j: j≠i} ( (p_i + p_j)/(k − 1) ) μ_ij    (4)

by replacing p_i + p_j with 2/k and μ_ij with r_ij.

3 Our First Approach

Note that δ_HT is essentially argmax_i [p̃_i], and p̃ is an approximate solution to (4). Instead of replacing p_i + p_j by 2/k, in this section we propose to solve the system

    p_i = Σ_{j: j≠i} ( (p_i + p_j)/(k − 1) ) r_ij, ∀i, subject to Σ_{i=1}^k p_i = 1, p_i ≥ 0, ∀i.    (5)

Let p̄ denote the solution to (5).
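Before turning to the solution of (5), the two baseline estimates reviewed above are easy to state in code. The following sketch (a hypothetical NumPy helper written for illustration, not from the paper) computes the voting probabilities p^v of (1) and the estimates p̃ of (3) from a matrix r with r[i, j] = r_ij:

```python
import numpy as np

def voting_and_tilde(r):
    """Compute p^v from (1) and tilde-p from (3).

    r is a k x k array with r[i, j] = r_ij for i != j (diagonal ignored).
    Illustrative helper only, not the authors' code.
    """
    k = r.shape[0]
    wins = (r > r.T).sum(axis=1)          # #{j : r_ij > r_ji} for each class i
    p_v = 2.0 * wins / (k * (k - 1))      # voting probabilities p^v_i
    row = r.sum(axis=1) - np.diag(r)      # sum_{s != i} r_is
    p_tilde = 2.0 * row / (k * (k - 1))   # tilde-p_i of (3)
    return p_v, p_tilde
```

Since r_ij + r_ji = 1 off the diagonal, p̃ automatically sums to one, so no extra normalization is needed.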
Then the resulting decision rule is

    δ_1 = argmax_i [p̄_i].

As δ_HT relies on p_i + p_j ≈ 2/k, in Section 6.1 we use two examples to illustrate possible problems with this rule.

To solve (5), we rewrite it as

    Qp = p, Σ_{i=1}^k p_i = 1, p_i ≥ 0, ∀i, where Q_ij = r_ij/(k − 1) if i ≠ j and Q_ii = Σ_{s: s≠i} r_is/(k − 1).    (6)

Observe that Σ_{i=1}^k Q_ij = 1 for j = 1, ..., k and 0 ≤ Q_ij ≤ 1 for all i, j, so Q^T is the transition matrix of a finite Markov chain. Moreover, if r_ij > 0 for all i ≠ j, then Q_ij > 0, which implies this Markov chain is irreducible and aperiodic. These conditions guarantee the existence of a unique stationary probability vector with all states positive recurrent. Hence, we have the following theorem:

Theorem 1. If r_ij > 0 for all i ≠ j, then (6) has a unique solution p with 0 < p_i < 1, ∀i.

By Theorem 1 and some further analysis, if we remove the constraints p_i ≥ 0, ∀i, the linear system with k + 1 equations still has the same unique solution. Furthermore, if any one of the k equalities in Qp = p is removed, we obtain a system with k variables and k equations which again has the same unique solution. Thus, (6) can be solved by Gaussian elimination. Alternatively, as the stationary vector of a Markov chain is the limit of its n-step transition probabilities, p can also be obtained by repeatedly multiplying any initial probability vector by Q.

Now we reexamine this method to gain more insight. The following argument shows that the solution to (5) is the global minimum of a meaningful optimization problem. To begin, we express (5) as Σ_{j: j≠i} r_ji p_i − Σ_{j: j≠i} r_ij p_j = 0, i = 1, ..., k, using the property that r_ij + r_ji = 1 for all i ≠ j.
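Concretely, the Gaussian-elimination route for (6) replaces one of the redundant equations of Qp = p with the normalization Σ_i p_i = 1. The following is an illustrative NumPy sketch (not the authors' implementation):

```python
import numpy as np

def solve_delta1(r):
    """Solve (6): Qp = p with sum(p) = 1, for the first approach.

    r[i, j] = r_ij for i != j (diagonal ignored); illustrative code only.
    """
    r = np.asarray(r, dtype=float).copy()
    np.fill_diagonal(r, 0.0)
    k = r.shape[0]
    Q = r / (k - 1.0)                 # Q_ij = r_ij / (k - 1) for i != j
    d = Q.sum(axis=1)                 # sum_{s != i} r_is / (k - 1)
    np.fill_diagonal(Q, d)            # Q_ii as in (6)
    A = Q - np.eye(k)                 # (Q - I) p = 0
    A[-1, :] = 1.0                    # drop one equation, impose sum(p) = 1
    b = np.zeros(k)
    b[-1] = 1.0
    return np.linalg.solve(A, b)
```

With exact inputs r_ij = p_i/(p_i + p_j), the solver recovers p, since such p satisfies (5) and the solution is unique.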
Then the solution to (5) is in fact the global minimum of the following problem:

    min_p Σ_{i=1}^k ( Σ_{j: j≠i} r_ji p_i − Σ_{j: j≠i} r_ij p_j )²  subject to Σ_{i=1}^k p_i = 1, p_i ≥ 0, ∀i.    (7)

Indeed, the objective function is always nonnegative, and it attains zero at the solution of (5) and (6).

4 Our Second Approach

Note that both approaches in Sections 2 and 3 solve optimization problems built from relations such as p_i/(p_i + p_j) ≈ r_ij or Σ_{j: j≠i} r_ji p_i ≈ Σ_{j: j≠i} r_ij p_j. Motivated by (7), we suggest another optimization formulation:

    min_p (1/2) Σ_{i=1}^k Σ_{j: j≠i} (r_ji p_i − r_ij p_j)²  subject to Σ_{i=1}^k p_i = 1, p_i ≥ 0, ∀i.    (8)

In related work, [12] proposes to solve a linear system consisting of Σ_{i=1}^k p_i = 1 and any k − 1 equations of the form r_ji p_i = r_ij p_j. However, as pointed out in [11], the results of [12] depend strongly on the selection of these k − 1 equations. In fact, as (8) considers all differences r_ij p_j − r_ji p_i, not just k − 1 of them, it can be viewed as an improved version of [12].

Let p† denote the solution of (8). We then define the classification rule as

    δ_2 = argmax_i [p†_i].

Since (7) has a unique solution, which can be obtained by solving a simple linear system, it is natural to ask whether the minimization problem (8) shares these nice properties. In the rest of this section, we show that it does. The following theorem shows that the nonnegativity constraints in (8) are redundant.

Theorem 2. Problem (8) is equivalent to its simplification without the conditions p_i ≥ 0, ∀i.

Note that the objective function of (8) can be rewritten as

    min_p (1/2) p^T Q p, where Q_ii = Σ_{s: s≠i} r²_si and Q_ij = −r_ji r_ij for i ≠ j.    (9)

From here we can show that Q is positive semi-definite.
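Concretely, Q from (9) can be assembled and the equality-constrained quadratic program solved through its KKT conditions with one bordered linear system. This is an illustrative NumPy sketch, not the paper's implementation:

```python
import numpy as np

def solve_delta2(r):
    """Minimize (1/2) p^T Q p subject to sum(p) = 1, with Q as in (9).

    Solves the bordered KKT system [[Q, e], [e^T, 0]] [p; b] = [0; 1].
    r[i, j] = r_ij for i != j (diagonal ignored); illustrative code only.
    """
    r = np.asarray(r, dtype=float).copy()
    np.fill_diagonal(r, 0.0)
    k = r.shape[0]
    Q = -r * r.T                               # Q_ij = -r_ji r_ij for i != j
    np.fill_diagonal(Q, (r ** 2).sum(axis=0))  # Q_ii = sum_{s != i} r_si^2
    K = np.zeros((k + 1, k + 1))
    K[:k, :k] = Q
    K[:k, k] = 1.0                             # e
    K[k, :k] = 1.0                             # e^T
    rhs = np.zeros(k + 1)
    rhs[k] = 1.0
    sol = np.linalg.solve(K, rhs)
    return sol[:k]                             # p; sol[k] is the multiplier b
```

With exact r_ij = p_i/(p_i + p_j), the objective is zero at the true p, so the system returns the true distribution; with noisy r it returns the least-squares compromise of (8).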
Therefore, without the constraints p_i ≥ 0, ∀i, (9) is a convex quadratic programming problem with one linear equality constraint. Consequently, a point p is a global minimum if and only if it satisfies the KKT optimality condition: there is a scalar b such that

    [ Q    e ] [ p ]   [ 0 ]
    [ e^T  0 ] [ b ] = [ 1 ].    (10)

Here e is the vector of all ones and b is the Lagrange multiplier of the equality constraint Σ_{i=1}^k p_i = 1. Thus, the solution of (8) can be obtained by solving the simple linear system (10). The existence of a unique solution is guaranteed by the invertibility of the matrix in (10). Moreover, if Q is positive definite (PD), this matrix is invertible. The following theorem shows that Q is PD under quite general conditions.

Theorem 3. If for any i = 1, ..., k there are s ≠ i and j ≠ i such that r_si r_js / r_is ≠ r_ji r_sj / r_ij, then Q is positive definite.

In addition to direct methods, we next propose a simple iterative method for solving (10):

Algorithm 1
1. Start with some initial p_i ≥ 0, ∀i, with Σ_{i=1}^k p_i = 1.
2. Repeat (t = 1, ..., k, 1, ...):

        p_t ← (1/Q_tt) [ −Σ_{j: j≠t} Q_tj p_j + p^T Q p ]    (11)

        normalize p    (12)

   until (10) is satisfied.

Theorem 4. If r_sj > 0 for all s ≠ j, and {p^i}_{i=1}^∞ is the sequence generated by Algorithm 1, then any convergent subsequence converges to a global minimum of (8).

As Theorem 3 indicates that Q is in general positive definite, the sequence {p^i}_{i=1}^∞ from Algorithm 1 usually converges globally to the unique minimum of (8).

5 Relations Among Four Methods

The four decision rules δ_HT, δ_1, δ_2, and δ_V can be written as argmax_i [p_i], where p is derived from the following four optimization formulations under the constraints Σ_{i=1}^k p_i = 1 and p_i ≥ 0, ∀i:

    δ_HT: min_p Σ_{i=1}^k [ Σ_{j: j≠i} ( (1/k) r_ij − (1/2) p_i ) ]²,    (13)

    δ_1:  min_p Σ_{i=1}^k [ Σ_{j: j≠i} ( r_ij p_j − r_ji p_i ) ]²,    (14)

    δ_2:  min_p Σ_{i=1}^k Σ_{j: j≠i} ( r_ij p_j − r_ji p_i )²,    (15)

    δ_V:  min_p Σ_{i=1}^k Σ_{j: j≠i} ( I{r_ij > r_ji} p_j − I{r_ji > r_ij} p_i )².    (16)

Note that (13) can be easily verified, and (14) and (15) have been explained in Sections 3 and 4. For (16), the solution is

    p_i = c / Σ_{j: j≠i} I{r_ji > r_ij},

where c is the normalizing constant;* therefore, argmax_i [p_i] is the same as (1). Clearly, (13) can be obtained from (14) by letting p_j ≈ 1/k, ∀j, and r_ji ≈ 1/2, ∀i, j. Such approximations ignore the differences among the p_i. Similarly, (16) follows from (15) by taking the extreme values 0 or 1 for the r_ij.
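A direct transcription of Algorithm 1 in NumPy might look as follows; this is an illustrative sketch under the convention r[i, j] = r_ij (diagonal ignored), not the authors' implementation. The stopping test uses the fact that at a solution of (10) the vector Qp equals the constant vector −b e:

```python
import numpy as np

def algorithm1(r, tol=1e-12, max_sweeps=100000):
    """Iteratively solve (10) by the updates (11)-(12) of Algorithm 1."""
    r = np.asarray(r, dtype=float).copy()
    np.fill_diagonal(r, 0.0)
    k = r.shape[0]
    Q = -r * r.T                                 # Q of (9), off-diagonal part
    np.fill_diagonal(Q, (r ** 2).sum(axis=0))    # Q_ii = sum_{s != i} r_si^2
    p = np.full(k, 1.0 / k)                      # uniform start, sums to 1
    for _ in range(max_sweeps):
        for t in range(k):
            off = Q[t] @ p - Q[t, t] * p[t]      # sum_{j != t} Q_tj p_j
            p[t] = (-off + p @ Q @ p) / Q[t, t]  # update (11)
            p /= p.sum()                         # normalization (12)
        g = Q @ p
        if np.max(np.abs(g - g.mean())) < tol:   # Qp constant <=> (10) holds
            break
    return p
```

Each cyclic update keeps p strictly positive when all r_ij > 0, so the normalization in (12) is always well defined.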
As a result, (16) may enlarge the differences among the p_i. Next, compared with (15), (14) may tend to underestimate the differences among the p_i's: (14) allows the differences between r_ij p_j and r_ji p_i to cancel one another before squaring. Thus, conceptually, (13) and (16) are the more extreme formulations: the former tends to underestimate the differences among the p_i's, while the latter tends to overestimate them. These arguments will be supported by simulated and real data in the next section.

* For the indicator I to be well defined, we consider r_ij ≠ r_ji, which is generally true. In addition, if there is an i for which Σ_{j: j≠i} I{r_ji > r_ij} = 0, an optimal solution of (16) is p_i = 1 and p_j = 0 for all j ≠ i; the resulting decision is the same as that of (1).

6 Experiments

6.1 Simple Simulated Examples

[3] designs a simple experiment in which all p_i's are fairly close and their method δ_HT outperforms the voting strategy δ_V. We conduct this experiment first to assess the performance of our proposed methods. As in [3], we define the class probabilities p_1 = 1.5/k, p_j = (1 − p_1)/(k − 1) for j = 2, ..., k, and then set

    r_ij = p_i/(p_i + p_j) + 0.1 z_ij  if i > j,    (17)
    r_ij = 1 − r_ji  if i < j,    (18)

where the z_ij are standard normal variates. Since the r_ij are required to be within (0, 1), we truncate them at ε below and 1 − ε above, with ε = 0.00001. In this example, class 1 has the highest probability and hence is the correct class.

Figure 1 shows the accuracy rates of the four methods for k = 3, 5, 8, 10, 12, 15, 20, averaged over 1,000 replicates. Note that in this experiment all classes are quite competitive, so, when using δ_V, the highest vote sometimes occurs at two or more different classes. We handle this problem by randomly selecting one class from the ties. This partly explains why δ_V performs poorly. Another explanation is that the r_ij here are all close to 1/2, but (16) uses 1 or 0 instead; therefore, the solution may be severely biased. Besides δ_V, the other three rules do very well in this example.

[Figure 1: three panels plotting accuracy rates (y-axis) against log_2 k (x-axis): (a) balanced p_i, (b) unbalanced p_i, (c) highly unbalanced p_i.]

Figure 1: Accuracy of predicting the true class by the methods δ_HT (solid line, cross marked), δ_V (dashed line, square marked), δ_1 (dotted line, circle marked), and δ_2 (dashed line, asterisk marked) from simulated class probabilities p_i, i = 1, 2, ..., k.

Since δ_HT relies on the approximation p_i + p_j ≈ 2/k, this rule may suffer some loss if the class probabilities are not highly balanced. To examine this point, we consider the following two sets of class probabilities:

(1) We let k_1 = k/2 if k is even and (k + 1)/2 if k is odd; then we define p_1 = 0.95 × 1.5/k_1, p_i = (0.95 − p_1)/(k_1 − 1) for i = 2, ..., k_1, and p_i = 0.05/(k − k_1) for i = k_1 + 1, ..., k.

(2) If k = 3, we define p_1 = 0.95 × 1.5/2, p_2 = 0.95 − p_1, and p_3 = 0.05.
If k > 3, we define p_1 = 0.475, p_2 = p_3 = 0.475/2, and p_i = 0.05/(k − 3) for i = 4, ..., k.

After setting the p_i, we define the pairwise comparisons r_ij as in (17)-(18). Both experiments are repeated 1,000 times, and the accuracy rates are shown in Figures 1(b) and 1(c). In both scenarios the p_i are unbalanced, and, as expected, δ_HT is quite sensitive to this imbalance. The situation is much worse in Figure 1(c) because the approximation p_i + p_j ≈ 2/k is more seriously violated, especially when k is large.

In summary, δ_1 and δ_2 are less sensitive to the p_i, and their overall performance is fairly stable. All features observed here agree with our analysis in Section 5.

6.2 Real Data

In this section we present experimental results on several multi-class problems: segment, satimage, and letter from the Statlog collection [9], USPS [4], and MNIST [7]. All data sets are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/. Their numbers of classes are 7, 6, 26, 10, and 10, respectively. From the thousands of instances in each data set, we select 300 for training and 500 for testing.

We consider support vector machines (SVM) with the RBF kernel e^{−γ||x_i − x_j||²} as the binary classifier. The regularization parameter C and the kernel parameter γ are selected by cross-validation: for each training set, a five-fold cross-validation is conducted over the grid (C, γ) ∈ [2^{−5}, 2^{−3}, ..., 2^{15}] × [2^{−5}, 2^{−3}, ..., 2^{15}]. This is done by modifying LIBSVM [1], a library for SVM. At each (C, γ), sequentially four folds are used as the training set while one fold serves as the validation set. The training on the four folds consists of k(k − 1)/2 binary SVMs.

Table 1: Testing errors (in percentage) by the four methods. Each row reports the testing errors based on one pair of training and testing sets; the mean and std (standard deviation) are from five 5-fold cross-validation procedures used to select the best (C, γ).

Dataset   k    δ_HT mean (std)   δ_1 mean (std)    δ_2 mean (std)    δ_V mean (std)
satimage  6    14.080 (1.306)    14.600 (0.938)    14.760 (0.784)    15.400 (0.219)
               12.960 (0.320)    13.400 (0.400)    13.400 (0.400)    13.360 (0.080)
               14.520 (0.968)    14.760 (1.637)    13.880 (0.392)    14.080 (0.240)
               12.400 (0.000)    12.200 (0.000)    12.640 (0.294)    12.680 (1.114)
               16.160 (0.294)    16.400 (0.379)    16.120 (0.299)    16.160 (0.344)
segment   7     9.960 (0.480)     9.480 (0.240)     9.000 (0.400)     8.880 (0.271)
                6.040 (0.528)     6.280 (0.299)     6.200 (0.456)     6.760 (0.445)
                6.600 (0.000)     6.680 (0.349)     6.920 (0.271)     7.160 (0.196)
                5.520 (0.466)     5.200 (0.420)     5.400 (0.580)     5.480 (0.588)
                7.440 (0.625)     8.160 (0.637)     8.040 (0.408)     7.840 (0.344)
USPS      10   14.840 (0.388)    13.520 (0.560)    12.760 (0.233)    12.520 (0.160)
               12.080 (0.560)    11.440 (0.625)    11.600 (1.081)    11.440 (0.991)
               10.640 (0.933)    10.000 (0.657)     9.920 (0.483)    10.320 (0.744)
               12.320 (0.845)    11.960 (1.031)    11.560 (0.784)    11.840 (1.248)
               13.400 (0.310)    12.640 (0.080)    12.920 (0.299)    12.520 (0.917)
MNIST     10   17.400 (0.000)    16.560 (0.080)    15.760 (0.196)    15.960 (0.463)
               15.200 (0.400)    14.600 (0.000)    13.720 (0.588)    12.360 (0.196)
               17.320 (1.608)    14.280 (0.560)    13.400 (0.657)    13.760 (0.794)
               14.720 (0.449)    14.160 (0.196)    13.360 (0.686)    13.520 (0.325)
               12.560 (0.294)    12.600 (0.000)    13.080 (0.560)    12.440 (0.233)
letter    26   39.880 (1.412)    37.160 (1.106)    34.560 (2.144)    33.480 (0.325)
               41.640 (0.463)    39.400 (0.769)    35.920 (1.389)    33.440 (1.061)
               41.320 (1.700)    38.920 (0.854)    35.800 (1.453)    35.000 (1.066)
               35.240 (1.439)    32.920 (1.121)    29.240 (1.335)    27.400 (1.117)
               43.240 (0.637)    40.360 (1.472)    36.960 (1.741)    34.520 (1.001)
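Schematically, each validation or test instance is then classified by assembling its k(k − 1)/2 pairwise estimates into a full matrix and applying one of the coupling rules. A toy sketch with the binary probability models abstracted into a dictionary (hypothetical glue code, not LIBSVM's):

```python
import numpy as np

def assemble_r(pair_probs, k):
    """pair_probs[(i, j)], i < j, holds the estimate r_ij = P(i | i or j, x).

    Returns the full k x k matrix with r_ji = 1 - r_ij; hypothetical glue code.
    """
    r = np.zeros((k, k))
    for (i, j), rij in pair_probs.items():
        r[i, j] = rij
        r[j, i] = 1.0 - rij
    return r

def predict_by_voting(pair_probs, k):
    """Classify one instance with the voting rule (1)."""
    r = assemble_r(pair_probs, k)
    return int(np.argmax((r > r.T).sum(axis=1)))
```

Any of the four rules can be substituted for the voting step once the matrix r is assembled.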
For the binary SVM of the ith and the jth classes, using the decision values f̂ of the training data, we employ an improved implementation [8] of Platt's posterior probabilities [10] to estimate r_ij:

    r_ij = P(i | i or j, x) = 1 / (1 + e^{A f̂ + B}),    (19)

where A and B are estimated by minimizing the negative log-likelihood function.†

Then, for each validation instance, we apply the four methods to obtain classification decisions; the error over the five validation sets is the cross-validation error at (C, γ). After the cross-validation is done, each rule obtains its best (C, γ).‡ Using these parameters, we train on the whole training set to obtain the final model. Next, in the same way as (19), the decision values from the training data are employed to find the r_ij. Finally, the testing data are classified by each of the four rules.

Due to the randomness of separating the training data into five folds for finding the best (C, γ), we repeat the five-fold cross-validation five times and obtain the mean and standard deviation of the testing error. Moreover, as the selection of 300 training and 500 testing instances from a larger data set is also random, we generate five such pairs. In Table 1, each row reports the testing error based on one pair of training and testing sets. The results show that when the number of classes k is small, the four methods perform similarly; however, for problems with larger k, δ_HT is less competitive. In particular, for the problem letter, which has 26 classes, δ_2 or δ_V outperforms δ_HT by at least 5%. It seems that the characteristics of the problems here are closer to the setting of Figure 1(c) than to that of Figure 1(a). All these results agree with the previous findings in Sections 5 and 6.1. Note that in Table 1 some standard deviations are zero; this means that the best (C, γ) chosen by the different cross-validation runs were all the same. Overall, the variation in parameter selection due to the randomness of cross-validation is not large.

† [10] suggests using f̂ from a validation set instead of the training set. However, this requires a further cross-validation on the four-fold data. For simplicity, we directly use f̂ from the training data.

‡ If more than one parameter set returns the smallest cross-validation error, we simply choose the one with the smallest C.

7 Discussions and Conclusions

As the minimization of the KL distance is a well-known criterion, some may wonder why the performance of δ_HT is not quite satisfactory in some of the examples. One possible explanation is that the KL distance here is derived under the assumptions that n_ij r_ij ~ Bin(n_ij, μ_ij) and that the r_ij are independent; as pointed out in [3], neither assumption holds in the classification problem.

In conclusion, we have provided two methods which are shown to be more stable than both δ_HT and δ_V. In addition, the two proposed approaches require only the solution of linear systems instead of the nonlinear system in [3].

The authors thank S. Sathiya Keerthi for helpful comments.

References

[1] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[2] J. Friedman. Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University, 1996. Available at http://www-stat.stanford.edu/reports/friedman/poly.ps.Z.

[3] T. Hastie and R. Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(1):451-471, 1998.

[4] J. J. Hull.
A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550-554, May 1994.

[5] D. R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 2004. To appear.

[6] S. Knerr, L. Personnaz, and G. Dreyfus. Single-layer learning revisited: a stepwise procedure for building and training a neural network. In J. Fogelman, editor, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, 1990.

[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998. MNIST database available at http://yann.lecun.com/exdb/mnist/.

[8] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt's probabilistic outputs for support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2003.

[9] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Prentice Hall, Englewood Cliffs, N.J., 1994. Data available at http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.

[10] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000. MIT Press.

[11] D. Price, S. Knerr, L. Personnaz, and G. Dreyfus. Pairwise neural network classifiers with probabilistic outputs. In G. Tesauro, D. Touretzky, and T. Leen, editors, Neural Information Processing Systems, volume 7, pages 1109-1116. The MIT Press, 1995.

[12] P. Refregier and F. Vallet. Probabilistic approach for multiclass classification with neural networks.
In Proceedings of the International Conference on Artificial Neural Networks, pages 1003-1007, 1991.
", "award": [], "sourceid": 2454, "authors": [{"given_name": "Ting-fan", "family_name": "Wu", "institution": null}, {"given_name": "Chih-jen", "family_name": "Lin", "institution": null}, {"given_name": "Ruby", "family_name": "Weng", "institution": null}]}