{"title": "The Error Coding and Substitution PaCTs", "book": "Advances in Neural Information Processing Systems", "page_first": 542, "page_last": 548, "abstract": "", "full_text": "The Error Coding and Substitution PaCTs \n\nGARETH JAMES \n\nand \n\nTREVOR HASTIE \n\nDepartment of Statistics, Stanford University \n\nAbstract \n\nA new class of plug in classification techniques have recently been de(cid:173)\nveloped in the statistics and machine learning literature. A plug in clas(cid:173)\nsification technique (PaCT) is a method that takes a standard classifier \n(such as LDA or TREES) and plugs it into an algorithm to produce a \nnew classifier. The standard classifier is known as the Plug in Classi(cid:173)\nfier (PiC). These methods often produce large improvements over using \na single classifier. In this paper we investigate one of these methods and \ngive some motivation for its success. \n\n1 Introduction \n\nDietterich and Bakiri (1995) suggested the following method, motivated by Error Correct(cid:173)\ning Coding Theory, for solving k class classification problems using binary classifiers. \n\n\u2022 Produce a k by B (B large) binary coding matrix, ie a matrix of zeros and ones. \nWe will denote this matrix by Z, its i, jth component by Zij, its ith row by Zi and \nits j th column by zj. \n\n\u2022 Use the first column of the coding matrix (Zl) to create two super groups by \nassigning all groups with a one in the corresponding element of Zl to super group \none and all other groups to super group zero. \n\n\u2022 Train your plug in classifier (PiC) on the new two class problem. \n\u2022 Repeat the process for each of the B columns (Zl, Z2, ... ,ZB) to produce B \n\ntrained classifiers. \n\n\u2022 For a new test point apply each of the B classifiers to it. Each classifier will \nproduce a 'Pi which is the estimated probability the test point comes from the \njth super group one. This will produce a vector of probability estimates, f> = \n(PI , ih., . .. 
,P B) T . \n\n\fThe Error Coding and Substitution PaCTs \n\n543 \n\n\u2022 To classify the point calculate Li = 2::=1 IPi - Zij I for each of the k groups (ie \nfor i from 1 to k). This is the LI distance between p and Zi (the ith row of Z). \nClassify to the group with lowest L 1 distance or equivalently argi min Li \n\nWe call this the ECOC PaCT. Each row in the coding matrix corresponds to a unique (non(cid:173)\nminimal) coding for the appropriate class. Dietterich's motivation was that this allowed \nerrors in individual classifiers to be corrected so if a small number of classifiers gave a \nbad fit they did not unduly influence the final classification. Several PiC's have been tested. \nThe best results were obtained by using tree's, so all the experiments in this paper are stated \nusing a standard CART PiC. Note however, that the theorems are general to any Pic. \nIn the past it has been assumed that the improvements shown by this method were at(cid:173)\ntributable to the error coding structure and much effort has been devoted to choosing an \noptimal coding matrix. In this paper we develop results which suggest that a randomized \ncoding matrix should match (or exceed) the performance of a designed matrix. \n\n2 The Coding Matrix \n\nEmpirical results (see Dietterich and Bakiri (1995\u00bb suggest that the ECOC PaCT can pro(cid:173)\nduce large improvements over a standard k class tree classifier. However, they do not shed \nany light on why this should be the case. To answer this question we need to explore its \nprobability structure. The coding matrix, Z, is central to the PaCT. In the past the usual \napproach has been to choose one with as large a separation between rows (Zi) as possible \n(in terms of hamming distance) on the basis that this allows the largest number of errors \nto be corrected. In the next two sections we will examine the tradeoffs between a designed \n(deterministic) and a completely randomized matrix. 
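Before analyzing the coding matrix, the ECOC PaCT of section 1 can be sketched concretely. The following is a minimal illustration, not the authors' implementation: the binary PiC here is a toy centroid-based scorer standing in for CART, and all helper names are hypothetical; any classifier producing a probability estimate for super group one could be plugged in.

```python
import numpy as np

def fit_binary_pic(X, y01):
    # Toy binary PiC (stand-in for a CART tree): score a point by its
    # relative distance to the two super group centroids.
    c0 = X[y01 == 0].mean(axis=0)
    c1 = X[y01 == 1].mean(axis=0)
    def prob_one(x):
        d0 = np.linalg.norm(x - c0)
        d1 = np.linalg.norm(x - c1)
        return d0 / (d0 + d1)  # nearer to c1 => larger estimate of super group one
    return prob_one

def ecoc_fit(X, y, k, B, rng):
    # Draw a random k x B coding matrix Z and train one binary PiC per column.
    Z = rng.integers(0, 2, size=(k, B))
    for j in range(B):
        # Redraw any constant column: it would define an empty super group.
        while Z[:, j].min() == Z[:, j].max():
            Z[:, j] = rng.integers(0, 2, size=k)
    pics = [fit_binary_pic(X, Z[y, j]) for j in range(B)]
    return Z, pics

def ecoc_classify(x, Z, pics):
    # Classify by L1 distance between p_hat and each row of Z.
    p_hat = np.array([f(x) for f in pics])
    L = np.abs(p_hat - Z).sum(axis=1)
    return int(np.argmin(L))
```

On well separated Gaussian classes this recovers the labels; the decoding step is exactly the L_1 rule of the bullet list above.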
\nSome of the results that follow will make use of the following assumption: \n\nE[\\hat{p}_j | Z, X] = \\sum_{i=1}^k Z_{ij} q_i = Z^{jT} q, \\quad j = 1, ..., B \\quad (1) \n\nwhere q_i = P(G_i | X) is the posterior probability that the test observation is from group i, given that our predictor variable is X. This is an unbiasedness assumption. It states that on average our classifier will estimate the probability of being in super group one correctly. The assumption is probably not too bad given that trees are considered to have low bias. \n\n2.1 Deterministic Coding Matrix \n\nLet D_i = 1 - 2L_i/B for i = 1, ..., k. Notice that arg_i min L_i = arg_i max D_i, so using D_i to classify is identical to the ECOC PaCT. Theorem 3 in section 2.2 explains why this is an intuitive transformation to use. \n\nObviously no PaCT can outperform the Bayes Classifier. However, we would hope that it would achieve the Bayes Error Rate when we use the Bayes Classifier as our PiC for each 2 class problem. We have defined this property as Bayes Optimality. Bayes Optimality is essentially a consistency result. It states that if our PiC converges to the Bayes Classifier as the training sample size increases, then so will the PaCT. \n\nDefinition 1 A PaCT is said to be Bayes Optimal if, for any test set, it always classifies to the Bayes group when the Bayes Classifier is our PiC. \n\nFor the ECOC PaCT this means that arg_i max q_i = arg_i max D_i, for all points in the predictor space, when we use the Bayes Classifier as our PiC. However, it can be shown that D_i then takes a particular form, for i = 1, ..., k, from which it is not clear why there should be any guarantee that arg_i max D_i = arg_i max q_i. In fact the following theorem tells us that only in very restricted circumstances will the ECOC PaCT be Bayes Optimal. \n\nTheorem 1 The Error Coding method is Bayes Optimal iff the Hamming distance between every pair of rows of the coding matrix is equal. 
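Theorem 1's condition is easy to check for any given Z: compute the Hamming distance (number of differing coordinates) for every pair of rows and see whether a single value occurs. A small sketch with a hypothetical helper name; the 4 x 4 matrix below, derived from a Hadamard design, satisfies the condition, while a generic matrix does not.

```python
import itertools
import numpy as np

def has_equal_row_distances(Z):
    # Theorem 1 condition: every pair of rows of the coding matrix
    # lies at the same Hamming distance.
    dists = {int(np.sum(r != s)) for r, s in itertools.combinations(Z, 2)}
    return len(dists) == 1

# Rows of a 0/1-coded 4 x 4 Hadamard matrix: all row pairs at distance 2.
Z_hadamard = np.array([[1, 1, 1, 1],
                       [1, 0, 1, 0],
                       [1, 1, 0, 0],
                       [1, 0, 0, 1]])

# A generic coding matrix: pairwise distances 2, 3 and 1, so unequal.
Z_generic = np.array([[0, 0, 0],
                      [0, 1, 1],
                      [1, 1, 1]])
```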
\n\nThe Hamming distance between two binary vectors is the number of points where they differ. For general B and k there is no known way to generate a matrix with this property, so the ECOC PaCT will not be Bayes Optimal. \n\n2.2 Random Coding Matrix \n\nWe have seen in the previous section that there are potential problems with using a deterministic matrix. Now suppose we randomly generate a coding matrix by choosing a zero or one with equal probability for every coordinate. Let \\mu_i = E(1 - 2|\\hat{p}_j - Z_{ij}| \\mid T), where T is the training set. Then \\mu_i is the conditional expectation of D_i, and we can prove the following theorem. \n\nTheorem 2 For a random coding matrix, conditional on T, arg_i max D_i \\to arg_i max \\mu_i a.s. as B \\to \\infty. In other words, the classification from the ECOC PaCT approaches the classification from just using arg_i max \\mu_i a.s. \n\nThis leads to Corollary 1, which indicates we have eliminated the main concern of a deterministic matrix. \n\nCorollary 1 When the coding matrix is randomly chosen the ECOC PaCT is asymptotically Bayes Optimal, i.e. arg_i max D_i \\to arg_i max q_i a.s. as B \\to \\infty. \n\nThis theorem is a consequence of the strong law. Theorems 2 and 3 provide motivation for the ECOC procedure. \n\nTheorem 3 Under Assumption 1, for a randomly generated coding matrix, \n\nE D_i = E \\mu_i = q_i, \\quad i = 1, ..., k \n\nThis tells us that D_i is an unbiased estimate of the conditional probability, so classifying to the maximum is in a sense an unbiased estimate of the Bayes classification. \n\nNow Theorem 2 tells us that for large B the ECOC PaCT will be similar to classifying using arg_i max \\mu_i only. However, what we mean by large depends on the rate of convergence. Theorem 4 tells us that this rate is in fact exponential. \n\nTheorem 4 If we randomly choose Z then, conditional on T, for any fixed X, \n\nPr(arg_i max D_i \\neq arg_i max \\mu_i) \\le (k - 1) \\cdot e^{-mB} \n\nfor some constant m. 
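Theorems 2 and 3 can be illustrated with a quick Monte Carlo sketch. The posteriors q below are made up, and assumption (1) is taken to hold exactly (no estimation noise, so \hat{p}_j = Z^{jT} q); under these idealizations D_i should settle near q_i for large B, and arg_i max D_i should pick the Bayes group.

```python
import numpy as np

rng = np.random.default_rng(1)
q = np.array([0.5, 0.3, 0.2])   # hypothetical posterior probabilities at a fixed X
k, B = len(q), 4000

# Random coding matrix: each entry 0 or 1 with equal probability.
Z = rng.integers(0, 2, size=(k, B))

# Under assumption (1) with no estimation noise, p_hat_j = Z^jT q.
p_hat = q @ Z                   # shape (B,)

# D_i = 1 - 2 L_i / B, where L_i is the L1 distance of p_hat to row i of Z.
D = 1.0 - 2.0 * np.abs(p_hat - Z).mean(axis=1)
```

Here D lands within Monte Carlo error of q, consistent with Theorem 3's unbiasedness, and arg max D agrees with arg max q.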
\n\nNote that Theorem 4 does not depend on Assumption 1. This tells us that the error rate for the ECOC PaCT is equal to the error rate using arg_i max \\mu_i plus a term which decreases exponentially in the limit. This result can be proved using Hoeffding's inequality (Hoeffding (1963)). \n\nOf course this only gives an upper bound on the error rate and does not necessarily indicate the behavior for smaller values of B. Under certain conditions a Taylor expansion indicates that Pr(arg_i max D_i \\neq arg_i max \\mu_i) \\approx 0.5 - m\\sqrt{B} for small values of m\\sqrt{B}. So we might expect that for smaller values of B the error rate decreases as some power of B, but that as B increases the change looks more and more exponential. \n\nTo test this hypothesis we calculated the error rates for 6 different values of B (15, 26, 40, 70, 100, 200) on the LETTER data set (available from the Irvine Repository of machine learning). Each value of B contains 5 points corresponding to 5 random matrices. Each point is the average over 20 random training sets. Figure 1 illustrates the results. \n\n[Figure 1: Best fit curves for rates 1/\\sqrt{B} and 1/B, error rate plotted against B] \n\nHere we have two curves. The lower curve is the best fit of 1/\\sqrt{B} to the first four groups. It fits those groups well but under-predicts errors for the last two groups. The upper curve is the best fit of 1/B to the last four groups. It fits those groups well but over-predicts errors for the first two groups. This supports our hypothesis that the error rate is moving through the powers of B towards an exponential fit. \n\nWe can see from the figure that even for relatively low values of B the reduction in error rate has slowed substantially. 
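A best fit of c/\sqrt{B} (or c/B) to error rates is a one-parameter least squares fit through the origin in the transformed variable B^{-1/2} (or B^{-1}). A sketch, using made-up error rates purely for illustration (the LETTER values are not reproduced here):

```python
import numpy as np

def fit_rate(Bs, errors, power):
    # Least squares fit of c * B^(-power) through the origin.
    x = np.asarray(Bs, dtype=float) ** (-power)
    y = np.asarray(errors, dtype=float)
    return float(x @ y / (x @ x))   # closed-form slope

Bs = [15, 26, 40, 70, 100, 200]
fake_errors = [2.0 / np.sqrt(b) for b in Bs]      # synthetic: exactly c/sqrt(B), c = 2

c_half = fit_rate(Bs[:4], fake_errors[:4], 0.5)   # fit to the first four B values
```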
This indicates that almost all the remaining errors are a result of the error rate of arg_i max \\mu_i, which we cannot reduce by changing the coding matrix. \n\nThe coding matrix can be viewed as a method for sampling from the distribution of 1 - 2|\\hat{p}_j - Z_{ij}|. If we sample randomly we will estimate \\mu_i (its mean). It is well known that the optimal way to estimate such a parameter is by random sampling, so it is not possible to improve on this by designing the coding matrix. Of course it may be possible to improve on arg_i max \\mu_i by using the training data to influence the sampling procedure and hence estimating a different quantity. However, a designed coding matrix does not use the training data. It should not be possible to improve on random sampling by using such a procedure (as has been attempted in the past). \n\n3 Why does the ECOC PaCT work? \n\nThe easiest way to motivate why the ECOC PaCT works, in the case of tree classifiers, is to consider a very similar method which we call the Substitution PaCT. We will show that under certain conditions the ECOC PaCT is very similar to the Substitution PaCT and then motivate the success of the latter. \n\n3.1 Substitution PaCT \n\nThe Substitution PaCT uses a coding matrix to form many different trees, just as the ECOC PaCT does. However, instead of using the transformed training data to form a probability estimate for each two class problem, we now plug the original (i.e. k-class) training data back into the new tree. We use this training data to form probability estimates and classifications just as we would with a regular tree. The only difference is in how the tree is formed. Therefore, unlike the ECOC PaCT, each tree will produce a probability estimate for each of the k classes. For each class we simply average the probability estimate for that class over our B trees. 
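In code, this per-class averaging over the B trees is just a mean over per-tree probability estimates. A sketch with a hypothetical array of estimates standing in for the fitted trees (the tree-growing itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(2)
k, B = 26, 40

# Hypothetical per-tree estimates: probs[i, j] is the jth tree's probability
# estimate for class i; each column sums to one over the k classes.
probs = rng.random((k, B))
probs /= probs.sum(axis=0, keepdims=True)

# Substitution PaCT estimate: average each class's estimate over the B trees.
p_sub = probs.mean(axis=1)

# Classify to the class with the largest averaged estimate.
label = int(np.argmax(p_sub))
```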
So if p_i^S is the probability estimate for the Substitution PaCT, then \n\np_i^S = (1/B) \\sum_{j=1}^B \\hat{p}_{ij} \\quad (2) \n\nwhere \\hat{p}_{ij} is the probability estimate for the ith group for the tree formed from the jth column of the coding matrix. \n\nTheorem 5 shows that under certain conditions the ECOC PaCT can be thought of as an approximation to the Substitution PaCT. \n\nTheorem 5 Suppose that \\hat{p}_{ij} is independent from the jth column of the coding matrix, for all i and j. Then as B approaches infinity the ECOC PaCT and Substitution PaCT will converge, i.e. they will give identical classification rules. \n\nThe theorem depends on an unrealistic assumption. However, empirically it is well known that trees are unstable and a small change in the data set can cause a large change in the structure of the tree, so it may be reasonable to suppose that there is a low correlation. \n\nTo test this empirically we ran the ECOC and Substitution PaCTs on a simulated data set. The data set was composed of 26 classes. Each class was distributed as a bivariate normal with identity covariance matrix and uniformly distributed means. The training data consisted of 10 observations from each group. Figure 2 shows a plot of the estimated probabilities for each of the 26 classes and 1040 test data points, averaged over 10 training data sets. Only points where the true posterior probability is greater than 0.01 have been plotted, since groups with insignificant probabilities are unlikely to affect the classification. If the two groups were producing identical estimates we would expect the data points to lie on the dotted 45 degree line. Clearly this is not the case. The Substitution PaCT is systematically shrinking the probability estimates. However, there is a very clear linear relationship (R^2 \\approx 95%) and since we are only interested in the arg max for each test point we might expect similar classifications. In fact this is the case, with fewer than 4% of points correctly classified by one group but not the other. \n\n[Figure 2: Probability estimates from both the ECOC and Substitution PaCTs] \n\n3.2 Why does the Substitution PaCT work? \n\nThe fact that p_i^S is an average of probability estimates suggests that a reduction in variability may be an explanation for the success of the Substitution PaCT. Unfortunately it has been well shown (see for example Friedman (1996)) that a reduction in variance of the probability estimates does not necessarily correspond to a reduction in the error rate. However, Theorem 6 provides simplifying assumptions under which a relationship between the two quantities exists. \n\nTheorem 6 Suppose that \n\np_i^T = a_0^T + a_1^T q_i + \\sigma_T e_i^T \\quad (a_1^T > 0) \\quad (3) \n\nand \n\n\\hat{p}_{ij} = a_0^S + a_1^S q_i + \\sigma_S e_{ij}^S \\quad (a_1^S > 0) \\quad (4) \n\nwhere e^S and e^T have identical joint distributions with variance 1, p_i^T is the probability estimate of the ith group for a k class tree method, the a_0, a_1 and \\sigma terms are constants, and q_i is the true posterior probability. Let \n\n\\gamma = Var(p_i^T / a_1^T) / Var(\\hat{p}_{i1} / a_1^S) \n\nand \\rho = corr(\\hat{p}_{i1}, \\hat{p}_{i2}) (assumed constant for all i). Then \n\nPr(arg_i max p_i^S = arg_i max q_i) \\ge Pr(arg_i max p_i^T = arg_i max q_i) \\quad (5) \n\nif \n\n\\gamma > \\rho \\quad (6) \n\nand \n\nB \\ge (1 - \\rho) / (\\gamma - \\rho) \\quad (7) \n\nThe theorem states that, under fairly general conditions, the probability that the Substitution PaCT gives the same classification as the Bayes classifier is at least as great as that for the tree method, provided that the standardized variability is low enough. It should be noted that only in the case of two groups is there a direct correspondence between the error rate and (5). The inequality in (5) is strict for most common distributions (e.g. normal, uniform, exponential and gamma) of e. \n\nNow there is reason to believe that in general \\rho will be small. This is a result of the empirical variability of tree classifiers. A small change in the training set can cause a large change in the structure of the tree and also the final probability estimates. So by changing the super group coding we might expect a probability estimate that is fairly unrelated to previous estimates and hence a low correlation. \n\nTo test the accuracy of this theory we examined the results from the simulation performed in section 3.1. We wished to estimate \\gamma and \\rho. The following table summarizes our estimates for the variance and standardizing (a_1) terms from the simulated data set. \n\n[Table: variance and standardizing (a_1) estimates for the Substitution PaCT and the Tree Method; numerical entries not recovered] \n\n[Figure 3: Error rates on the simulated data set for tree method, Substitution PaCT and ECOC PaCT plotted against B (on log scale)] \n\nThese quantities give us an estimate for \\gamma of \\hat{\\gamma} = 0.227. We also derived an estimate for \\rho of \\hat{\\rho} = 0.125. We see that \\hat{\\rho} is less than \\hat{\\gamma}, so provided B \\ge (1 - \\hat{\\rho}) / (\\hat{\\gamma} - \\hat{\\rho}) \\approx 8.6 we should see an improvement in the Substitution PaCT over a k class tree classifier. Figure 3 shows that the Substitution error rate drops below that of the tree classifier at almost exactly this point. \n\n4 Conclusion \n\nThe ECOC PaCT was originally envisioned as an adaption of error coding ideas to classification problems. Our results indicate that the error coding matrix is simply a method for randomly sampling from a fixed distribution. This idea is very similar to the Bootstrap, where we randomly sample from the empirical distribution for a fixed data set. There you are trying to estimate the variability of some parameter. 
Your estimate will have two sources of error: randomness caused by sampling from the empirical distribution, and the randomness from the data set itself. In our case we have the same two sources of error: error caused by sampling from 1 - 2|\\hat{p}_j - Z_{ij}| to estimate \\mu_i, and errors caused by \\mu itself. In both cases the first sort of error will reduce rapidly, and it is the second type we are really interested in. It is possible to motivate the reduction in error rate of using arg_i max \\mu_i in terms of a decrease in variability, provided B is large enough and our correlation (\\rho) is small enough. \n\nReferences \n\nDietterich, T. G. and Bakiri, G. (1995) Solving Multiclass Learning Problems via Error-Correcting Output Codes, Journal of Artificial Intelligence Research 2 (1995) 263-286 \n\nDietterich, T. G. and Kong, E. B. (1995) Error-Correcting Output Coding Corrects Bias and Variance, Proceedings of the 12th International Conference on Machine Learning, pp. 313-321, Morgan Kaufmann \n\nFriedman, J. H. (1996) On Bias, Variance, 0/1-loss, and the Curse of Dimensionality, Dept. of Statistics, Stanford University, Technical Report \n\nHoeffding, W. (1963) Probability Inequalities for Sums of Bounded Random Variables, Journal of the American Statistical Association, March 1963 \n", "award": [], "sourceid": 1481, "authors": [{"given_name": "Gareth", "family_name": "James", "institution": null}, {"given_name": "Trevor", "family_name": "Hastie", "institution": null}]}