{"title": "Intransitive Likelihood-Ratio Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 1141, "page_last": 1148, "abstract": null, "full_text": "Intransitive Likelihood-Ratio Classifiers\n\nJeff Bilmes and Gang Ji\nDepartment of Electrical Engineering\nUniversity of Washington\nSeattle, WA 98195-2500\n{bilmes,gji}@ee.washington.edu\n\nMarina Meil\u0103\nDepartment of Statistics\nUniversity of Washington\nSeattle, WA 98195-4322\nmmp@stat.washington.edu\n\nAbstract\n\nIn this work, we introduce an information-theoretic correction term to the likelihood ratio classification method for multiple classes. Under certain conditions, the term is sufficient for optimally correcting the difference between the true and estimated likelihood ratio, and we analyze this in the Gaussian case. We find that the new correction term significantly improves the classification results when tested on medium vocabulary speech recognition tasks. Moreover, the addition of this term makes the class comparisons analogous to an intransitive game, and we therefore use several tournament-like strategies to deal with this issue. We find that further small improvements are obtained by using an appropriate tournament. Lastly, we find that intransitivity appears to be a good measure of classification confidence.\n\n1 Introduction\n\nAn important aspect of decision theory is multi-way pattern classification, whereby one must determine the class $c^*$ for a given data vector $x$ that minimizes the overall risk:\n\n$$c^* = \\operatorname*{argmin}_{c'} \\sum_{c} \\ell(c' \\mid c)\\, p(c \\mid x),$$\n\nwhere $\\ell(c' \\mid c)$ is the loss in choosing $c'$ when the true class is $c$. This decision rule is provably optimal for the given loss function [3]. For the 0/1-loss function, it is optimal to simply use the posterior probability to determine the optimal class:\n\n$$c^* = \\operatorname*{argmax}_{c} p(c \\mid x).$$\n\nThis procedure may equivalently be specified using a tournament-style game-playing strategy. In this case, there is an implicit class ordering $c_1, c_2, \\ldots, c_M$ and a class-pair ($i$ and $j$) scoring function for an unknown sample $x$:\n\n$$L_{ij}(x) + P_{ij},$$\n\nwhere $L_{ij}(x) = \\log \\frac{p(x \\mid i)}{p(x \\mid j)}$ is the log-likelihood ratio and $P_{ij} = \\log \\frac{p(i)}{p(j)}$ is the log prior odds. The strategy proceeds by evaluating the score for the first pair of classes in the ordering; the winner then plays the next class in the ordering, and so on until a final winner is found. In any event, this style of classification can be seen as a transitive game [5] between players who correspond to the individual classes.\n\nIn this work we extend likelihood-ratio based classification with a term, based on the Kullback-Leibler divergence [2], that expresses the inherent posterior confusability between the underlying likelihoods being compared for a given pair of players. We find that by including this term, the results of a classification system significantly improve, without changing or increasing the quantity of the estimated free model parameters. We also show how, under certain assumptions, the term can be seen as an optimal correction between the estimated model likelihood ratio and the true likelihood ratio, and gain further intuition by examining the case when the likelihoods $p(x \\mid i)$ are Gaussians. Furthermore, we observe that the new strategy leads to an intransitive game [5], and we investigate several strategies for playing such games. This results in further (but small) improvements. Finally, we consider the instance of intransitivity as a confidence measure, and investigate an iterative approach to further improve the correction term.\n\nSection 2 first motivates and defines our approach, and shows the conditions under which it is optimal. Section 2.1 then reports experimental results which show significant improvements where the likelihoods are hidden Markov models trained on speech data. Section 3 then recasts the procedure as intransitive games, and evaluates a variety of game-playing strategies yielding further (small) error reductions. Section 3.1 attempts to better understand our results via empirical analysis, and evaluates additional classification strategies. Section 4 explores an iterative strategy for improving our technique, and finally Section 5 concludes and discusses future work.\n\n2 Extended Likelihood-Ratio-based Classification\n\nThe Kullback-Leibler (KL) divergence [2], an asymmetric measure of the distance between two probability densities, is defined as follows:\n\n$$D(p \\| q) = \\int p \\log \\frac{p}{q},$$\n\nwhere $p$ and $q$ are probability densities over the same sample space. The KL-divergence is also called the average (under $p$) information for discrimination in favor of $p$ over $q$. For our purposes, we are interested in the KL-divergence between class-conditional likelihoods $p(x \\mid i)$, where $i$ is the class number:\n\n$$D(i \\| j) = \\int p(x \\mid i) \\log \\frac{p(x \\mid i)}{p(x \\mid j)}\\, dx.$$\n\nOne intuitive way of viewing $D(i \\| j)$ is as follows: if $D(i \\| j)$ is small, then samples of class $i$ are more likely to be erroneously classified as class $j$ than when $D(i \\| j)$ is large. Comparing $D(i \\| j)$ and $D(j \\| i)$ should tell us which of $i$ and $j$ is more likely to have its samples mis-classified by the other model. 
Therefore, the difference $D(i \\| j) - D(j \\| i)$, when positive, indicates that samples of class $j$ are more likely to be mis-classified as class $i$ than samples of class $i$ are to be mis-classified as class $j$ (and vice versa when the difference is negative). In other words, $i$ \u201csteals\u201d from $j$ more than $j$ steals from $i$ when the difference is positive, thereby suggesting that class $j$ should receive aid in this case. This difference can be viewed as a form of posterior (i.e., based on the data) \u201cbias\u201d indicating which class should receive favor over the other (footnote 1). We can adjust $L_{ij}$ (the log-likelihood ratio) with this posterior bias, to obtain a new function comparing classes $i$ and $j$ as follows:\n\n$$S_{ij} = L_{ij} + P_{ij} - T_{ij},$$\n\nwhere $T_{ij} = D(i \\| j) - D(j \\| i)$. The likelihood ratio is adjusted in favor of $j$ when $T_{ij}$ is positive, and in favor of $i$ when $T_{ij}$ is negative. We then use $S_{ij}$ to compare the two classes, and when it is positive, choose class $i$.\n\nFootnote 1: Note that this is not the normal notion of statistical bias as in $E[\\hat{\\theta}] - \\theta$, where $\\hat{\\theta}$ is an estimate of the model parameters $\\theta$.\n\nThe above intuition does not explain why such a correction factor should be used, since using $L_{ij}$ along with $P_{ij}$ is already optimal. In practice, however, we do not have access to the true likelihood ratios but instead to an approximation that has been estimated from training data. 
Let the variable $L_{ij}(x) = \\log \\frac{p(x \\mid i)}{p(x \\mid j)}$ be the true log-likelihood ratio, and $\\hat{L}_{ij}(x) = \\log \\frac{\\hat{p}(x \\mid i)}{\\hat{p}(x \\mid j)}$ be the model-based log ratio. Furthermore, let\n\n$$\\hat{D}(i \\| j) = \\int p(x \\mid i) \\log \\frac{\\hat{p}(x \\mid i)}{\\hat{p}(x \\mid j)}\\, dx$$\n\nbe the modified KL-divergence between the class-conditional models, measured modulo the true distribution $p(x \\mid i)$, and let $\\hat{T}_{ij} = \\hat{D}(i \\| j) - \\hat{D}(j \\| i)$. Finally, let $P_{ij}$ (resp. $\\hat{P}_{ij}$) be the true (resp. estimated) log prior odds. Our (usable) scoring function becomes:\n\n$$\\hat{S}_{ij}(x) = \\hat{L}_{ij}(x) + \\hat{P}_{ij} - \\hat{T}_{ij} \\qquad (1)$$\n\nwhich has an intuitive explanation similar to the above.\n\nThere are certain conditions under which the above approach is theoretically justifiable. Let us assume for now a two-class problem where $i$ and $j$ are the two classes, so $p(i) + p(j) = 1$. A sufficient condition for the estimated quantities above to yield optimal performance is for $L_{ij} + P_{ij} = \\hat{L}_{ij} + \\hat{P}_{ij}$ for all $x$ (footnote 2). Since this is not the case in practice, an $ij$-dependent constant term $\\alpha$ may be added, correcting for any differences as best as possible. This yields $L_{ij} + P_{ij} = \\hat{L}_{ij} + \\hat{P}_{ij} + \\alpha$. We can define an $\\alpha$-dependent cost function\n\n$$J(\\alpha) = E_{p(x)}\\left[ \\left( L_{ij} + P_{ij} - \\hat{L}_{ij} - \\hat{P}_{ij} - \\alpha \\right)^2 \\right]$$\n\nwhich, when minimized, yields\n\n$$\\alpha^* = E_{p(x)}[L_{ij}] + P_{ij} - E_{p(x)}[\\hat{L}_{ij}] - \\hat{P}_{ij},$$\n\nstating that the optimal $\\alpha$ under this cost function is just the mean of the difference of the remaining terms. Note that\n\n$$E_{p(x)}[L_{ij}] = p(i)\\, D(i \\| j) - p(j)\\, D(j \\| i) \\quad \\text{and} \\quad E_{p(x)}[\\hat{L}_{ij}] = p(i)\\, \\hat{D}(i \\| j) - p(j)\\, \\hat{D}(j \\| i).$$\n\nSeveral additional assumptions lead to Equation 1. First, let us assume that the prior probabilities are equal (so $p(i) = p(j) = 1/2$) and that the estimated and true priors are negligibly different (i.e., $P_{ij} - \\hat{P}_{ij} \\approx 0$). Secondly, if we assume that $D(i \\| j) = D(j \\| i)$, this implies that $E_{p(x)}[L_{ij}] = 0$ under equal priors. While KL-divergence is not symmetric in general, we can see that if this holds (or is approximately true for a given problem) then the remaining correction is $-E_{p(x)}[\\hat{L}_{ij}]$, which under equal priors is proportional to $-\\hat{T}_{ij}$, yielding the $\\hat{T}_{ij}$ term in Equation 1.\n\nFootnote 2: Note that we have dropped the $x$ argument for notational simplicity.\n\nTo gain further insight, we can examine the case when the likelihoods are univariate Gaussian distributions, with means $\\mu_i, \\mu_j$ and variances $\\sigma_i^2, \\sigma_j^2$. In this case,\n\n$$T_{ij} = \\log \\frac{\\sigma_j^2}{\\sigma_i^2} + \\frac{1}{2} \\left[ \\frac{\\sigma_i^2}{\\sigma_j^2} - \\frac{\\sigma_j^2}{\\sigma_i^2} + (\\mu_i - \\mu_j)^2 \\left( \\frac{1}{\\sigma_j^2} - \\frac{1}{\\sigma_i^2} \\right) \\right] \\qquad (2)$$\n\nIt is easy to see that for $\\sigma_i = \\sigma_j$ the value of $T_{ij}$ is zero for any $\\mu_i, \\mu_j$. By computing the derivative $\\partial T_{ij} / \\partial \\sigma_i$ we can show that $T_{ij}$ is monotonically increasing with $\\sigma_i$. Hence, $T_{ij}$ is positive iff $\\sigma_i > \\sigma_j$, and therefore it penalizes the distribution (class) with higher variance.\n\nSimilar relations hold for multivariate Gaussians with means $\\mu_i, \\mu_j$ and covariance matrices $\\Sigma_i, \\Sigma_j$:\n\n$$T_{ij} = \\log \\frac{|\\Sigma_j|}{|\\Sigma_i|} + \\frac{1}{2} \\left[ \\operatorname{tr}\\left( \\Sigma_i \\Sigma_j^{-1} - \\Sigma_j \\Sigma_i^{-1} \\right) + (\\mu_i - \\mu_j)^T \\left( \\Sigma_j^{-1} - \\Sigma_i^{-1} \\right) (\\mu_i - \\mu_j) \\right] \\qquad (3)$$\n\nThe above is zero when the two covariance matrices are equal. This implies that for Gaussians with equal covariance matrices, $D(i \\| j) = D(j \\| i)$ and our correction term is optimal. This is the same as the condition for Fisher\u2019s linear discriminant analysis (LDA). Moreover, in the multivariate case $T_{ij}$ again penalizes the class that has larger covariance.\n\n2.1 Results\n\nWe tried this method (assuming that $\\hat{P}_{ij} = 0$) on a medium vocabulary speech recognition task. In our case the likelihood functions $\\hat{p}(x \\mid i)$ are hidden Markov model (HMM) scores (footnote 3). The task we chose is NYNEX PHONEBOOK [4], an isolated word speech corpus. Details of the experimental setup, training/test sets, and model topologies are described in [1] (footnote 4).\n\nVOCAB SIZE | WER (%) $L_{ij}$ | WER (%) $S_{ij}$\n75 | 2.33584 | 1.91561\n150 | 3.31072 | 2.89833\n300 | 5.22513 | 4.51365\n600 | 7.39268 | 6.18517\n\nTable 1: Word error rates (WER) for likelihood ratio $L_{ij}$ and augmented likelihood ratio $S_{ij}$ based classification for various numbers of classes (VOCAB SIZE).\n\nIn general, there are a number of ways to compute $\\hat{T}_{ij}$. These include 1) analytically, using estimated model parameters (possible, for example, with Gaussian densities), 2) computing the KL-divergences on training data using a law-of-large-numbers-like average of likelihood ratios and using training-data estimated model parameters, 3) doing the same as 2 but using test data where hypothesized answers come from a first pass of $\\hat{L}_{ij}$-based classification, and 4) Monte-Carlo methods where again the same procedure as 2 is used, but the data is sampled from the training-data estimated distributions. For HMMs, method 1 above is not possible. Also, the data set we used (PHONEBOOK) uses different classes for the training and test sets. In other words, the training and test vocabularies are different. 
During training, phone models are constructed that are pieced together for the test vocabularies. Therefore, method 2 above is also not possible for this data.\n\nEither method 3 or 4 can be used in our case, and we used method 3 in all our experiments. Of course, using the true test labels in method 3 would be the ideal measure of the degree of confusion between models, but these are not available (see Figure 2, however, showing the results of a cheating experiment). Therefore, we use the hypothesized labels from a first stage to compute $\\hat{T}_{ij}$.\n\nThe procedure thus is as follows: 1) obtain $\\hat{p}(x \\mid i)$ using maximum likelihood EM training, 2) classify the test set using only $\\hat{L}_{ij}$ and record the error rate, 3) using the hypothesized class labels (answers with errors) from step 2, compute $\\hat{T}_{ij}$, 4) re-classify the test set using the score $\\hat{S}_{ij}$ and record the new error rate. $\\hat{S}_{ij}$ is used if either one of $\\hat{D}(i \\| j)$ or $\\hat{D}(j \\| i)$ is below a threshold (i.e., when a likely confusion exists); otherwise $\\hat{L}_{ij}$ is used for classification.\n\nFootnote 3: Using 4 states per phone, 12 Gaussian mixtures per state HMMs, totaling 200k free model parameters for the system.\n\nFootnote 4: Note, however, that error results here are reported on the development set, i.e., PHONEBOOK lists {a,b,c,d}{o,y}.\n\nTable 1 shows the result of this experiment. The first column shows the vocabulary size of the system (identical to the number of classes) (footnote 5). The second column shows the word error rate (WER) using just $L_{ij}$, and the third column shows WER using $S_{ij}$. As can be seen, the WER decreases significantly with this approach. Note also that no additional free parameters are used to obtain these improvements.\n\nVOCAB | $L_{ij}$ | RAND1 | RAND500 | RAND1000 | WORLD CUP\n75 | 2.33584 | 1.87198 | 1.82047 | 1.91467 | 2.12777\n150 | 3.31072 | 2.88505 | 2.71881 | 2.72809 | 2.79516\n300 | 5.22513 | 4.41428 | 4.34608 | 4.28930 | 3.81583\n600 | 7.39268 | 6.15828 | 6.13085 | 5.91440 | 5.93883\n\nTable 2: The WER under different tournament strategies.\n\n3 Playing Games\n\nWe may view either $L_{ij}$ or $S_{ij}$ as providing a score of class $i$ over $j$: when positive, class $i$ wins, and when negative, class $j$ wins. In general, the classification procedure may be viewed as a tournament-style game, where for a given sample $x$, different classes correspond to different players. Players pair together and play each other, and the winner goes on to play another match with a different player. The strategy leading to Table 1 required a particular class presentation order; in that case the order was just the numeric ordering of the arbitrarily assigned integer classes (corresponding to words in this case).\n\nOf course when $L_{ij}$ alone is used, the order of the comparisons does not matter, leading to a transitive game [5] (the order of player pairings does not change the final winner). The quantity $S_{ij}$, however, is not guaranteed to be transitive, and when used in a tournament it results in what is called an intransitive game [5]. This means, for example, that $A$ might win over $B$, who might win over $C$, who then might win over $A$. 
Games may be depicted as directed graphs, where an edge between two players points towards the winner. In an intransitive game, the graph contains directed cycles. There has been very little research on intransitive game strategies; there are in fact a number of philosophical issues relating to whether such games are valid or truly exist. Nevertheless, we derived a number of tournament strategies for playing such intransitive games and evaluated their performance in the following.\n\nBroadly, there are two tournament types that we considered. Given a particular ordering of the classes $c_1, c_2, \\ldots, c_M$, we define a sequential tournament when $c_1$ plays $c_2$, the winner plays $c_3$, the winner plays $c_4$, and so on. We also define a tree-based tournament when $c_1$ plays $c_2$, $c_3$ plays $c_4$, and so on. The tree-based tournament is then applied recursively on the resulting $M/2$ winners until a final winner is found.\n\nBased on the above, we investigated several intransitive game playing strategies. For RAND1, we just choose a single random tournament order in a sequential tournament. For RAND500, we run 500 sequential tournaments, each one with a different random order. The ultimate winner is taken to be the player who wins the most tournaments. The third strategy (RAND1000) plays 1000 rather than 500 tournaments. The final strategy is inspired by world-cup soccer tournaments: given a randomly generated permutation, the class sequence is separated into 8 groups. We pick the winner of each group using a sequential tournament (the \u201cregionals\u201d). Then a tree-based tournament is used on the group winners.\n\nFootnote 5: The 75-word case is an average result of 8 experiments, the 150-word case is an average of 4 cases, and the 300-word case is an average of 2 cases. There are 7291 separate test samples in the 600-word case, and on average about 911 samples per 75-word test case.\n\nvocabulary | mean $2^H$ (500) | var (500) | max (500) | mean $2^H$ (1000) | var (1000) | max (1000)\n75 | 1.0047 | 0.0071 | 2.7662 | 1.0285 | 0.0759 | 3.8230\n150 | 1.0061 | 0.0126 | 3.6539 | 1.0118 | 0.0263 | 3.8724\n300 | 1.0241 | 0.0551 | 4.0918 | 1.0170 | 0.0380 | 3.9072\n600 | 1.0319 | 0.0770 | 5.0460 | 1.0533 | 0.1482 | 5.5796\n\nTable 3: The statistics of winners. Columns 2-4: 500 random tournaments, Columns 5-7: 1000 random tournaments.\n\nTable 2 compares these different strategies. As can be seen, the results get slightly better (particularly with a larger number of classes) as the number of tournaments increases. Finally, the single world cup strategy does surprisingly well for the larger class sizes. Note that the improvements are statistically significant over the baseline (at the 0.002 level using a difference of proportions significance test) and the improvements are more dramatic for increasing vocabulary size. Furthermore, it appears that the larger vocabulary sizes benefit more from the larger number (1000 rather than 500) of random tournaments.\n\n[Figure 1 panels: y-axis probability of error (%); x-axis length of cycle (left) and number of cycles detected (right)]\n\nFigure 1: 75-word vocabulary case. 
Left: probability of error given that there exists a cycle of at least the given length (a cycle length of one means no cycle found). Right: probability of error given that at least the given number of cycles exist.\n\n3.1 Empirical Analysis\n\nIn order to better understand our results, this section analyzes the 500 and 1000 random tournament strategies described above. Each set of random tournaments produces a set of winners which may be described by a histogram. The entropy $H$ of that histogram describes its spread, and the number of typical winners is approximately $2^H$. This is of course relative to each sample $x$, so we may look at the average ($\\overline{2^H}$), variance, and maximum of this number (the minimum is 1.0 in every case). This is given in Table 3 for the 500 and 1000 cases.\n\nThe table indicates that there is typically only one winner since $\\overline{2^H}$ is approximately 1 and the variances are small. This shows further that the winner is typically not in a cycle, as the existence of a directed cycle in the tournament graph would probably lead to different winners for each random tournament. The relationship between properties of cycles and WER is explored below.\n\nWhen the tournament is intransitive (and therefore the graph possesses a cycle), our second analysis shows that the probability of error tends to increase. 
This is shown in Figure 1, where the error probability increases both as the detected cycle length and as the number of detected cycles increase (footnote 6). This property suggests that the existence of intransitivity could be used as a confidence measure, or could be used to try to reduce errors.\n\nAs an attempt at the latter, we evaluated two very simple heuristics that try to eliminate cycles as detected during classification. In the first method (skip), we run a sequential tournament (using a random class ordering) until either a clear winner is found (a transitive game), or a cycle is detected. If a cycle is detected, we select two players not in the cycle, effectively jumping out of the cycle, and continue playing until the end of the class ordering. If a winner cannot be determined (because there are too few players remaining), we back off and use $L_{ij}$ to select the winner. In a second method (break), if a cycle is detected, we eliminate the class having the smallest likelihood from that cycle, and then continue playing as before. Neither method detects all the cycles in the graph (their number can be exponentially large).\n\nvocabulary | $L_{ij}$ | skip WER | skip #cycles(%) | break WER | break #cycles(%)\n75 | 2.33584 | 1.90237 | 13.89 | 1.90223 | 9.34\n150 | 3.31072 | 2.76814 | 19.6625 | 2.67814 | 16.83\n300 | 5.22513 | 4.46296 | 22.38 | 4.46296 | 21.34\n600 | 7.39268 | 6.50117 | 31.96 | 6.50117 | 31.53\n\nTable 4: WER results using two strategies (skip and break) that utilize information about cycles in the tournament graphs, compared to baseline $L_{ij}$. The #cycles(%) columns show the number of cycles detected relative to the number of samples in each case.\n\nAs can be seen, the WER results still provide significant improvements over the baseline, but are no better than earlier results. 
Because the tournament strategy is coupled with cycle detection, the cycles detected are different in each case (the second method detecting fewer cycles presumably because the eliminated class is in multiple cycles). In any case, it is apparent that further work is needed to investigate the relationship between the existence and properties of cycles and methods to utilize this information.\n\nFootnote 6: Note that this shows a lower bound on the number of cycles detected. This is saying that if we find, for example, four or more cycles then the chance of error is high.\n\n4 Iterative Determination of KL-divergence\n\nIn all of our experiments so far, KL-divergence is calculated according to the initial hypothesized answers. We would expect that using the true answers to determine the KL-divergence would improve our results further. The top horizontal lines in Figure 2 show the original baseline results, and the bottom lines show the results using the true answers (a cheating experiment) to determine the KL-divergence. As can be seen, the improvement is significant, thereby confirming that using $T_{ij}$ can significantly improve classification performance. Note also that the relative improvement stays about constant with increasing vocabulary size.\n\nThis further indicates that an iterative strategy for determining KL-divergence might further improve our results. In this case, $L_{ij}$ is used to determine the answers to compute the first set of KL-divergences used in $S_{ij}$. This is then used to compute a new set of answers, which is then used to compute a new score $S_{ij}$, and so on. The remaining plots in Figure 2 show the results of this strategy for the 500 and 1000 random trials case (i.e., the answers used to compute the KL-divergences in each case are obtained from the previous set of random tournaments using the histogram peak procedure described earlier). Rather surprisingly, the results show that iterating in this fashion does not influence the results in any appreciable way; the WERs seem to decrease only slightly from their initial drop. It is the case, however, that as the number of random tournaments increases, the results become closer to the ideal as the vocabulary size increases. We are currently studying further such iterative procedures for recomputing the KL-divergences.\n\n[Figure 2 panels: 75 classes, 150 classes, 300 classes, 600 classes; y-axis word error rate (%); x-axis number of iterations; legend: baseline, cheating, 500 trials, 1000 trials]\n\nFigure 2: Baseline using likelihood ratio (top lines), cheating results using correct answers for KL-divergence (bottom lines), and the iterative determination of KL-distance using hypothesized answers from previous iteration (middle lines).\n\n5 Discussion and Conclusion\n\nWe have introduced a correction term to the likelihood ratio classification method that is justified by the difference between the estimated and true class conditional probabilities $\\hat{p}(x \\mid i)$, $p(x \\mid i)$. The correction term $T_{ij}$ is an estimate of the classification bias that would optimally compensate for these differences. The presence of $T_{ij}$ makes the class comparisons intransitive and we introduce several tournament-like strategies to compensate. While the introduction of $T_{ij}$ consistently improves the classification results, further improvements are obtained by the selection of the comparison strategy. Further details and results of our methods will appear in forthcoming publications and technical reports.\n\nReferences\n\n[1] J. Bilmes. Natural Statistical Models for Automatic Speech Recognition. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 1999.\n\n[2] T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley, 1991.\n\n[3] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley and Sons, Inc., 2000.\n\n[4] J. Pitrelli, C. Fong, S.H. Wong, J.R. Spitz, and H.C. Leung. PhoneBook: A phonetically-rich isolated-word telephone-speech database. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1995.\n\n[5] P.D. Straffin. Game Theory and Strategy. The Mathematical Association of America, 1993.", "award": [], "sourceid": 2084, "authors": [{"given_name": "Jeff", "family_name": "Bilmes", "institution": null}, {"given_name": "Gang", "family_name": "Ji", "institution": null}, {"given_name": "Marina", "family_name": "Meila", "institution": null}]}