{"title": "Pairwise Neural Network Classifiers with Probabilistic Outputs", "book": "Advances in Neural Information Processing Systems", "page_first": 1109, "page_last": 1116, "abstract": null, "full_text": "Pairwise Neural Network Classifiers with \n\nProbabilistic Outputs \n\nDavid Price \nA2iA and ESPCI \n\n3 Rue de l'Arrivee, BP 59 \n\n75749 Paris Cedex 15, France \n\na2ia@dialup.francenet.fr \n\nStefan Knerr \n\nESPCI and CNRS (UPR AOOO5) \n\n10, Rue Vauquelin, 75005 Paris, France \n\nknerr@neurones.espci.fr \n\nLeon Personnaz, Gerard Dreyfus \n\nESPeI, Laboratoire d'Electronique \n\n10, Rue Vauquelin, 75005 Paris, France \n\ndreyfus@neurones.espci.fr \n\nAbstract \n\nMulti-class classification problems can be efficiently solved by \npartitioning the original problem into sub-problems involving only two \nclasses: for each pair of classes, a (potentially small) neural network is \ntrained using only the data of these two classes. We show how to \ncombine the outputs of the two-class neural networks in order to obtain \nposterior probabilities for the class decisions. The resulting probabilistic \npairwise classifier is part of a handwriting recognition system which is \ncurrently applied to check reading. We present results on real world data \nbases and show that, from a practical point of view, these results compare \nfavorably to other neural network approaches. \n\n1 \n\nIntroduction \n\nGenerally, a pattern classifier consists of two main parts: a feature extractor and a \nclassification algorithm. Both parts have the same ultimate goal, namely to transform a \ngiven input pattern into a representation that is easily interpretable as a class decision. In \nthe case of feedforward neural networks, the interpretation is particularly easy if each class \nis represented by one output unit. 
For many pattern recognition problems, it suffices that \nthe classifier compute the class of the input pattern, in which case it is common practice to \nassociate the pattern with the class corresponding to the maximum output of the classifier. \nOther problems require graded (soft) decisions, such as probabilities, at the output of the \nclassifier for further use in higher context levels: in speech or character recognition for \ninstance, the probabilistic outputs of the phoneme (character) recognizer are often used by a \nHidden-Markov-Model algorithm or by some other dynamic programming algorithm to \ncompute the most probable word hypothesis. \nIn the context of classification, it has been shown that the minimization of the Mean \nSquare Error (MSE) yields estimates of a posteriori class probabilities [Bourlard & \nWellekens, 1990; Duda & Hart, 1973]. The minimization can be performed by \nfeedforward multilayer perceptrons (MLPs) using the backpropagation algorithm, which is \none of the reasons why MLPs are widely used for pattern recognition tasks. However, \nMLPs have well-known limitations when coping with real-world problems, namely long \ntraining times and unknown architecture. \nIn the present paper, we show that the estimation of posterior probabilities for a K-class \nproblem can be performed efficiently using estimates of posterior probabilities for K(K-1)/2 \ntwo-class sub-problems. Since the number of sub-problems increases as K², this procedure \nwas originally intended for applications involving a relatively small number of classes, \nsuch as the 10 classes for the recognition of handwritten digits [Knerr et al., 1992]. In this \npaper we show that this approach is also viable for applications with K » 10. 
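The pairwise decomposition described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the helper name and the random toy data are ours:

```python
from itertools import combinations
import numpy as np

def make_pairwise_subproblems(X, y, num_classes):
    """One (X_pair, y_pair, (i, j)) training set per pair of classes.

    Each two-class network would be trained only on its pair's data,
    with class i relabeled 1 and class j relabeled 0.
    """
    subproblems = []
    for i, j in combinations(range(num_classes), 2):
        mask = (y == i) | (y == j)
        y_pair = (y[mask] == i).astype(float)
        subproblems.append((X[mask], y_pair, (i, j)))
    return subproblems

rng = np.random.default_rng(0)
K = 10  # e.g. handwritten digits
X = rng.normal(size=(200, 8))    # toy feature vectors
y = rng.integers(0, K, size=200)
subs = make_pairwise_subproblems(X, y, K)
print(len(subs))  # K(K-1)/2 sub-problems
```

For K = 10 this yields 45 two-class training sets; the quadratic growth in K is why the approach was first aimed at small class counts.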
\nThe probabilistic pairwise classifier presented in this paper is part of a handwriting \nrecognition system, discussed elsewhere [Simon, 1992], which is currently applied to check \nreading. The purpose of our character recognizer is to classify pre-segmented characters from \ncursive handwriting. The probabilistic outputs of the recognizer are used to estimate word \nprobabilities. We present results on real-world data involving 27 classes, compare these \nresults to other neural network approaches, and show that our probabilistic pairwise \nclassifier is a powerful tool for computing posterior class probabilities in pattern \nrecognition problems. \n\n2 Probabilistic Outputs from Two-class Classifiers \n\nMulti-class classification problems can be efficiently solved by \"divide and conquer\" \nstrategies which partition the original problem into a set of K(K-1)/2 two-class problems. \nFor each pair of classes ωi and ωj, a (potentially small) neural network with a single \noutput unit is trained on the data of the two classes [Knerr et al., 1990, and references \ntherein]. In this section, we show how to obtain probabilistic outputs from each of the \ntwo-class classifiers in the pairwise neural network classifier (Figure 1). \n\n[Figure: the inputs feed K(K-1)/2 two-class networks] \n\nFigure 1: Pairwise neural network classifier. \n\nIt has been shown that the minimization of the MSE cost function (or likewise a cost \nfunction based on an entropy measure, [Bridle, 1990]) leads to estimates of posterior \nprobabilities. Of course, the quality of the estimates depends on the number and distribution \nof examples in the training set and on the minimization method used. 
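The link between MSE training and posterior probabilities can be illustrated with a single sigmoid unit on synthetic one-dimensional data. This is a sketch under our own toy setup, not the paper's experiment; the learning rate and iteration count are arbitrary, and only the equal-variance Gaussian case is shown:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two 1-D Gaussian classes, equal variance, equal priors.
m1, m2, s = -1.0, 1.0, 1.0
x = np.concatenate([rng.normal(m1, s, 500), rng.normal(m2, s, 500)])
t = np.concatenate([np.ones(500), np.zeros(500)])  # target 1 for class 1

# Single sigmoid unit trained by gradient descent on the MSE.
w, b = 0.0, 0.0
for _ in range(2000):
    out = 1.0 / (1.0 + np.exp(-(w * x + b)))
    # Gradient of the squared error w.r.t. the pre-activation
    # (the constant factor 2 is absorbed into the learning rate).
    grad = (out - t) * out * (1.0 - out)
    w -= 0.5 * np.mean(grad * x)
    b -= 0.5 * np.mean(grad)

# For these parameters the Bayes posterior Pr(class 1 | x) is exactly
# 1 / (1 + exp(2x)), so the trained unit should move w toward -2 and
# b toward 0: the MSE-trained output approximates the true posterior.
```

The same effect holds for the multilayer two-class networks of Figure 1, subject to the caveats above about sample size and the quality of the minimization.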
\nIn the theoretical case of two classes ω1 and ω2, each Gaussian distributed, with means \nm1 and m2, a priori probabilities Pr1 and Pr2, and equal covariance matrices Σ, the \nposterior probability of class ω1 given the pattern x is: \n\nPr(class=