{"title": "Transforming Neural-Net Output Levels to Probability Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 853, "page_last": 859, "abstract": null, "full_text": "Transforming Neural-Net Output Levels \n\nto Probability Distributions \n\nJohn S. Denker and Yann leCun \n\nAT&T Bell Laboratories \n\nHolmdel, NJ 07733 \n\nAbstract \n\n(1) The outputs of a typical multi-output classification network do not \nsatisfy the axioms of probability; probabilities should be positive and sum \nto one. This problem can be solved by treating the trained network as a \npreprocessor that produces a feature vector that can be further processed, \nfor instance by classical statistical estimation techniques. (2) We present a \nmethod for computing the first two moments ofthe probability distribution \nindicating the range of outputs that are consistent with the input and the \ntraining data. It is particularly useful to combine these two ideas: we \nimplement the ideas of section 1 using Parzen windows, where the shape \nand relative size of each window is computed using the ideas of section 2. \nThis allows us to make contact between important theoretical ideas (e.g. \nthe ensemble formalism) and practical techniques (e.g. back-prop). Our \nresults also shed new light on and generalize the well-known \"soft max\" \nscheme. \n\n1 Distribution of Categories in Output Space \n\nIn many neural-net applications, it is crucial to produce a set of C numbers that \nserve as estimates of the probability of C mutually exclusive outcomes. For exam(cid:173)\nple, in speech recognition, these numbers represent the probability of C different \nphonemes; the probabilities of successive segments can be combined using a Hidden \nMarkov Model. Similarly, in an Optical Character Recognition (\"OCR\") applica(cid:173)\ntion, the numbers represent C possible characters. 
Probability information for the \"best guess\" category (and probable runner-up categories) is combined with context, cost information, etcetera, to produce recognition of multi-character strings.

853

According to the axioms of probability, these C numbers should be constrained to be positive and sum to one. We find that rather than modifying the network architecture and/or training algorithm to satisfy this constraint directly, it is advantageous to use a network without the probabilistic constraint, followed by a statistical postprocessor. Similar strategies have been discussed before, e.g. (Fogelman, 1990).

The obvious starting point is a network with C output units. We can train the network with targets that obey the probabilistic constraint, e.g. the target for category \"0\" is [1, 0, 0, ...], the target for category \"1\" is [0, 1, 0, ...], etcetera. This would not, alas, guarantee that the actual outputs would obey the constraint. Of course, the actual outputs can always be shifted and normalized to meet the requirement; one of the goals of this paper is to understand the best way to perform such a transformation. A more sophisticated idea would be to construct a network that had such a transformation (e.g. softmax (Bridle, 1990; Rumelhart, 1989)) \"built in\" even during training. We tried this idea and discovered numerous difficulties, as discussed in (Denker and leCun, 1990).

The most principled solution is simply to collect statistics on the trained network.
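The naive shift-and-normalize transformation mentioned above can be sketched as follows. This is a minimal illustration, assuming targets T- = -1 and T+ = +1; the function name and sample values are ours, not the paper's:

```python
# Naive postprocessor: shift outputs from the target range [t_minus, t_plus]
# into [0, 1], then rescale so they sum to one.  The paper argues that a
# statistically grounded postprocessor is preferable to this ad-hoc rule.
def normalize_outputs(outputs, t_minus=-1.0, t_plus=1.0):
    shifted = [(o - t_minus) / (t_plus - t_minus) for o in outputs]
    total = sum(shifted)
    return [s / total for s in shifted]

raw = [0.9, -0.8, -0.7, 0.1]  # hypothetical raw outputs for C = 4 categories
probs = normalize_outputs(raw)
print(probs)  # nonnegative, sums to one
```

The result satisfies the axioms of probability but, as the paper notes, choosing the transformation well is exactly the open question.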
Figures 1 and 2 are scatter plots of output from our OCR network (Le Cun et al., 1990) that was trained to recognize the digits \"0\" through \"9\". In the first figure, the outputs tend to cluster around the target vectors [the points (T-, T+) and (T+, T-)], and even though there are a few stragglers, decision regions can be found that divide the space into a high-confidence \"0\" region, a high-confidence \"1\" region, and a quite small \"rejection\" region. In the other figure, it can be seen that the \"3 versus 5\" separation is very challenging.

In all cases, the plotted points indicate the output of the network when the input image is taken from a special \"calibration\" dataset L that is distinct both from the training set M (used to train the network) and from the testing set G (used to evaluate the generalization performance of the final, overall system).

This sort of analysis is applicable to a wide range of problems. The architecture of the neural network (or other adaptive system) should be chosen to suit the problem in each case. The network should then be trained using standard techniques. The hope is that the output will constitute a sufficient statistic.

Given enough training data, we could use a standard statistical technique such as Parzen windows (Duda and Hart, 1973) to estimate the probability density in output space. It is then straightforward to take an unknown input, calculate the corresponding output vector O, and then estimate the probability that it belongs to each class, according to the density of points of category c \"at\" location O in the scatter plot.

We note that methods such as Parzen windows tend to fail when the number of dimensions becomes too large, because it is exponentially harder to estimate probability densities in high-dimensional spaces; this is often referred to as \"the curse of dimensionality\" (Duda and Hart, 1973).
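The Parzen-windows estimate just described can be sketched as follows. This is a minimal illustration with fixed spherical Gaussian windows; the window width and toy calibration points are our assumptions (sections 2 and 3 replace the assumed window with a computed one):

```python
import math

# Estimate P(c | O = o) from calibration pairs (output_vector, label)
# using spherical Gaussian Parzen windows of a fixed, assumed width.
def parzen_class_probs(o, calibration, width=0.3):
    def window(a, b):
        d2 = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-d2 / (2.0 * width ** 2))
    density = {}
    for vec, label in calibration:
        density[label] = density.get(label, 0.0) + window(o, vec)
    total = sum(density.values())
    return {c: d / total for c, d in density.items()}

# Toy calibration set in a 2-D projection of output space: category 0
# clusters near (T+, T-) = (1, -1), category 1 near (T-, T+) = (-1, 1).
calib = [([0.9, -0.8], 0), ([0.8, -0.9], 0), ([-0.9, 0.9], 1)]
print(parzen_class_probs([0.85, -0.85], calib))
```

A query point near the category-0 cluster receives nearly all the probability mass, mirroring the high-confidence regions visible in the scatter plots.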
Since the number of output units (typically 10 in our OCR network) is much smaller than the number of input units (typically 400) the method proposed here has a tremendous advantage compared to classical statistical methods applied directly to the input vectors. This advantage is increased by the fact that the distribution of points in network-output space is much more regular than the distribution in the original space.

[Figure: two scatter-plot panels, labeled \"Calibration Category 1\" and \"Calibration Category 0\".]

Figure 1: Scatter Plot: Category 1 versus 0. One axis in each plane represents the activation level of output unit j=0, while the other axis represents the activation level of output unit j=1; the other 8 dimensions of output space are suppressed in this projection. Points in the upper and lower plane are, respectively, assigned category \"1\" and \"0\" by the calibration set. The clusters appear elongated because there are so many ways that an item can be neither a \"1\" nor a \"0\". This figure contains over 500 points; the cluster centers are heavily overexposed.

[Figure: two scatter-plot panels, labeled \"Calibration Category 5\" and \"Calibration Category 3\".]

Figure 2: Scatter Plot: Category 5 versus 3. This is the same as the previous figure except for the choice of data points and projection axes.

2 Output Distribution for a Particular Input

The purpose of this section is to discuss the effect that limitations in the quantity and/or quality of training data have on the reliability of neural-net outputs. Only an outline of the argument can be presented here; details of the calculation can be found in (Denker and leCun, 1990). This section does not use the ideas developed in the previous section; the two lines of thought will converge in section 3.
The calculation proceeds in two steps: (1) to calculate the range of weight values consistent with the training data, and then (2) to calculate the sensitivity of the output to uncertainty in weight space. The result is a network that not only produces a \"best guess\" output, but also an \"error bar\" indicating the confidence interval around that output.

The best formulation of the problem is to imagine that the input-output relation of the network is given by a probability distribution P(O, I) [rather than the usual function O = f(I)] where I and O represent the input vector and output vector respectively. For any specific input pattern, we get a probability distribution P(O|I), which can be thought of as a histogram describing the probability of various output values.

Even for a definite input I, the output will be probabilistic, because there is never enough information in the training set to determine the precise value of the weight vector W. Typically there are non-trivial error bars on the training data. Even when the training data is absolutely noise-free (e.g. when it is generated by a mathematical function on a discrete input space (Denker et al., 1987)) the output can still be uncertain if the network is underdetermined; the uncertainty arises from lack of data quantity, not quality. In the real world one is faced with both problems: less than enough data to (over)determine the network, and less than complete confidence in the data that does exist.

We assume we have a handy method (e.g. back-prop) for finding a (local) minimum Ŵ of the loss function E(W). A second-order Taylor expansion should be valid in the vicinity of Ŵ. Since the loss function E is an additive function of training data, and since probabilities are multiplicative, it is not surprising that the likelihood of a weight configuration is an exponential function of the loss (Tishby, Levin and Solla, 1989).
Therefore the probability can be modelled locally as a multidimensional gaussian centered at Ŵ; to a reasonable (Denker and leCun, 1990) approximation the probability is proportional to:

    P(W) ∝ p0(W) exp[ -(β/2) Σ_i h_ii (W_i - Ŵ_i)² ]    (1)

where h is the second derivative of the loss (the Hessian), β is a scale factor that determines our overall confidence in the training data, and p0 expresses any information we have about prior probabilities. The sum runs over the dimensions of parameter space. The width of this gaussian describes the range of networks in the ensemble that are reasonably consistent with the training data.

Because we have a probability distribution on W, the expression O = f_W(I) gives a probability distribution on outputs O, even for fixed inputs I. We find that the most probable output Ô corresponds to the most probable parameters Ŵ. This unsurprising result indicates that we are on the right track.

We next would like to know what range of output values correspond to the allowed range of parameter values. We start by calculating the sensitivity of the output O = f_W(I) to changes in W (holding the input I fixed). For each output unit j, the derivative of O_j with respect to W can be evaluated by a straightforward modification of the usual back-prop algorithm.

Our distribution of output values also has a second moment, which is given by a surprisingly simple expression:

    σ_j² = ⟨(O_j - Ô_j)²⟩ ≈ Σ_i γ_j,i² / (β h_ii)    (2)

where γ_j,i denotes the gradient of O_j with respect to W_i. We now have the first two moments of the output probability distribution (Ô and σ); we could calculate more if we wished.

It is reasonable to expect that the weighted sums (before the squashing function) at the last layer of our network are approximately normally distributed, since they are sums of random variables.
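Once the sensitivities γ and the Hessian diagonal h are in hand, equation 2 is cheap to evaluate. A minimal numerical sketch, with made-up values for γ, h, and β standing in for the quantities that the back-prop-style passes would supply:

```python
# Equation (2): sigma_j^2 ~ sum_i gamma_{j,i}^2 / (beta * h_ii),
# where gamma[j][i] = dO_j/dW_i and h_diag holds the diagonal of the
# Hessian of the loss E(W) at the minimum.
def output_variance(gamma, h_diag, beta):
    return [sum(g * g / (beta * h) for g, h in zip(row, h_diag))
            for row in gamma]

gamma = [[0.2, 0.1], [0.05, 0.3]]  # hypothetical sensitivities: 2 outputs x 2 weights
h_diag = [4.0, 1.0]                # hypothetical Hessian diagonal (positive at a minimum)
beta = 10.0                        # overall confidence in the training data
print(output_variance(gamma, h_diag, beta))
```

Note how a large curvature h_ii shrinks the contribution of weight i: well-determined weights contribute little output uncertainty.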
If the output units are arranged to be reasonably linear, the output distribution is then given by

    P_j(O_j | I) = N(Ô_j, σ_j)    (3)

where N is the conventional Normal (Gaussian) distribution with given mean and variance, and where Ô and σ depend on I. For multiple output units, we must consider the joint probability distribution P(O|I). If the different output units' distributions are independent, P(O|I) can be factored:

    P(O|I) = Π_j P_j(O_j | I)    (4)

We have achieved the goal of this section: a formula describing a distribution of outputs consistent with a given input. This is a much fancier statement than the vanilla network's statement that Ô is \"the\" output. For a network that is not underdetermined, in the limit β → ∞, P(O|I) becomes a δ function located at Ô, so our formalism contains the vanilla network as a special case. For general β, the region where P(O|I) is large constitutes a \"confidence region\" of size proportional to the fuzziness 1/β of the data and to the degree to which the network is underdetermined. Note that algorithms exist (Becker and Le Cun, 1989), (Le Cun, Denker and Solla, 1990) for calculating γ and h very efficiently - the time scales linearly with the time of calculation of O. Equation 4 is remarkable in that it makes contact between important theoretical ideas (e.g. the ensemble formalism) and practical techniques (e.g. back-prop).

3 Combining the Distributions

Our main objective is an expression for P(c|I), the probability that input I should be assigned category c. We get it by combining the idea that elements of the calibration set L are scattered in output space (section 1) with the idea that the network output for each such element is uncertain because the network is underdetermined (section 2). We can then draw a scatter plot in which the calibration data is represented not by zero-size points but by distributions in output space.
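Under the independence assumption of equation 4, the joint density is just a product of one-dimensional Gaussians. A minimal sketch; the mean and sigma vectors are hypothetical stand-ins for the Ô and σ computed in section 2:

```python
import math

# Equations (3)-(4): factored Gaussian output distribution P(O|I),
# with per-unit means o_hat (the most probable output) and standard
# deviations sigma (the error bars of equation 2).
def output_density(o, o_hat, sigma):
    p = 1.0
    for x, m, s in zip(o, o_hat, sigma):
        p *= math.exp(-(x - m) ** 2 / (2.0 * s * s)) / (s * math.sqrt(2.0 * math.pi))
    return p

o_hat = [0.9, -0.8]  # hypothetical most-probable output
sigma = [0.1, 0.2]   # hypothetical error bars from equation 2
print(output_density(o_hat, o_hat, sigma))        # density at the peak
print(output_density([0.5, -0.5], o_hat, sigma))  # smaller, away from the peak
```

As β grows the sigmas shrink and this density approaches the δ function of the vanilla network, as the text observes.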
One can imagine each element of L as covering the area spanned by its \"error bars\" of size σ as given by equation 2. We can then calculate P(c|I) using ideas analogous to Parzen windows, with the advantage that the shape and relative size of each window is calculated, not assumed. The answer comes out to be:

    P(c|I) = ∫ [ Σ_{I'∈L_c} P(O|I') / Σ_{I'∈L} P(O|I') ] P(O|I) dO    (5)

where we have introduced L_c to denote the subset of L for which the assigned category is c. Note that P(O|I) (given by equation 4) is being used in two ways in this formula: to calibrate the statistical postprocessor by summing over the elements of L, and also to calculate the fate of the input I (an element of the testing set).

Our result can be understood by analogy to Parzen windows, although it differs from the standard Parzen windows scheme in two ways. First, it is pleasing that we have a way of calculating the shape and relative size of the windows, namely P(O|I). Secondly, after we have summed the windows over the calibration set L, the standard scheme would probe each window at the single point Ô; our expression (equation 5) accounts for the fact that the network's response to the testing input I is blurred over a region given by P(O|I) and calls for a convolution.

Correspondence with Softmax

We were not surprised that, in suitable limits, our formalism leads to a generalization of the highly useful \"softmax\" scheme (Bridle, 1990; Rumelhart, 1989). This provides a deeper understanding of softmax and helps put our work in context. The first factor in equation 5 is a perfectly well-defined function of O, but it could be impractical to evaluate it from its definition (summing over the calibration set) whenever it is needed. Therefore we sought a closed-form approximation for it.
After making some ruthless approximations and carrying out the integration in equation 5, it reduces to

    P(c|I) = exp[ T_Δ (O_c - T_0) / σ_cc² ] / Σ_{c'} exp[ T_Δ (O_{c'} - T_0) / σ_{c'c'}² ]    (6)

where T_Δ is the difference between the target values (T+ - T-), T_0 is the average of the target values, and σ_cj is the second moment of output unit j for data in category c. This can be compared to the standard softmax expression

    P(c|I) = exp[ G O_c ] / Σ_{c'} exp[ G O_{c'} ]    (7)

We see that our formula has three advantages: (1) it is clear how to handle the case where the targets are not symmetric about zero (non-vanishing T_0); (2) the \"gain\" of the exponentials depends on the category c; and (3) the gains can be calculated from measurable¹ properties of the data. Having the gain depend on the category makes a lot of sense; one can see in the figures that some categories are more tightly clustered than others. One weakness that our equation 6 shares with softmax is the assumption that the output distribution of each output j is circular (i.e. independent of c). This can be remedied by retracting some of the approximations leading to equation 6.

¹ Our formulas contain the overall confidence factor β, which is not as easily measurable as we would like.

Summary: In a wide range of applications, it is extremely important to have good estimates of the probability of correct classification (as well as runner-up probabilities). We have shown how to create a network that computes the parameters of a probability distribution (or confidence interval) describing the set of outputs that are consistent with a given input and with the training data. The method has been described in terms of neural nets, but applies equally well to any parametric estimation technique that allows calculation of second derivatives.
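The contrast between the category-dependent gains of equation 6 and the single gain of equation 7 can be made concrete. A small sketch with invented outputs and second moments; the category with the tightest cluster (smallest σ²) receives the sharpest probabilities:

```python
import math

# Standard softmax, equation (7), with a single gain G for all categories.
def softmax(outputs, gain=1.0):
    exps = [math.exp(gain * o) for o in outputs]
    z = sum(exps)
    return [e / z for e in exps]

# Equation (6): per-category gains t_delta / sigma2[c], where sigma2[c]
# is the (assumed) second moment of output unit c for calibration data
# of category c, t_delta = T+ - T-, and t0 is the average target.
def generalized_softmax(outputs, sigma2, t_delta=2.0, t0=0.0):
    exps = [math.exp(t_delta * (o - t0) / s2) for o, s2 in zip(outputs, sigma2)]
    z = sum(exps)
    return [e / z for e in exps]

out = [0.7, 0.6, -0.9]  # hypothetical network outputs for 3 categories
print(softmax(out))
print(generalized_softmax(out, sigma2=[0.1, 0.5, 0.5]))
```

With a tight cluster (σ² = 0.1) for category 0, the generalized form assigns it far more probability than the uniform-gain softmax does, which is the behavior the figures motivate.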
The analysis outlined here makes clear the assumptions inherent in previous schemes and offers a well-founded way of calculating the required probabilities.

References

Becker, S. and Le Cun, Y. (1989). Improving the Convergence of Back-Propagation Learning with Second-Order Methods. In Touretzky, D., Hinton, G., and Sejnowski, T., editors, Proc. of the 1988 Connectionist Models Summer School, pages 29-37, San Mateo. Morgan Kaufmann.

Bridle, J. S. (1990). Training Stochastic Model Recognition Algorithms as Networks can lead to Maximum Mutual Information Estimation of Parameters. In Touretzky, D., editor, Advances in Neural Information Processing Systems, volume 2, (Denver, 1989). Morgan Kaufmann.

Denker, J. and leCun, Y. (1990). Transforming Neural-Net Output Levels to Probability Distributions. Technical Memorandum TM11359-901120-05, AT&T Bell Laboratories, Holmdel NJ 07733.

Denker, J., Schwartz, D., Wittner, B., Solla, S. A., Howard, R., Jackel, L., and Hopfield, J. (1987). Automatic Learning, Rule Extraction and Generalization. Complex Systems, 1:877-922.

Duda, R. and Hart, P. (1973). Pattern Classification And Scene Analysis. Wiley and Sons.

Fogelman, F. (1990). Personal communication.

Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1990). Handwritten Digit Recognition with a Back-Propagation Network. In Touretzky, D., editor, Advances in Neural Information Processing Systems, volume 2, (Denver, 1989). Morgan Kaufmann.

Le Cun, Y., Denker, J. S., and Solla, S. (1990). Optimal Brain Damage. In Touretzky, D., editor, Advances in Neural Information Processing Systems, volume 2, (Denver, 1989). Morgan Kaufmann.

Rumelhart, D. E. (1989). Personal communication.

Tishby, N., Levin, E., and Solla, S. A. (1989).
Consistent Inference of Probabilities in Layered Networks: Predictions and Generalization. In Proceedings of the International Joint Conference on Neural Networks, Washington DC.

It is a pleasure to acknowledge useful conversations with John Bridle.
", "award": [], "sourceid": 419, "authors": [{"given_name": "John", "family_name": "Denker", "institution": null}, {"given_name": "Yann", "family_name": "LeCun", "institution": null}]}