John Denker, Yann LeCun
(1) The outputs of a typical multi-output classification network do not satisfy the axioms of probability; probabilities should be positive and sum to one. This problem can be solved by treating the trained network as a preprocessor that produces a feature vector that can be further processed, for instance by classical statistical estimation techniques. (2) We present a method for computing the first two moments ofthe probability distribution indicating the range of outputs that are consistent with the input and the training data. It is particularly useful to combine these two ideas: we implement the ideas of section 1 using Parzen windows, where the shape and relative size of each window is computed using the ideas of section 2. This allows us to make contact between important theoretical ideas (e.g. the ensemble formalism) and practical techniques (e.g. back-prop). Our results also shed new light on and generalize the well-known "soft max" scheme.
1 Distribution of Categories in Output Space
In many neural-net applications, it is crucial to produce a set of C numbers that serve as estimates of the probability of C mutually exclusive outcomes. For exam(cid:173) ple, in speech recognition, these numbers represent the probability of C different phonemes; the probabilities of successive segments can be combined using a Hidden Markov Model. Similarly, in an Optical Character Recognition ("OCR") applica(cid:173) tion, the numbers represent C possible characters. Probability information for the "best guess" category (and probable runner-up categories) is combined with con(cid:173) text, cost information, etcetera, to produce recognition of multi-character strings.