Part of Advances in Neural Information Processing Systems 7 (NIPS 1994)
Christopher Williams, Michael Revow, Geoffrey E. Hinton
Deformable models are an attractive approach to recognizing non(cid:173) rigid objects which have considerable within class variability. How(cid:173) ever, there are severe search problems associated with fitting the models to data. We show that by using neural networks to provide better starting points, the search time can be significantly reduced. The method is demonstrated on a character recognition task.
In previous work we have developed an approach to handwritten character recogni(cid:173) tion based on the use of deformable models (Hinton, Williams and Revow, 1992a; Revow, Williams and Hinton, 1993). We have obtained good performance with this method, but a major problem is that the search procedure for fitting each model to an image is very computationally intensive, because there is no efficient algorithm (like dynamic programming) for this task. In this paper we demonstrate that it is possible to "compile down" some of the knowledge gained while fitting models to data to obtain better starting points that significantly reduce the search time.
1 DEFORMABLE MODELS FOR DIGIT RECOGNITION
The basic idea in using deformable models for digit recognition is that each digit has a model, and a test image is classified by finding the model which is most likely to have generated it. The quality of the match between model and test image depends on the deformation of the model, the amount of ink that is attributed to noise and the distance of the remaining ink from the deformed model.
·Current address: Department of Computer Science and Applied Mathematics, Aston
University, Birmingham B4 7ET, UK.
Christopher K. T. Williams, Michael D. Revow, Geoffrey E. Hinton
More formally, the two important terms in assessing the fit are the prior probabil(cid:173) ity distribution for the instantiation parameters of a model (which penalizes very distorted models), and the imaging model that characterizes the probability distri(cid:173) bution over possible images given the instantiated model l . Let I be an image, M be a model and z be its instantiation parameters. Then the evidence for model M is given by
P(IIM) = J P(zIM)P(IIM, z)dz
The first term in the integrand is the prior on the instantiation parameters and the second is the imaging model i.e., the likelihood of the data given the instantiated model. P(MII) is directly proportional to P(IIM), as we assume a uniform prior on each digit. Equation 1 is formally correct, but if z has more than a few dimensions the evalua(cid:173) tion of this integral is very computationally intensive. However, it is often possible to make an approximation based on the assumption that the integrand is strongly peaked around a (global) maximum value z*. In this case, the evidence can be ap(cid:173) proximated by the highest peak of the integrand times a volume factor ~(zII, M), which measures the sharpness of the peak2 .
P(IIM) ~ P(zIM)P(Ilz, M)~(zII, M)
(2) By Taylor expanding around z* to second order it can be shown that the volume factor depends on the determinant of the Hessian of 10gP(z, 11M) . Taking logs of equation 2, defining EdeJ as the negative log of P(z*IM), and EJit as the cor(cid:173) responding term for the imaging model, then the aim of the search is to find the minimum of E tot = EdeJ + EJit . Of course the total energy will have many local minima; for the character recognition task we aim to find the global minimum by using a continuation method (see section 1.2).
1.1 SPLINES, AFFINE TRANSFORMS AND IMAGING MODELS
This section presents a brief overview of our work on using deformable models for digit recognition. For a fuller treatment, see Revow, Williams and Hinton (1993) .
Each digit is modelled by a cubic B-spline whose shape is determined by the posi(cid:173) tions of the control points in the object-based frame. The models have eight control points, except for the one model which has three, and the seven model which has five. To generate an ideal example of a digit the control points are positioned at their "home" locations. Deformed characters are produced by perturbing the con(cid:173) trol points away from their home locations. The home locations and covariance matrix for each model were adapted in order to improve the performance.
The deformation energy only penalizes shape deformations. Affine transformations, i.e., translation, rotation, dilation, elongation, and shear, do not change the under(cid:173) lying shape of an object so we want the deformation energy to be invariant under them . We achieve this by giving each model its own "object-based frame" and computing the deformation energy relative to this frame.
lThis framework has been used by many authors, e.g. Grenander et al (1991) . 2The Gaussian approximation has been popularized in the neural net community by