Vinod Nair, Geoffrey E. Hinton
We describe a generative model for handwritten digits that uses two pairs of opposing springs whose stiffnesses are controlled by a motor program. We show how neural networks can be trained to infer the motor programs required to accurately reconstruct the MNIST digits. The inferred motor programs can be used directly for digit classiﬁcation, but they can also be used in other ways. By adding noise to the motor program inferred from an MNIST image we can generate a large set of very different images of the same class, thus enlarging the training set available to other methods. We can also use the motor programs as additional, highly informative outputs which reduce overﬁtting when training a feed-forward classiﬁer.
1 Overview The idea that patterns can be recognized by ﬁguring out how they were generated has been around for at least half a century [1, 2] and one of the ﬁrst proposed applications was the recognition of handwriting using a generative model that involved pairs of opposing springs [3, 4]. The “analysis-by-synthesis” approach is attractive because the true generative model should provide the most natural way to characterize a class of patterns. The handwritten 2’s in ﬁgure 1, for example, are very variable when viewed as pixels but they have very similar motor programs. Despite its obvious merits, analysis-by-synthesis has had few successes, partly because it is computationally expensive to invert non-linear generative models and partly because the underlying parameters of the generative model are unknown for most large data sets. For example, the only source of information about how the MNIST digits were drawn is the images themselves.
We describe a simple generative model in which a pen is controlled by two pairs of op- posing springs whose stiffnesses are speciﬁed by a motor program. If the sequence of stiffnesses is speciﬁed correctly, the model can produce images which look very like the MNIST digits. Using a separate network for each digit class, we show that backpropaga- tion can be used to learn a “recognition” network that maps images to the motor programs required to produce them. An interesting aspect of this learning is that the network creates its own training data, so it does not require the training images to be labelled with motor programs. Each recognition network starts with a single example of a motor program and grows an “island of competence” around this example, progressively extending the region over which it can map small changes in the image to the corresponding small changes in the motor program (see ﬁgure 2).
Figure 1: An MNIST image of a 2 and the additional images that can be generated by infer- ring the motor program and then adding random noise to it. The pixels are very different, but they are all clearly twos.
Fairly good digit recognition can be achieved by using the 10 recognition networks to ﬁnd 10 motor programs for a test image and then scoring each motor program by its squared error in reconstructing the image. The 10 scores are then fed into a softmax classiﬁer. Recognition can be improved by using PCA to model the distribution of motor trajectories for each class and using the distance of a motor trajectory from the relevant PCA hyperplane as an additional score.
Each recognition network is solving a difﬁcult global search problem in which the correct motor program must be found by a single, “open-loop” pass through the network. More accurate recognition can be achieved by using this open-loop global search to initialize an iterative, closed-loop local search which uses the error in the reconstructed image to re- vise the motor program. This requires reconstruction errors in pixel space to be mapped to corrections in the space of spring stiffnesses. We cannot backpropagate errors through the generative model because it is just a hand-coded computer program. So we learn “genera- tive” networks, one per digit class, that emulate the generator. After learning, backpropa- gation through these generative networks is used to convert pixel reconstruction errors into stiffness corrections.
Our ﬁnal system gives 1.82% error on the MNIST test set which is similar to the 1.7% achieved by a very different generative approach  but worse than the 1.53% produced by the best backpropagation networks or the 1.4% produced by support vector machines . It is much worse than the 0.4% produced by convolutional neural networks that use cleverly enhanced training sets . Recognition of test images is quite slow because it uses ten different recognition networks followed by iterative local search. There is, however, a much more efﬁcient way to make use of our ability to extract motor programs. They can be treated as additional output labels when using backpropagation to train a single, multi- layer, discriminative neural network. These additional labels act as a very informative regularizer that reduces the error rate from 1.53% to 1.27% in a network with two hidden layers of 500 units each. This is a new method of improving performance that can be used in conjunction with other tricks such as preprocessing the images, enhancing the training set or using convolutional neural nets [8, 7].
2 A simple generative model for drawing digits
The generative model uses two pairs of opposing springs at right angles. One end of each spring is attached to a frictionless horizontal or vertical rail that is 39 pixels from the center of the image. The other end is attached to a “pen” that has signiﬁcant mass. The springs themselves are weightless and have zero rest length. The pen starts at the equilibrium position deﬁned by the initial stiffnesses of the four springs. It then follows a trajectory that is determined by the stiffness of each spring at each of the 16 subsequent time steps in the motor program. The mass is large compared with the rate at which the stiffnesses change, so the system is typically far from equilibrium as it follows the smooth trajectory. On each time step, the momentum is multiplied by 0.9 to simulate viscosity. A coarse-grain trajectory is computed by using one step of forward integration for each time step in the motor program, so it contains 17 points. The code is at www.cs.toronto.edu/∼ hinton/code.
Figure 2: The training data for each class-speciﬁc recognition network is produced by adding noise to motor programs that are inferred from MNIST images using the current parameters of the recognition network. To initiate this process, the biases of the output units are set by hand so that they represent a prototypical motor program for the class.
Given a coarse-grain trajectory, we need a way of assigning an intensity to each pixel. We tried various methods until we hand-evolved one that was able to reproduce the MNIST im- ages fairly accurately, but we suspect that many other methods would be just as good. For each point on the coarse trajectory, we share two units of ink between the the four closest pixels using bilinear interpolation. We also use linear interpolation to add three ﬁne-grain trajectory points between every pair of coarse-grain points. These ﬁne-grain points also contribute ink to the pixels using bilinear interpolation, but the amount of ink they con- tribute is zero if they are less than one pixel apart and rises linearly to the same amount as the coarse-grain points if they are more than two pixels apart. This generates a thin skeleton with a fairly uniform ink density. To ﬂesh-out the skeleton, we use two “ink parameters”, a, b, to specify a 3 × 3 kernel of the form b(1 + a)[ a 12 ] which is convolved with the image four times. Finally, the pixel intensities are clipped to lie in the interval [0,1]. The matlab code is at www.cs.toronto.edu/∼ hinton/code. The values of 2a and b/1.5 are additional, logistic outputs of the recognition networks1.