{"title": "Globally Trained Handwritten Word Recognizer using Spatial Representation, Convolutional Neural Networks, and Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 937, "page_last": 944, "abstract": null, "full_text": "Globally Trained Handwritten Word Recognizer using Spatial Representation, Convolutional Neural Networks and Hidden Markov Models

Yoshua Bengio*
Dept. Informatique et Recherche Operationnelle
Universite de Montreal
Montreal, Qc H3C-3J7

Yann Le Cun
AT&T Bell Labs
Holmdel NJ 07733

Donnie Henderson
AT&T Bell Labs
Holmdel NJ 07733

*also, AT&T Bell Labs, Holmdel NJ 07733

Abstract

We introduce a new approach for on-line recognition of handwritten words written in unconstrained mixed style. The preprocessor performs a word-level normalization by fitting a model of the word structure using the EM algorithm. Words are then coded into low-resolution \"annotated images\" where each pixel contains information about trajectory direction and curvature. The recognizer is a convolution network which can be spatially replicated. From the network output, a hidden Markov model produces word scores. The entire system is globally trained to minimize word-level errors.

1 Introduction

Natural handwriting is often a mixture of different \"styles\": lower case printed, upper case, and cursive. A reliable recognizer for such handwriting would greatly improve interaction with pen-based devices, but its implementation presents new technical challenges. Characters taken in isolation can be very ambiguous, but considerable information is available from the context of the whole word.
We propose a word recognition system for pen-based devices based on four main modules: a preprocessor that normalizes a word, or word group, by fitting a geometrical model to the word structure using the EM algorithm; a module that produces an \"annotated image\" from the normalized pen trajectory; a replicated convolutional neural network that spots and recognizes characters; and a Hidden Markov Model (HMM) that interprets the network's output by taking word-level constraints into account. The network and the HMM are jointly trained to minimize an error measure defined at the word level.

Many on-line handwriting recognizers exploit the sequential nature of pen trajectories by representing the input in the time domain. While these representations are compact and computationally advantageous, they tend to be sensitive to stroke order, writing speed, and other irrelevant parameters. In addition, global geometric features, such as whether a stroke crosses another stroke drawn at a different time, are not readily available in temporal representations. To avoid this problem we designed a representation, called AMAP, that preserves the pictorial nature of the handwriting.

In addition to recognizing characters, the system must also correctly segment the characters within the words. One approach, which we call INSEG, is to recognize a large number of heuristically segmented candidate characters and combine them optimally with a postprocessor (Burges et al 92, Schenkel et al 93). Another approach, which we call OUTSEG, is to delay all segmentation decisions until after the recognition, as is often done in speech recognition. An OUTSEG recognizer must accept entire words as input and produce a sequence of scores for each character at each location on the input.
Since the word normalization cannot be done perfectly, the recognizer must be robust with respect to relatively large distortions, size variations, and translations. An elastic word model (e.g., an HMM) can extract word candidates from the network output. The HMM models the long-range sequential structure while the neural network spots and classifies characters, using local spatial structure.

2 Word Normalization

Input normalization reduces intra-character variability, simplifying character recognition. This is particularly important when recognizing entire words. We propose a new word normalization scheme, based on fitting a geometrical model of the word structure. Our model has four \"flexible\" lines representing respectively the ascenders line, the core line, the base line and the descenders line (see Figure 1). Points on the lines are parameterized as follows:

y = f_k(x) = k(x - x_0)^2 + s(x - x_0) + y_0k    (1)

where k controls curvature, s is the skew, and (x_0, y_0) is a translation vector. The parameters k, s, and x_0 are shared among all four curves, whereas each curve has its own vertical translation parameter y_0k. First the set of local maxima U and minima L of the vertical displacement are found. x_0 is determined by taking the average abscissa of the extrema points. The lines of the model are then fitted to the extrema: the upper two lines to the maxima, and the lower two to the minima. The fit is performed using a probabilistic model for the extrema points given the lines. The idea is to find the line parameters θ* that maximize the probability of

Figure 1: Word Normalization Model: Ascenders and core curves fit y-maxima whereas descenders and baseline curves fit y-minima.
There are 6 parameters: a (ascenders curve height relative to baseline), b (baseline absolute vertical position), c (core line position), d (descenders curve position), k (curvature), s (angle).

generating the observed points:

θ* = argmax_θ [log P(X | θ) + log P(θ)]    (2)

The above conditional distribution is chosen to be a mixture of Gaussians (one per curve) whose means are the y-positions obtained from the actual x-positions through equation 1:

log P(x_i, y_i | θ) = log Σ_{k=0..3} w_k N(y_i; f_k(x_i), σ_y)    (3)

where N(x; μ, σ) is a univariate Normal distribution of mean μ and standard deviation σ. The w_k are the mixture parameters, some of which are set to 0 in order to constrain the upper (lower) points to be fitted to the upper (lower) curves. They are computed a priori using measured frequencies of associations of extrema to curves on a large set of words. The priors P(θ) on the parameters are required to prevent the collapse of the curves. They can be used to incorporate a priori information about the word geometry, such as the expected position of the baseline, or the height of the word. These priors for each parameter are chosen to be independent normal distributions whose standard deviations control the strength of the prior. The variables that associate each point with one of the curves are taken as hidden variables of the EM algorithm. One can thus derive an auxiliary function which can be analytically (and cheaply) solved for the 6 free parameters θ. Convergence of the EM algorithm was typically obtained within 2 to 4 iterations (of maximization of the auxiliary function).

3 AMAP

The recognition of handwritten characters from a pen trajectory on a digitizing surface is often done in the time domain. Trajectories are normalized, and local geometrical or dynamical features are sometimes extracted.
The recognition is performed using curve matching (Tappert 90), or other classification techniques such as Neural Networks (Guyon et al 91). While, as stated earlier, these representations have several advantages, their dependence on stroke ordering and individual writing styles makes them difficult to use in high accuracy, writer independent systems that integrate the segmentation with the recognition.

Since the intent of the writer is to produce a legible image, it seems natural to preserve as much of the pictorial nature of the signal as possible, while at the same time exploiting the sequential information in the trajectory. We propose a representation scheme, called AMAP, where pen trajectories are represented by low-resolution images in which each picture element contains information about the local properties of the trajectory. More generally, an AMAP can be viewed as a function in a multidimensional space where each dimension is associated with a local property of the trajectory, say the direction of motion θ, the X position, and the Y position of the pen. The value of the function at a particular location (θ, X, Y) in the space represents a smooth version of the \"density\" of features in the trajectory that have values (θ, X, Y) (in the spirit of the generalized Hough transform). An AMAP is a multidimensional array (say 4x10x10) obtained by discretizing the feature density space into \"boxes\". Each array element is assigned a value equal to the integral of the feature density function over the corresponding box. In practice, an AMAP is computed as follows. At each sample on the trajectory, one computes the position of the pen (X, Y) and the orientation of the motion θ (and possibly other features, such as the local curvature c).
Each element in the AMAP is then incremented by the amount of the integral over the corresponding box of a predetermined point-spread function centered on the coordinates of the feature vector. The use of a smooth point-spread function (say a Gaussian) ensures that smooth deformations of the trajectory will correspond to smooth transformations of the AMAP. An AMAP can be viewed as an \"annotated image\" in which each pixel is a feature vector.

A particularly useful feature of the AMAP representation is that it makes very few assumptions about the nature of the input trajectory. It does not depend on stroke ordering or writing speed, and it can be used with all types of handwriting (capital, lower case, cursive, punctuation, symbols). Unlike many other representations (such as global features), AMAPs can be computed for complete words without requiring segmentation.

4 Convolutional Neural Networks

Image-like representations such as AMAPs are particularly well suited for use in combination with Multi-Layer Convolutional Neural Networks (MLCNN) (Le Cun 89, Le Cun et al 90). MLCNNs are feed-forward neural networks whose architectures are tailored for minimizing the sensitivity to translations, rotations, or distortions of the input image. They are trained with a variation of the Back-Propagation algorithm (Rumelhart et al 86, Le Cun 86).

The units in MLCNNs are only connected to a local neighborhood in the previous layer. Each unit can be seen as a local feature detector whose function is determined by the learning procedure. Insensitivity to local transformations is built into the network architecture by constraining sets of units located at different places to use identical weight vectors, thereby forcing them to detect the same feature on different parts of the input. The outputs of the units at identical locations in different feature maps can be collectively thought of as a local feature vector.
Features of increasing complexity and globality are extracted by the neurons in the successive layers.

This weight-sharing technique has two interesting side effects. First, the number of free parameters in the system is greatly reduced since a large number of units share the same weights. Classically, MLCNNs are shown a single character at the input, and have a single set of outputs. However, an essential feature of MLCNNs is that they can be scanned (replicated) over large input fields containing multiple unsegmented characters (whole words) very economically by simply performing the convolutions on larger inputs. Instead of producing a single output vector, the replicated network produces a series of output vectors, which detect and recognize characters at different (and overlapping) locations on the input. These multiple-input, multiple-output MLCNNs are called Space Displacement Neural Networks (SDNN) (Matan et al 92).

One of the best networks we found for character recognition has 5 layers arranged as follows: layer 1: convolution with 8 kernels of size 3x3; layer 2: 2x2 subsampling; layer 3: convolution with 25 kernels of size 5x5; layer 4: convolution with 84 kernels of size 4x4; layer 5: 2x2 subsampling. The subsampling layers are essential to the network's robustness to distortions. The output layer is one (single MLCNN) or a series of (SDNN) 84-dimensional vectors. The target output configuration for each character class was chosen to be a bitmap of the corresponding character in a standard 7x12 (=84) pixel font. Such a code facilitates the correction of confusable characters by the postprocessor.
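The spatial bookkeeping of this 5-layer network can be sketched as follows. This is a minimal sketch, assuming valid convolutions and non-overlapping 2x2 subsampling; the input sizes below are illustrative, not taken from the paper:

```python
# Sketch of spatial-dimension propagation through the 5-layer network
# described above (assumptions: valid convolutions, non-overlapping
# 2x2 subsampling; input sizes are illustrative only).

def sdnn_output_shape(h, w):
    # layer 1: convolution with 8 kernels of size 3x3
    h, w = h - 2, w - 2
    # layer 2: 2x2 subsampling
    h, w = h // 2, w // 2
    # layer 3: convolution with 25 kernels of size 5x5
    h, w = h - 4, w - 4
    # layer 4: convolution with 84 kernels of size 4x4
    h, w = h - 3, w - 3
    # layer 5: 2x2 subsampling
    h, w = h // 2, w // 2
    return h, w  # each remaining position holds one 84-dimensional vector

# A 20x20 input yields a single 84-dimensional output vector ...
print(sdnn_output_shape(20, 20))  # -> (1, 1)
# ... while a wider, word-sized input reuses the same weights and yields
# a sequence of output vectors (the SDNN replication).
print(sdnn_output_shape(20, 44))  # -> (1, 7)
```

Replicating the network over a wider input only costs the extra convolutions; no new parameters are introduced, which is what makes the SDNN scan economical.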
\n\n5 Post-Processing \n\nThe convolutional neural network can be used to give scores associated to characters \nwhen the network (or a piece of it corresponding to a single character output) has \nan input field, called a segment, that covers a connected subset of the whole word \ninput. A segmentation is a sequence of such segments that covers the whole word \ninput. Because there are in general many possible segmentations, sophisticated \ntools such as hidden Markov models and dynamic programming are used to search \nfor the best segmentation. \n\nIn this paper, we consider two approaches to the segmentation problem called IN(cid:173)\nSEG (for input segmentation) and OUTSEG (for output segmentation). The post(cid:173)\nprocessor can be generally decomposed into two levels: 1) character level scores and \nconstraints obtained from the observations, 2) word level constraints (grammar, \ndictionary). The INSEG and OUTSEG systems share the second level. \n\nIn an INSEG system, the network is applied to a large number of heuristically \nsegmented candidate characters. A cutter generates candidate cuts, which can po(cid:173)\ntentially represent the boundary between two character segments. It also generates \ndefinite cuts, which we assume that no segment can cross. Using these, a number \nof candidate segments are constructed and the network is applied to each of them \nseparately. Finally, for each high enough character score in each of the segment, a \ncharacter hypothesis is generated, corresponding to a node in an observation graph . \nThe connectivity and transition probabilities on the arcs of the observation graph \nrepresent segmentation and geometrical constraints (e.g., segments must not over(cid:173)\nlap and must cover the whole word, some transitions between characters are more \nor less likely given the geometrical relations between their images). 
\n\nIn an OUTSEG system, all segmentation decisions are delayed until after the recog-\n\n\f942 \n\nBengio, Le Cun, and Henderson \n\nnition, as is often done in speech recognition. The AMAP of the entire word is \nshown to an SDNN, which produces a sequence of output vectors equivalent to (but \nobtained much more cheaply than) scanning the single-character network over all \npossible pixel locations on the input. The Euclidean distances between each output \nvector and the targets are interpreted as log-likelihoods of the output given a class. \nTo construct an observation graph, we use a set of character models (HMMs) . Each \ncharacter HMM models the sequence of network outputs observed for that charac(cid:173)\nter . We used three-state HMMs for each character, with a left and right state to \nmodel transitions and a center state for the character itself. The observation graph \nis obtained by connecting these character models , allowing any character to follow \nany character. \n\nOn top of the constraints given in the observation graph , additional constraints that \nare independent of the observations are given by what we call a gram mar graph, \nwhich can embody lexical constraints. These constraints can be given in the form \nof a dictionary or of a character-level grammar (with transition probabilities), such \nas a trigram (in which we use the probability of observing a character in the context \nof the two previous ones). The recognition finds the best path in the observation \ngraph that is compatible with the grammar graph. The INSEG and OUTSEG \narchitecture are depicted in Figure 2. \n\nOUTSEG ARCHITECTURE \nFOR WORD RECOGNITION \n\nraw word \n\nword \n\nnormalization \n\nnormalized \nword ~--~'''''---\"'''''1i \n\nAMAP \n\ncomputation \n\nAMAP \n\nSDNN \n\ns~f \nMi.pf \n::::. ::.:. :~:~: t \n':':\":: :\",:, ~~'::: .. , .. ~: ': \n~~~} \nt} \n\nINSEG ARCHITECTURE \n\nFOR WORD RECOGNITION \n\nraw w0i\"'r_d_\"\"\",,, ___ ~ Sec; p t \n\n~~~r~~~ ~,= .. 
. \"\"\"'_-_<5~t'>r,ff: \n\n. Cut hypotheses \nI \n\ngeneration \n\nsegme~n~\"\"\"\"_\",,.. __ \"\"\"\"'1 \ngraph \n\n\\r\"!:~'1\":\"IWPII\"\"\"'~W \n\nAMAP \n\ngraph \nofchar~a~c~e~r--'~----~ \ncandi-'r-__ f-_ _ \n\"\"\"'II \ndates \n\nCharacter \n\nHMMs \n\nLexical \n\nconstraints \n\ngraph ~~_\",-__ -d! S .... c ..... r ...... i .... p .... t ~~~arliooa-c-te-r----~ \nof character \ncdaantedsi \n\ns .... e ..... n ..... e.j ... o.T \n5 ...... a ... i ... u ...... p .... .f dates\" \" - -+ --\"\"\"\",!! \n\nh \ncandl \n\nConvolutional \nNeural Network \n\nLexical \n\nconstraints \n\nword \n\n\"Script\" \n\nwo r\"\"'d--\"\"\"\"\",---J \" Script \" \n\nFigure 2: INSEG and OUTSEG architectures for word recognition. \n\nA crucial contribution of our system is the joint training of the neural network and \nthe post-processor with respect to a single criterion that approximates word-level \nerrors. We used the following discriminant criterion: minimize the total cost (sum \nof negative log-likelihoods) along the \"correct\" paths (the ones that yield the correct \ninterpretations) , while minimizing the costs of all the paths (correct or not). The \ndiscriminant nature of this criterion can be shown with the following example. If \n\n\fGlobally Trained Handwritten Word Recognizer \n\n943 \n\nthe cost of a path associated to the correct interpretation is much smaller than all \nother paths, then the criterion is very close to 0 and no gradient is back-propagated. \nOn the other hand , if the lowest cost path yields an incorrect interpretation but dif(cid:173)\nfers from a path of correct interpretation on a sub-path, then very strong gradients \nwill be propagated along that sub-path , whereas the other parts of the sequence \nwill generate almost no gradient. \\Vithin a probabilistic framework, this criterion \ncorresponds to the maximizing the mutual information (MMI) between the obser(cid:173)\nvations and the correct interpretation. 
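Numerically, the criterion just described can be sketched as a difference of path free energies. This is a hedged sketch, not the exact implementation; the path costs below are hypothetical totals of negative log-likelihoods:

```python
import math

# Sketch of the discriminant word-level criterion: negative log of the
# posterior probability mass assigned to the correct-interpretation
# paths (an MMI-style objective). Path costs are hypothetical.

def free_energy(costs):
    # -log sum_i exp(-cost_i), computed stably
    m = min(costs)
    return m - math.log(sum(math.exp(m - c) for c in costs))

def word_level_criterion(all_path_costs, correct_path_costs):
    return free_energy(correct_path_costs) - free_energy(all_path_costs)

# When the correct path is far cheaper than every competitor, the
# criterion is close to 0 and little gradient flows back ...
print(word_level_criterion([1.0, 20.0, 25.0], [1.0]))
# ... whereas a competitive incorrect path makes the criterion (and the
# gradients along the differing sub-path) large.
print(word_level_criterion([1.0, 1.1, 25.0], [1.0]))
```

Minimizing this quantity simultaneously lowers the cost of the correct paths and raises the (soft-minimum) cost of all competing paths, which is the behaviour described in the example above.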
During global training, it is optimized using (enhanced) stochastic gradient descent with respect to all the parameters in the system, most notably the network weights. Experiments described in the next section have shown important reductions in error rates when training with this word-level criterion instead of just training the network separately for each character. Similar combinations of neural networks with HMMs or dynamic programming have been proposed in the past for speech recognition problems (Bengio et al 92).

6 Experimental Results

In a first set of experiments, we evaluated the generalization ability of the neural network classifier coupled with the word normalization preprocessing and AMAP input representation. All results are in writer independent mode (different writers in training and testing). Tests on a database of isolated characters were performed separately on four types of characters: upper case (2.99% error on 9122 patterns), lower case (4.15% error on 8201 patterns), digits (1.4% error on 2938 patterns), and punctuation (4.3% error on 881 patterns). Experiments were performed with the network architecture described above.

The second and third sets of experiments concerned the recognition of lower case words (writer independent). The tests were performed on a database of 881 words. First we evaluated the improvements brought by the word normalization to the INSEG system. (For the OUTSEG system we have to use a word normalization, since the network sees a whole word at a time.) With the INSEG system, and before doing any word-level training, we obtained without word normalization 7.3% and 3.5% word and character errors (adding insertions, deletions and substitutions) when the search was constrained within a 25461-word dictionary.
When using the word normalization preprocessing instead of a character-level normalization, error rates dropped to 4.6% and 2.0% for word and character errors respectively, i.e., a relative drop of 37% and 43% in word and character error respectively.

In the third set of experiments, we measured the improvements obtained with the joint training of the neural network and the post-processor with the word-level criterion, in comparison to training based only on the errors performed at the character level. Training was performed with a database of 3500 lower case words. For the OUTSEG system, without any dictionary constraints, the error rates dropped from 38% and 12.4% word and character error to 26% and 8.2% respectively after word-level training, i.e., a relative drop of 32% and 34%. For the INSEG system and a slightly improved architecture, without any dictionary constraints, the error rates dropped from 22.5% and 8.5% word and character error to 17% and 6.3% respectively, i.e., a relative drop of 24.4% and 25.6%. With a 25461-word dictionary, errors dropped from 4.6% and 2.0% word and character errors to 3.2% and 1.4% respectively after word-level training, i.e., a relative drop of 30.4% and 30.0%. Finally, some further improvements can be obtained by drastically reducing the size of the dictionary to 350 words, yielding 1.6% and 0.94% word and character errors.

7 Conclusion

We have demonstrated a new approach to on-line handwritten word recognition that uses word or sentence-level preprocessing and normalization, image-like representations, convolutional neural networks, word models, and global training using a highly discriminant word-level criterion. Excellent accuracy on various writer independent tasks was obtained with this combination.

References

Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. 1992.
Global Optimization of a Neural Network-Hidden Markov Model Hybrid. IEEE Transactions on Neural Networks, 3(2):252-259.

Burges, C., Matan, O., Le Cun, Y., Denker, J., Jackel, L., Stenard, C., Nohl, C., and Ben, J. 1992. Shortest Path Segmentation: A Method for Training a Neural Network to Recognize Character Strings. Proc. IJCNN'92 (Baltimore), vol. 3, pp. 165-172.

Guyon, I., Albrecht, P., Le Cun, Y., Denker, J. S., and Weissman, H. 1991. Design of a neural network character recognizer for a touch terminal. Pattern Recognition, 24(2):105-119.

Le Cun, Y. 1986. Learning Processes in an Asymmetric Threshold Network. In Bienenstock, E., Fogelman-Soulie, F., and Weisbuch, G., editors, Disordered systems and biological organization, pages 233-240, Les Houches, France. Springer-Verlag.

Le Cun, Y. 1989. Generalization and Network Design Strategies. In Pfeifer, R., Schreter, Z., Fogelman, F., and Steels, L., editors, Connectionism in Perspective, Zurich, Switzerland. Elsevier. An extended version was published as a technical report of the University of Toronto.

Le Cun, Y., Matan, O., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., Jackel, L. D., and Baird, H. S. 1990. Handwritten Zip Code Recognition with Multilayer Networks. In IAPR, editor, Proc. of the International Conference on Pattern Recognition, Atlantic City. IEEE.

Matan, O., Burges, C. J. C., LeCun, Y., and Denker, J. S. 1992. Multi-Digit Recognition Using a Space Displacement Neural Network. In Moody, J. M., Hanson, S. J., and Lippman, R. P., editors, Neural Information Processing Systems, volume 4. Morgan Kaufmann Publishers, San Mateo, CA.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel distributed processing: Explorations in the microstructure of cognition, volume I, pages 318-362. Bradford Books, Cambridge, MA.
\n\nSchenkel, M., Guyon, I., Weissman, H., and Nohl, C. 1993. TDNN Solutions for \nRecognizing On-Line Natural Handwriting. \nIn Advances in Neural Information \nProcessing Systems 5. Morgan Kaufman. \n\nTappert, C., Suen, C., and Wakahara, T. 1990. The state of the art in on-line \nhandwriting recognition. IEEE Trans. PAM!, 12(8). \n\n\f", "award": [], "sourceid": 819, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Yann", "family_name": "LeCun", "institution": null}, {"given_name": "Donnie", "family_name": "Henderson", "institution": null}]}